#39 – FIRST_VALUE() and LAST_VALUE() window functions in Apache Spark SQL
Hi friends!
Say we want to get the first or last value in a group according to some criterion, and then use it further in the analysis. This can be accomplished, for example, with the ROW_NUMBER() window function and a filter condition "= 1".
But it is more concise to use the FIRST_VALUE() or LAST_VALUE() window functions.
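The ROW_NUMBER()-based alternative can be sketched like this. It is a standalone example using Python's built-in sqlite3 module, whose window-function syntax matches Spark SQL for this query; the table name, order IDs, and dates are invented for illustration:

```python
import sqlite3

# In-memory database standing in for the Spark session;
# hypothetical orders spread over two months.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-03"), (2, "2024-01-10"), (3, "2024-01-24"),
        (4, "2024-02-02"), (5, "2024-02-15"), (6, "2024-02-28"),
    ],
)

# Number the rows within each month by order date,
# then keep only the row numbered 1 — the first order of the month.
first_orders = con.execute("""
    WITH ranked AS (
        SELECT order_date,
               strftime('%Y-%m', order_date) AS order_mon,
               ROW_NUMBER() OVER (
                   PARTITION BY strftime('%Y-%m', order_date)
                   ORDER BY order_date
               ) AS rn
        FROM orders
    )
    SELECT order_mon, order_date
    FROM ranked
    WHERE rn = 1
    ORDER BY order_mon
""").fetchall()

print(first_orders)  # [('2024-01', '2024-01-03'), ('2024-02', '2024-02-02')]
```

This works, but it takes an extra CTE and a filter just to isolate the first row, which is the verbosity FIRST_VALUE() avoids.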
Let's see how FIRST_VALUE() works; LAST_VALUE() behaves the same way except for the order of records.
The FIRST_VALUE() window function is used in SQL to retrieve the first value within a specified partition or result set based on ordering criteria. It’s particularly useful for scenarios where you need to extract specific values from sorted rows without needing subqueries or joins.
For example, suppose our task is to calculate the gap in days between each order date and the date of the first order in that month. So we first get the date of the first order in each month, and then calculate the gap for each subsequent order.
Let's generate test orders data with two columns: order_id and order_date. The dates belong to two months.
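The notebook cell itself isn't reproduced here, so as a stand-in, here is a minimal sketch of such test data using Python's built-in sqlite3 module (SQLite is used only so the example runs anywhere; the sample order IDs and dates are made up):

```python
import sqlite3

# In-memory database standing in for the Spark session;
# the orders table has the two columns from the article.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")

# Hypothetical orders spread over two months (January and February 2024).
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-03"), (2, "2024-01-10"), (3, "2024-01-24"),
        (4, "2024-02-02"), (5, "2024-02-15"), (6, "2024-02-28"),
    ],
)

n_orders = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(n_orders)  # 6
```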
Now add a common table expression (CTE) with a calculated "order_mon" column and, using "partition by order_mon", get the date of the first order in each month.
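A sketch of that step, again as a standalone example on the invented data: the SQL inside the string is the same shape as in Spark SQL, except that SQLite's strftime('%Y-%m', ...) stands in for Spark's date_format(..., 'yyyy-MM'):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-03"), (2, "2024-01-10"), (3, "2024-01-24"),
        (4, "2024-02-02"), (5, "2024-02-15"), (6, "2024-02-28"),
    ],
)

rows = con.execute("""
    WITH orders_mon AS (
        SELECT order_id,
               order_date,
               -- Spark SQL equivalent: date_format(order_date, 'yyyy-MM')
               strftime('%Y-%m', order_date) AS order_mon
        FROM orders
    )
    SELECT order_id,
           order_date,
           order_mon,
           FIRST_VALUE(order_date) OVER (
               PARTITION BY order_mon
               ORDER BY order_date
           ) AS first_order_date
    FROM orders_mon
    ORDER BY order_date
""").fetchall()

for row in rows:
    print(row)
# Every January row carries '2024-01-03' as first_order_date,
# every February row carries '2024-02-02'.
```

One caveat worth remembering: with an ORDER BY clause, the default window frame ends at the current row. That is fine for FIRST_VALUE(), but it makes LAST_VALUE() return the current row's value; to get the partition's true last value, extend the frame with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.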
You may have noticed that the code uses first() rather than first_value() – in this context they are synonyms, just as last() and last_value() are.
It's almost done! Now add a calculated column "days_diff" with the gap in days.
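Continuing the same standalone sketch on the invented data, the gap follows from subtracting the first-order date; SQLite's julianday() difference stands in for Spark's datediff(order_date, first_order_date):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-03"), (2, "2024-01-10"), (3, "2024-01-24"),
        (4, "2024-02-02"), (5, "2024-02-15"), (6, "2024-02-28"),
    ],
)

rows = con.execute("""
    WITH orders_mon AS (
        SELECT order_id,
               order_date,
               strftime('%Y-%m', order_date) AS order_mon
        FROM orders
    ),
    with_first AS (
        SELECT *,
               FIRST_VALUE(order_date) OVER (
                   PARTITION BY order_mon
                   ORDER BY order_date
               ) AS first_order_date
        FROM orders_mon
    )
    SELECT order_id,
           order_date,
           first_order_date,
           -- Spark SQL equivalent: datediff(order_date, first_order_date)
           CAST(julianday(order_date) - julianday(first_order_date) AS INTEGER)
               AS days_diff
    FROM with_first
    ORDER BY order_date
""").fetchall()

print([r[3] for r in rows])  # [0, 7, 21, 0, 13, 26]
```

The first order of each month gets days_diff = 0, and every later order gets its distance in days from that month's opening order.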
We can summarize the analysis like this.
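A summary of that kind could look like the following sketch, rolling the per-order gaps up to one row per month; the choice of aggregates (order count and maximum gap) is an assumption, since the original notebook output isn't shown here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-03"), (2, "2024-01-10"), (3, "2024-01-24"),
        (4, "2024-02-02"), (5, "2024-02-15"), (6, "2024-02-28"),
    ],
)

summary = con.execute("""
    WITH orders_mon AS (
        SELECT order_date,
               strftime('%Y-%m', order_date) AS order_mon
        FROM orders
    ),
    with_gaps AS (
        SELECT order_mon,
               CAST(julianday(order_date)
                    - julianday(FIRST_VALUE(order_date) OVER (
                          PARTITION BY order_mon
                          ORDER BY order_date
                      )) AS INTEGER) AS days_diff
        FROM orders_mon
    )
    -- Aggregate the per-order gaps into one row per month.
    SELECT order_mon,
           COUNT(*) AS orders_cnt,
           MAX(days_diff) AS max_gap_days
    FROM with_gaps
    GROUP BY order_mon
    ORDER BY order_mon
""").fetchall()

print(summary)  # [('2024-01', 3, 21), ('2024-02', 3, 26)]
```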
The notebook code is here.
Stay tuned!