Window Function Query Optimization
On a recent project, I needed to build features to contrast the average handling time of a given employee with the overall average handling time of cases within a queue.
After a bit of thought and research, I came up with the following query:
SELECT
t.*,
AVG(handle_time_secs) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(DATE(assigned_at)) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS queue_aht_30days,
AVG(handle_time_secs) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(DATE(assigned_at)) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS emp_queue_aht_30days
FROM project.dataset.table t
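To make the window semantics concrete, here is a small Python sketch (with made-up toy rows, not project data) of what `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW` over `UNIX_DATE(DATE(assigned_at))` computes for the queue-level feature: the average over all cases in the same queue whose assignment date falls in the 30-day window ending at the current row's date.

```python
from datetime import date, timedelta

# Hypothetical toy rows: (queue_name, emp_id, assigned_at_date, handle_time_secs).
rows = [
    ("billing", 1, date(2023, 1, 1), 100),
    ("billing", 2, date(2023, 1, 15), 200),
    ("billing", 1, date(2023, 1, 31), 300),  # more than 29 days after Jan 1
]

def queue_aht_30days(rows, queue, as_of):
    """Average handle time for a queue over the 30-day window ending at as_of,
    mirroring RANGE BETWEEN 29 PRECEDING AND CURRENT ROW on UNIX_DATE."""
    window = [ht for (q, _, d, ht) in rows
              if q == queue and as_of - timedelta(days=29) <= d <= as_of]
    return sum(window) / len(window)

# The Jan 31 window spans Jan 2 through Jan 31, so the Jan 1 case drops out:
print(queue_aht_30days(rows, "billing", date(2023, 1, 31)))  # 250.0
```

The employee-level feature works the same way, with `emp_id` added to the partition key.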
However, when I tried to run it in BigQuery, I got the following exception:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 158% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%
My manager suggested that I reduce the number of rows that the window function needs to iterate over by first aggregating the data by day.
The following query first extracts the date from the timestamp at which the case was assigned to the employee, then computes the total and average handle time of cases by day.
WITH features_daily_emp_queue AS (
SELECT
queue_name,
emp_id,
DATE(assigned_at) AS assigned_at_date,
AVG(handle_time_secs) AS emp_queue_1day_aht,
SUM(handle_time_secs) AS emp_queue_1day_total_ht,
COUNT(1) AS emp_queue_1day_cases
FROM project.dataset.table
GROUP BY
queue_name,
emp_id,
assigned_at_date
)
Then, rather than applying the AVG function to the raw rows, we can compute the 30-day average by summing the daily total handle times and dividing by the sum of the daily case counts.
SELECT
(SUM(emp_queue_1day_total_ht) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) / (SUM(emp_queue_1day_cases) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS queue_aht_30days
FROM features_daily_emp_queue
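A quick illustration (with toy numbers, not project data) of why we divide summed totals by summed counts rather than averaging the daily averages: averaging daily averages gives every day equal weight regardless of volume, while the ratio of sums reproduces the case-level average exactly.

```python
# Hypothetical daily pre-aggregates for one queue: (total_ht, cases).
# Day 1 has 10 cases averaging 100s; day 2 has a single 300s case.
daily = [(1000, 10), (300, 1)]

# Weighted average (what the query computes): identical to AVG over raw rows.
weighted = sum(t for t, _ in daily) / sum(c for _, c in daily)  # 1300 / 11

# Naive average of daily averages over-weights the low-volume day.
naive = sum(t / c for t, c in daily) / len(daily)  # (100 + 300) / 2

print(weighted)  # about 118.2
print(naive)     # 200.0
```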
We can repeat the process, adding the employee ID to the partition.
SELECT
(SUM(emp_queue_1day_total_ht) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) /
SUM(emp_queue_1day_cases) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS emp_queue_aht_30days
FROM features_daily_emp_queue
Finally, we put it all together and join the features back to the original table using the queue name, the employee ID, and the assigned date.
WITH features_daily_emp_queue AS (
SELECT
queue_name,
emp_id,
DATE(assigned_at) AS assigned_at_date,
AVG(handle_time_secs) AS emp_queue_1day_aht,
SUM(handle_time_secs) AS emp_queue_1day_total_ht,
COUNT(1) AS emp_queue_1day_cases
FROM project.dataset.table
GROUP BY
queue_name,
emp_id,
assigned_at_date
),
features_daily_emp_queue_window AS (
SELECT
queue_name,
emp_id,
assigned_at_date,
(SUM(emp_queue_1day_total_ht) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) /
SUM(emp_queue_1day_cases) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS queue_aht_30days,
(SUM(emp_queue_1day_total_ht) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) /
SUM(emp_queue_1day_cases) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS emp_queue_aht_30days
FROM features_daily_emp_queue
)
SELECT
t.*,
queue_aht_30days,
emp_queue_aht_30days
FROM project.dataset.table t
LEFT JOIN features_daily_emp_queue_window f
ON t.queue_name = f.queue_name
AND t.emp_id = f.emp_id
AND DATE(t.assigned_at) = f.assigned_at_date
Unlike the initial query, the preceding query ran to completion.