Window Function Query Optimization

Window Function Query Optimization

On a recent project, I needed to build features to contrast the average handling time of a given employee with the overall average handling time of cases within a queue.

After a bit thought and research, I came up with the following query:

SELECT
	t.*,
	AVG(handle_time_secs) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(DATE(assigned_at)) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS queue_aht_30days,
	AVG(handle_time_secs) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(DATE(assigned_at)) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS emp_queue_aht_30days
FROM project.dataset.table t        

However, when I tried to run it in BigQuery, I got the following exception:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 158% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%        

My manager suggested that I reduce the number of rows that the window function needs to iterate over by first aggregating the data by day.

The following query first obtains the date of the timestamp when the case was first assigned to the employee and then computes the total and average handle time of cases by day.

WITH features_daily_emp_queue AS 
	SELECT
	   queue_name,
	   emp_id,
	   DATE(assigned_at) AS assigned_at_date,
	   avg(handle_time_secs) emp_queue_1day_aht,
	   sum(handle_time_secs) emp_queue_1day_total_ht,
	   COUNT(1) emp_queue_1day_cases
	FROM project.dataset.table
	GROUP BY
	   queue_name,
	   emp_id,
	   assigned_at_date
)(        

Then, rather than using the AVG function, we can compute the average by summing the total handle time across each day and dividing by the sum of the number of cases assigned each day.

SELEC
	(SUM(emp_queue_1day_total_ht) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) / (SUM(emp_queue_1day_cases) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS queue_aht_30days
FROM features_daily_emp_queueT        

We can repeat the process; adding the employee ID to the partition.

SELEC
   (SUM(emp_queue_1day_total_ht) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) /
              (SUM(emp_queue_1day_touches) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS emp_queue_aht_30days
FROM features_daily_emp_queueT        

Finally, we put it all together and join the features back to the original table using the employee ID and the assigned date.

WITH features_daily_emp_queue AS 
       SELECT
           queue_name,
           emp_id,
           DATE(assigned_at) AS assigned_at_date,
           avg(handle_time_secs) emp_queue_1day_aht,
           sum(handle_time_secs) emp_queue_1day_total_ht,
           COUNT(1) emp_queue_1day_cases
       FROM project.dataset.table
       GROUP BY
           queue_name,
           emp_id,
           assigned_at_date
   ),
   features_daily_emp_queue_window AS (
       SELECT
           queue_name,
           emp_id,
           assigned_at_date,
           (SUM(emp_queue_1day_total_ht) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) /
              SUM(emp_queue_1day_cases) OVER (PARTITION BY queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS queue_aht_30days,
               (SUM(emp_queue_1day_total_ht) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) /
              SUM(emp_queue_1day_cases) OVER (PARTITION BY emp_id, queue_name ORDER BY UNIX_DATE(assigned_at_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)) AS emp_queue_aht_30days
   )
SELECT
   t.*,
   queue_aht_30days,
   emp_queue_aht_30days
FROM project.dataset.table t
   LEFT JOIN features_daily_emp_queue_window f
	   ON t.emp_id = f.emp_id
		   AND DATE(t.assigned_at) = f.assigned_at_date(        

Unlike the initial query, the preceding query ran to completion.

#dataengineering #dataengineer #datascientist #softwareengineer #development #code






To view or add a comment, sign in

More articles by Cory Maklin

  • Ownership In A Work Context

    As you start to think about moving up and taking on more responsibility, you'll need to have a solid grasp of the…

  • Commodity Labor

    I'm interested by the fact that you could be a director at one of the largest financial institutions in Canada…

  • Get More Done In Less Time

    Most people approach their investments in time like they approach their financial investments. That is, they focus on…

  • Capitalizing On Your Intellectual Capital

    If you're like the average person, you need to work a day job in order to pay the bills. However, working a day job…

  • Approximate Join On Timestamps

    I was working on a project where we needed join a reporting table containing data pertaining to cases with another…

  • Senior Data Engineer Interview: Late Arriving Data

    Interviewer: Suppose you were responsible for maintaining the data warehouse pipelines. How would you partition the…

  • Repository Design Pattern

    I had worked with frameworks like SpringBoot before, but I had never given too much thought into the reasoning behind…

  • Senior Data Engineer Interview: Right To Be Forgotten In Practice

    Interviewer: Let's say you were responsible for designing the data warehouse and associated pipelines. How would you…

  • Senior Data Engineer Interview: Backfilling Data

    Let's say you were responsible for maintaining the data warehouse pipelines. There was a bug in the code and now you…

  • Top 5 Data Engineer Interview SQL Questions

    As a data engineer, in addition to your standard LeetCode like programming problems, companies will ask you to solve…

Others also viewed

Explore content categories