The Power of Analytic Functions

The Power of Analytic Functions

As a developer, you need to understand all the tools that SQL provides to write efficient queries, and one of the most powerful is the analytic function.  We can illustrate the power with a common requirement to extract active records from a table that contains both active and historic records. 

It is a trivial task if the table has an active flag column, but is much more difficult otherwise. Absent a flag; you are forced to group by employee and search for the record with the greatest effective date, sequence number, or some other ordered column. This is usually accomplished with a coordinated sub-query:

SELECT empid, first_name, last_name, num_dependants

  FROM emp A

 WHERE A.effdt = (SELECT MAX(B.effdt)

                    FROM emp B

                   WHERE A.empid = B.empid);

For each record in the table, the sub-query finds all the other rows for that same employee, orders them, and compares the dates to see if the record you have is the oldest one.

If you have five records for a given employee, you are reading those five records once in the main query and five times again in the subquery for each of the five original rows. That's 5 + (5x5) or 30 reads. While this may not seem like a big deal, it really does have a large impact as the group size grows.

Work by Group Size


Since the work required is proportional to the square of the group size, your query could seemingly turn bad overnight as your average group size grows.

While this is a natural way to code the problem for a lot of developers, it is not at all how you would approach it if you had to do it by hand. It is a good exercise to ask yourself how you do the job manually as that is likely to be the most efficient method.

So how would you find the most recent record for each employee? The same way you would find the highest ranked card for each of the four suits. If you were handed a partial deck of cards and told to return the highest card of each suit you would:

  1. Group the cards by suit
  2. Order the cards in each group
  3. Take one card off the end of each group

This way, you handle each card only once.

The RANK analytic function allows you to just that.

The Oracle syntax is:

RANK () OVER (PARTITION BY columns ORDER BY columns)

The RANK function creates the pseudo column rank_order. The PARTITION BY clause is what we use to specify our grouping for our record set and the ORDER BY clause controls the sorting within each group.

 This is how our example query would look using RANK:

SELECT empid, first_name, last_name, num_dependants

  FROM (SELECT RANK()OVER (PARTITION BY empid

                               ORDER BY effdt DESC) AS rank_order,

               Empid, first_name, last_name, num_dependants

          FROM emp

       )

 WHERE rank_order = 1);

We wrap the inline view with a second query which performs filtering operation to retain only the first row in each partition set. That is function of the predicate "rank_order = 1".

And now here is the most complex case you will likely run across. It comes from the PeopleSoft application which utilizes not one, but two separate columns to order its records: effective sequence within effective date.

SELECT

       c.EMPLID,

       a.FIRST_NAME

  FROM SYSADM.PS_Personal_Data A,

       SYSADM.PS_Job C

 WHERE A.Emplid = C.Emplid

   AND C.Effdt = (SELECT MAX(C1.Effdt)

                    FROM SYSADM.PS_Job C1

                   WHERE C.Emplid = C1.Emplid

                     AND C.Empl_Rcd = C1.Empl_Rcd

                     AND C1.Int_Proc_Stat IN ('BS','MS','OS')

                     AND C1.Effdt <= TO_DATE('01-SEP-2007'))

   AND C.Effseq = (SELECT MAX(C2.Effseq) FROM SYSADM.PS_Job C2

                    WHERE C.Emplid = C2.Emplid

                      AND C.Empl_Rcd = C2.Empl_Rcd

                      AND C2.Int_Proc_Stat IN ('BS','MS','OS')

                      AND C.Effdt = C2.Effdt)

   AND C.Paygroup in ('001','002')

   AND C.Company in

                 ('024','001','006','010','023','025','026','027',

                  '028','008','036','022','062','014','004','186')

   AND C.Empl_Status in ('A','P')

   AND C.Int_Proc_Stat IN ('BS','MS','OS')

   AND C.Reg_Region = 'USA'

The query has two predicates inside the subqueries, the Int_Proc_Stat column must be in the value string ('BS', 'MS', 'OS'), and the Effdt must be <= a particular date. It also has several predicates referencing the PS_JOB table in the main WHERE clause. So which predicates go where in the transformed query? The answer is simple: Predicates inside the subqueries are placed inside the in-line view, and the rest remain where they are.

Predicates inside the inline view limit the rows that are considered part of the group before the sorting takes place while predicates in the main query filter out whole groups. In this example, we filter out any rows that aren't effective by the current date or don't have a certain status before choosing the first record for the group. Then in the main query, we filter out any employee whose last record is not in a certain pay group, company, employment status, or region. The proc_stat check is redundant since every record allowed in the group will meet it due to it also occurring inside the inline view.

 When transformed to use the RANK function it looks like this:

SELECT C.Effdt, C.Effseq, C.Emplid, A.FIRST_NAME

  FROM (SELECT RANK ( ) OVER (PARTITION BY Emplid, Empl_Rcd

                              ORDER BY Effdt DESC, Effseq DESC) rank_order,

               Effdt,

               Effseq,

               EMPLID

          FROM PS_JOB

         WHERE Fmc_Int_Proc_Stat IN ('BS','MS','OS')

           AND Effdt <= TO_DATE('01-SEP-2007')

       ) C,

       PS_Personal_Data A,

 WHERE A.Emplid = C.Emplid

   AND C.rank_order = 1

   AND Paygroup in ('001','002')

   AND Company in

       ('024','001','006','010','023','025','026','027'

       ,'028','008','036','022','062','014','004','186')

   AND Empl_Status in ('A','P')

   AND Int_Proc_Stat IN ('BS','MS','OS')

   AND Reg_Region = 'USA'

The two subqueries are replaced by a single partitioning and ordering by effseq within effdt. The function returns the rank ordered record.

We then filter any record that is not first in line with the additional predicate:

 AND C.rank_order = 1

The original query would run for 30+ hours at a major manufacturing company as it interrogated the huge PS_JOB table again and again while the transformed query runs in 18 minutes.

When to use this technique

Note that this technique forces you to place the table access inside an inline view and execute that view before filtering out all but the first rank rows. That means the resultant data set may be large (if you don't have a selective filter) and won't have any indexes. 

The consequence is that you typically want to apply this technique only to the driving table of the execution plan. It makes no sense to start a query plan by selecting five employee records only to join them to a full scan of the jobs table. In that case, leave the subqueries alone so the five employee records will join to perhaps 10-20 job records. On the other hand, if you are retrieving most of the employee records you are better off performing a full table scan of the jobs table and a hash join of the results. The lack of an index on the jobs data doesn't affect the hash join, and you get an overall savings by pre-filtering the jobs data with a single table scan.

To view or add a comment, sign in

More articles by David Jordan

  • Get a Map! The Power of SQL Diagrams

    Suppose you fly into an unfamiliar city and pick up a rental car to drive to your hotel. The counter is out of maps and…

  • Query Tuning: Digging Deeper

    By David Jordan Like anyone who performs a lot of database tuning, I frequently receive poorly running SQL statements…

    1 Comment

Others also viewed

Explore content categories