Query Tuning: Digging Deeper
By David Jordan
Like anyone who performs a lot of database tuning, I frequently receive poorly running SQL statements from unfamiliar applications. In one very memorable case, I received the following:
SELECT MIN(S.REC_NUM)
INTO StuRecNum
FROM STUDENT S
WHERE S.REC_NUM > StuRecNum
AND EXISTS (SELECT *
FROM TESTS T
WHERE T.STUDENT = S.REC_NUM
);
The note told me the statement was consuming six hours in the QA system and only two hours in production with identical data sets. I checked, and sure enough, QA was using a Nested Loop join while production was performing a Hash join of a full scan of each tables' index. So we have our answer, just toss in a hint or pin the good execution plan from production and QA will be fixed. Most automated tuning advisers will suggest one or the other, and the temptation is to apply the proffered solution and be done with it.
Unfortunately, quick fixes are often sub-optimal and brittle. Hints can get lost if the query was auto-generated by a development tool and pinned execution plans can get disconnected if the query's hash value is changed because the text is modified. If that happens, you have to diagnose the problem all over again. Ideally, you want a stable fix, one where the query optimizer generates the optimal plan on its own.
There is always a reason the optimizer is having a problem calculating the best plan and to discover that reason we need to dig deeper into the query.
Often the reason the optimizer is having difficulties is due to complexity, which generates a large number of options that cannot be costed accurately. Other times it is due to poor structure that forces a huge intermediate result set most of which is later discarded. In this case, it was even more fundamental: the query was asking the wrong question.
After writing SQL queries for years, you get a feel for good structure and this one threw up a huge red flag.
SELECT MIN(S.REC_NUM)
INTO StuRecNum
FROM STUDENT S
WHERE S.REC_NUM > StuRecNum
Why is it selecting the minimum value of the column REC_NUM that is greater than a bind variable? The record number is a meaningless system assigned key. Why is it asking for the next record greater than the one in hand in this manner? What business meaning could that possibly have?
To answer that question we need to look to the application for context:
PROCEDURE CalcAllTests
IS
CURSOR testsCur(StuRecParm NUMBER)
IS SELECT REC_NUM
FROM TESTS
WHERE STUDENT = StuRecParm AND
(LOCKED IS NULL OR LOCKED <> 'A')
ORDER BY DATE, TIME;
recNum NUMBER(10);
StuRecNum NUMBER(10);
BEGIN
StuRecNum := 0;
LOOP
SELECT MIN(S.REC_NUM)
INTO StuRecNum
FROM STUDENT S
WHERE S.REC_NUM > StuRecNum AND
EXISTS (SELECT *
FROM TESTS T
WHERE T.STUDENT = S.REC_NUM);
EXIT WHEN StuRecNum IS NULL;
OPEN testsCur(StuRecNum);
LOOP
FETCH testsCur
INTO recNum;
EXIT WHEN testsCur%NOTFOUND;
CalcTest(recNum, 'n', 'y', 30);
END LOOP;
COMMIT; -- commit after each student
CLOSE testsCur;
END LOOP;
END;
A review of the code shows the programmer is using the StuRecNum variable as a manual pointer into the student table and each execution of the query shifts the pointer one record.
BEGIN
StuRecNum := 0; -- Set the pointer before the first student record
LOOP
SELECT MIN(S.REC_NUM) -- Increment the pointer to the next record
INTO StuRecNum
FROM STUDENT S
WHERE S.REC_NUM > StuRecNum ...
...
END LOOP;
The business logic is a simple full table scan of the Student table that could be accomplished in a single execution by:
SELECT S.REC_NUM
INTO StuRecNum
FROM STUDENT S
WHERE ...
But rather than touching each record once, the problem query finds and sorts N records the first time, then N-1, N-2, N-3, …, 1, 0. That integer series sums to N(N + 1)/2 which while not geometric is still overly sensitive to table size.
The reason it was taking a minimum of two hours is now quite clear. The reason the Hash join is faster than the Nested Loop join is also apparent. The efficiency break-even point for a Nested Loop join via index access is typically around 20% of the table, but the average result set for all executions is of course 50%. The Nested Loop join is a bad choice, but executing N full table scans and Hash joins is also a terrible choice.
The developer did not understand that they could have just opened a cursor on the student table and fetched from it and the database would keep track of the cursor's position in the data set. So how might such extreme procedural coding have come about? My impression is that this was a file based or IMS application that was re-coded for relational database storage, but translated literally rather than being interpreted. It has the flavor of a hierarchical database in that it steps through each parent record (student) before stepping though the child record (test). This procedural navigation is standard in that environment and someone not familiar relational data may not realize that it wasn't necessary.
After finding such a fundamental issue we are compelled to dig deeper yet ...
Further investigation revealed that every record in the TEST table has a matching record in the student table ( there is no join filter action) so in relational environment the student table need never be referenced at all. The following would suffice to replace the entire procedure:
CURSOR TEST_CUR IS
SELECT REC_NUM
FROM TESTS
WHERE (LOCKED IS NULL OR LOCKED <> 'A')
ORDER BY STUDENT, DATE, TIME;
BEGIN
FOR TEST_REC IN TEST_CUR LOOP
CalcTest(recNum, 'n', 'y', 30);
END LOOP;
COMMIT;
This cursor retains the ordering of test within student without touching the student records (which it turns out wasn't even needed). This version commits only once which would dramatically reduce log file sync waits (each student has only a few tests) and also avoids fetching across commits.
We aren't to the bottom of the hole yet ...
I had a discussion with the trouble ticket requester where I asked him the purpose of the routine. He reported that the manager in charge was suspicious of an application modification, thinking it might corrupt some test records. and this was the verification routine. I asked when that change was made and he replied it was over two years ago. I asked if any corruption was ever discovered in those two years. He replied none. I asked if the entire routine might be eliminated and he admitted that yes, it wasn't required anymore.
We went from a six-hour run to two hours with a hint, then down to a few minutes by refactoring the application, to a 100% reduction in execution time by eliminating the whole job.
Lessons:
If you are tuning SQL from an unfamiliar application, don't assume the statement is written correctly. In fact don’t even assume it is required! Dig a little deeper if it doesn't feel right. While most code does make sense you will eventually run across something like this.
If you are a developer take the time to fully understand the power of set processing and SQL. When you code a problem proceduraly you are forcing a specific execution plan, one that more often than not is sub-optimal. While you may not be rigidly converting non-relational code to a relational DB or considering N(N + 1)/2 operations for a simple table scan you can still easily fall prey to the procedural trap.
Great article