Get a Map! The Power of SQL Diagrams
Suppose you fly into an unfamiliar city and pick up a rental car to drive to your hotel. The counter is out of maps and GPS is unavailable, but the attendant helpfully writes out turn-by-turn directions to get you to your hotel.
How would you know if it was the most efficient route? What if you suspected it was not and you wanted to find the best route. What would you do?
You could look at the directions and try to get a feel if they look reasonable based on years of experience driving, but if the route is complicated and the city unfamiliar you wouldn’t have much of a chance.
You could try modifications of the route, by turning left instead of right, or going straight and turn at the next intersection instead. Again, if the trip is long and the route complicated you are going to spend a long time testing various routes and in the end you won’t be sure you found the best available. In fact, the task would be hopeless enough that you wouldn't even consider it.
If you had a map of the city, in addition to the turn-by-turn directions, the task suddenly becomes simple. You can determine the best route just by examining the map and compare that to the route the attendant gave you. In fact, you wouldn’t require the directions at all, you could simply determine the best route directly from the map.
You have been stuck in turn-by-turn mode your entire SQL career
Execution plans are the turn-by-turn directions that the database optimizer takes to execute your SQL statement. You can easily estimate what a future plan might be, examine past plans, or even watch the progress of a current execution plan using a tool like Oracle’s SQL*Monitor. This seems like a wealth of information, doesn't it? But is it enough to discover the best execution plan? No! Just like our driving directions, we can’t determine the best plan because we don’t have a map. And lacking a map, we are forced to guess and apply modifications in the hope we get something better. If we are lucky, we might find a faster plan in history, but is it really better for all cases or does it only work with a specific data set?
So we are forced to rely upon past experience and rules of thumb to try various plan modifications in the hope of obtaining faster results. We might swap out a Hash Join for a Nested Loop join, or alter the join order with no idea if it will help or not. If we do obtain an improvement we do not know if we could do better or if the improvement will hold up over time.
There is a better way!
It is time you started using SQL maps! It may come as a surprise, but SQL maps (diagrams) have been available since 2003 when Dan Tow published “SQL Tuning – Generating Optimal Execution Plans”. In the book, Dan describes the simple technique of drawing a directed graph to represent any SQL query. He also demonstrated how to use that diagram to determine an optimal plan (there may be more than one) with no guessing. It is just like looking at a street map to determine the best route to your destination. This is a phenomenal achievement, a mechanical method to determine the best possible execution plan for any SQL query.
But that's not all...
The optimizer is limited to working with the query as given. In my long career I found that very often a poorly run query has structural issues that prevent an efficient execution. Cartesian products, outer joins leading to inner joins, and filtered outer joins may all cause extra work or even incorrect results. These problems require a rewrite that no hint, pinned plan, or automated tuner can fix. No optimizer, no matter how advanced is able to fix a broken query. You have to do that yourself, but these problems are invisible in the text and you will only become aware of them when you draw the diagram.
So what does a SQL Diagram look like?
Diagrams are made up of only a handful of elements:
- Tables are represented by nodes named with the alias
- Filter conditions are represented by a percentage, so a filter that passes one third of all rows will have 0.33 inside the node
- Joins are represented by lines connecting the nodes
- Arrow heads are placed on the end of a line if the join to that node is into a unique key
- Outer joins are represented with dashed lines and the optional side has a hollow head or tail marker
SELECT *
FROM Departments Dept,
Employees Emp
WHERE Dept.Dept_no = Emp.Dept_no
AND Emp.Last_Name = 'SMITH'
It looks a lot like an ER diagram, but it includes only the tables referenced in the query where each table appears once per reference. It adds the filter percentages and the type of join specified in the query which might be different from the ER diagram if the query leaves out part of the join.
Now let us look at a slightly more complex query from Dan Tow's book:
SELECT C.Phone_Number, C.Honorific, C.First_Name, C.Last_Name
C.Suffix, C.addressID, A.address_ID, A.Street_Address_Line1,
A.Street_Address_Line2, A.City_Name, A.State_Abbreviation,
A.ZIP_Code, OD.Deferred_Shipment_Date, OD.Item_Count,
ODT.Text, OT.Text, P.Product_Description, S.Shipment_Date
FROM Orders O, Order_Details OD, Products P, Customers C, Shipments S,
Addresses A, Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name) LIKE (:Last_Name || '%')
AND UPPER(C.First_Name) LIKE (:First_Name || '%')
AND OD.Order_ID = O.Order_ID
AND O.Customer_ID = C.Customer_ID
AND OD.Product_ID = P.Product_ID(+)
AND OD.Shipment_ID = S.Shipment_ID(+)
AND S.Address_ID = A.Address_ID(+)
AND O.Status_Code = OT.Code
AND OT.Code_Type = 'ORDER_STATUS'AND OD.Status_Code = ODT.Code
AND ODT.Code_Type = 'ORDER_DETAIL_STATUS'AND O.Order_Date > :Now - 366
ORDER BY C.Customer_ID, O.Order_ID DESC, S.Shipment_ID, OD.Order_Detail_ID;
Even something as simple as this is impossible to visualize accurately from the text; however, draw the diagram and it is immediately clear:
This query answers a question about Order Details since it is the top level node. It joins many to one to Orders, Shipments, Product, and Order Detail code Translation. Orders joins to Customers, Order code Translation, and Shipments joins to Addresses.
The joins to Shipments and Addresses and Products are outer joined. The full diagram shows the number of "many" in the One-to-Many by labeling the joins. For example, there is an average of three Order Detail records per Order record. The numbers inside the nodes denotes the filter percentage in the WHERE clause. So we expect 30% of the Orders and only 0.02% of Customers to pass the query filter.
The diagram reveals we want to drive the query from the Customer table since it has the greatest filter effect. So, we know we need an index on the filter columns. Then we need a Foreign Key (FK) index on Orders so we can join without a full table scan of Orders. For the same reason we need a FK index on Order Details to support the join up from Orders. If we lack any of these, it will prevent the optimizer from generating the optimal execution plan. Given the map, we can easily trace the most efficient route and determine precisely which indexes are required to travel that route.
Structural problems are immediately evident
Let us modify the query just a bit:
SELECT DISTINCT C.Phone_Number, C.Honorific, C.First_Name, C.Last_Name
C.Suffix, C.addressID, A.address_ID, A.Street_Address_Line1,
A.Street_Address_Line2, A.City_Name, A.State_Abbreviation,
A.ZIP_Code, OD.Deferred_Shipment_Date, OD.Item_Count,
ODT.Text, OT.Text, P.Product_Description, S.Shipment_Date
FROM Orders O, Orders O2, Order_Details OD, Products P, Customers C,
Shipments S, Addresses A, Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name) LIKE (:Last_Name || '%')
AND UPPER(C.First_Name) LIKE (:First_Name || '%')
AND OD.Order_ID = O.Order_ID
AND O.Customer_ID = C.Customer_ID
AND OD.Product_ID = P.Product_ID(+)
AND OD.Shipment_ID = S.Shipment_ID(+)
AND S.Address_ID = A.Address_ID
AND O.Status_Code = OT.Code
AND O.Customer_ID = O2.Customer_ID
AND O.Order_DT < O2.Order_DT + 30
AND P.Product_Status <> "Backorder"
AND OT.Code_Type = 'ORDER_STATUS'AND OD.Status_Code = ODT.Code
AND ODT.Code_Type = 'ORDER_DETAIL_STATUS'AND O.Order_Date > :Now - 366
ORDER BY C.Customer_ID, O.Order_ID DESC, S.Shipment_ID, OD.Order_Detail_ID;
Can you spot the three structural issues? They are difficult to see in the query text, but obvious in the diagram:
The first one we notice is the Many-to-Many self join on the Orders table. The developer added this join to return only those orders that were placed less than 30 days after a previous order. While it does accomplish that, it also returns duplicates which must then be filtered out at further expense. The join should be converted into an EXISTS subquery to avoid duplication.
The second issue is the inner join S -> A following the outer join OD -> S. Any rows preserved via the outer join will be discarded during the second join because the NULL address in Shipments will not match any row in Address.
The third issue is the filter on the outer joined Products table. Again any row preserved due to the outer join will have NULL for Product_Status column which will not be "<>" to any value since NULL is both not equal and not "not equal" to any value including NULL itself.
No optimizer can correct these errors. While powerful servers can hide the inefficiency of the first error, they can't bring back mistakenly discarded rows.
You owe it to yourself to learn this powerful technique
Diagrams give you the power to uncover structural issues in even the most complex queries, problems that would otherwise confound the optimizer and perhaps even return an incorrect result set. Not knowing how to diagram SQL is like not knowing how to use a map.