In a Perfect World and SQL too

I remember one of my professors from my school days stating that “as a programmer you shouldn’t have to be concerned about what platform your code will run on, how memory is managed, how I/O is handled, or how to write efficient code. Any concerns you may have about such things should be handled by the compiler (or interpreter) and optimizers.” While it’s a nice thought and perhaps something to strive for, I never thought it to be exactly true and typically would amend his statement with “in a perfect world.” However, certain things have happened over the years to make that statement “more true” than ever.

  • Processors have gotten much faster, making inefficiencies in coding much less noticeable.
  • Memory and disk technologies have improved greatly (no more 100-user machines with 256 KB of RAM and 300 MB of disk storage).
  • Languages like Java have come about with their “write once, run anywhere” virtual machines, removing most implementation dependencies and allowing precompiled code to run regardless of the underlying computer architecture.
  • We also have Integrated Development Environments (IDEs) that help us code our best.
  • And we have optimizers. Optimizers are everywhere, integrated into every aspect of computing: machine code execution, memory management, disk management, network traffic management, and so on.

Many of the statements and concepts above hold true for databases as well. One of the biggest advancements in database querying was the development of Structured Query Language (SQL). SQL allows data from multiple tables to be joined together into a result table using one statement. Prior to SQL, programmers often had to read the tables individually and join them within their program by writing nested reads within nested loops. Queries that may have taken 20 to 100 lines of code can be done with a single SQL statement. As with traditional programming, development environments have been introduced to make query development (programming) easier, most of which include graphical interfaces allowing tables to be joined and queries to be developed without knowing SQL at all.
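For example, where a pre-SQL program would read one table and issue nested reads against another inside a loop, a single join does the same work declaratively (the table and column names here are purely hypothetical):

-- One statement replaces the nested read loops
SELECT o.OrderID, o.OrderDate, c.CustomerName
FROM Orders o
INNER JOIN Customers c
   ON c.CustomerID = o.CustomerID;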

All these advancements in technology certainly make one's job easier and open up computing and database querying to a whole new group of people, but with these advancements comes a loss of control. You are no longer telling the system how to do something; instead, you are telling the system what you want to do, and the system is figuring out how to do it. To alleviate some of the issues that can come from this loss of control, databases have also taken to employing memory management, disk management, and query optimization. In my opinion, databases (at least the larger commercially available ones and some open source databases) have some of the best technology in these areas, especially in the area of optimization. SQL optimizers can be quite impressive and can do things like moving search criteria from an outer query into inner/sub-queries to improve their performance, or optimizing sub-queries away completely.
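As a rough illustration of that kind of rewrite (a hypothetical example with made-up table and column names, not taken from any particular database's optimizer), a search criterion written against an outer query can often be pushed down into the sub-query so that far fewer rows flow out of it:

-- As written: the sub-query aggregates every region, then the outer query filters.
SELECT t.Region, t.Total
FROM (SELECT Region, SUM(Amount) AS Total
      FROM Sales
      GROUP BY Region) t
WHERE t.Region = 'East';

-- After the optimizer moves the criterion inward, it behaves as if written:
SELECT Region, SUM(Amount) AS Total
FROM Sales
WHERE Region = 'East'
GROUP BY Region;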

SQL (being based on a standard) looks basically the same from database to database. This can give the impression that similar looking queries will execute similarly regardless of the database, but that is not always true. Optimizers, memory management, and file management vary depending on what database you’re using – and tables and indices vary from schema to schema. This can make it important to look at how you write your queries.

There are some methods that can be applied to queries that will improve execution time regardless of the database, its optimizer, the schema, and its indices. Here's one such example involving the joining of three tables.

Let's say you want to look at the utilization of procedure codes by doctors (this could just as easily be cars sold by manufacturer and model, or many other queries of a similar type).

In this example we have the following tables:
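A minimal sketch of the schema, with illustrative (assumed) table and column names reconstructed from the discussion that follows:

CREATE TABLE Doctor (
    DoctorID     INT PRIMARY KEY,
    DoctorName   VARCHAR(100)
);

CREATE TABLE ProcCode (
    CodeID           INT PRIMARY KEY,
    CodeDescription  VARCHAR(200)
);

CREATE TABLE Activity (          -- 598,951 rows in this example
    ActivityID   INT PRIMARY KEY,
    DoctorID     INT,            -- references Doctor
    CodeID       INT,            -- references ProcCode
    ServiceYear  INT,
    ChargeQty    INT
);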

What I would like to do is look at the usage in 2015 of 15 particular codes by 6 particular doctors. I would like a complete matrix (i.e., all 15 codes for all 6 doctors, for a total of 90 rows in my result table). The list of doctors and codes can easily be produced with a query along the following lines:
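Here the table names, columns, and ID values are illustrative assumptions carried over from the schema sketch above:

-- Cross join every selected doctor with every selected code (6 x 15 = 90 rows).
SELECT d.DoctorID, d.DoctorName, c.CodeID, c.CodeDescription
FROM Doctor d
CROSS JOIN ProcCode c
WHERE d.DoctorID IN (101, 102, 103, 104, 105, 106)
  AND c.CodeID  IN (9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008,
                    9009, 9010, 9011, 9012, 9013, 9014, 9015)
ORDER BY d.DoctorName, c.CodeID;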

The result table would look something like the following and should contain 90 rows:

Now I would like to add the usage of these codes for each doctor in the year 2015. I could modify my query to look like the following; the LEFT OUTER JOIN of the Activity table and the grouping/summing are the new pieces. Let's call this "Query A":
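Again this is a sketch using the assumed schema and placeholder ID values:

-- Query A (sketch): join the raw Activity rows into the doctor/code matrix,
-- then group and sum the charge quantity.
SELECT d.DoctorID, d.DoctorName, c.CodeID,
       SUM(a.ChargeQty) AS Usage2015
FROM Doctor d
CROSS JOIN ProcCode c
LEFT OUTER JOIN Activity a
       ON  a.DoctorID    = d.DoctorID
       AND a.CodeID      = c.CodeID
       AND a.ServiceYear = 2015   -- filter kept in the ON clause so doctor/code
                                  -- pairs with no 2015 usage still appear as NULL
WHERE d.DoctorID IN (101, 102, 103, 104, 105, 106)
  AND c.CodeID  IN (9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008,
                    9009, 9010, 9011, 9012, 9013, 9014, 9015)
GROUP BY d.DoctorID, d.DoctorName, c.CodeID
ORDER BY d.DoctorName, c.CodeID;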

The result table would look something like the following and again should contain 90 rows:

I chose to use "LEFT OUTER JOIN" for the Activity table because I wanted a complete matrix returned (i.e., all codes for all doctors regardless of whether they used the code or not). You can see some rows in the table above with a null value, signifying that the doctor did not use that code. If I had used "INNER JOIN" it would have left out the rows without usage.

To best illustrate the next technique, I removed all indices from the Activity table, forcing the database to do a complete table scan of it (i.e., all 598,951 rows). The execution time for "Query A" above was 119.18 seconds.

Another way to write the query is like this. Let’s call this “Query B”:
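A sketch of "Query B" with the same assumed schema; the difference is that the Activity table is aggregated once in a sub-query (the derived table aliased "s"), with the doctor/code filters repeated inside it so the summarized result stays small, before being joined to the doctor/code matrix:

-- Query B (sketch): aggregate Activity once, then join the small summary.
SELECT d.DoctorID, d.DoctorName, c.CodeID, s.Usage2015
FROM Doctor d
CROSS JOIN ProcCode c
LEFT OUTER JOIN (
        -- one pass over Activity, restricted to the doctors/codes of interest
        SELECT a.DoctorID, a.CodeID, SUM(a.ChargeQty) AS Usage2015
        FROM Activity a
        WHERE a.ServiceYear = 2015
          AND a.DoctorID IN (101, 102, 103, 104, 105, 106)
          AND a.CodeID  IN (9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008,
                            9009, 9010, 9011, 9012, 9013, 9014, 9015)
        GROUP BY a.DoctorID, a.CodeID
     ) s
       ON  s.DoctorID = d.DoctorID
       AND s.CodeID   = c.CodeID
WHERE d.DoctorID IN (101, 102, 103, 104, 105, 106)
  AND c.CodeID  IN (9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008,
                    9009, 9010, 9011, 9012, 9013, 9014, 9015)
ORDER BY d.DoctorName, c.CodeID;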

The result table would look the same as the result table from "Query A", but the execution time was significantly faster: "Query B" took only 6.5 seconds, a savings of 112.68 seconds. Why such a significant time savings? The source and result tables still had the same numbers of rows, the optimizer was still active, and "Query B" actually looks more complicated.

Even with optimizers, you can’t always tell what they are going to do and they don’t always pick the best way to get at the data. Therefore, I like to look at ways of writing better queries regardless of optimization.

In the case of "Query A", we can't be sure whether or not the database did a complete table scan of the Activity table for each doctor and code combination (i.e., 6 doctors * 15 codes * 598,951 activity rows = 53,905,590 rows worst case). Additionally, we don't know what the actual usage of these codes is. If the usage for each doctor/code combination is 1,000 rows, then before we can group the data and sum up the charge quantity, we're looking at an intermediate table of 6 doctors * 15 codes * 1,000 activity rows = 90,000 rows to be grouped and summed.

Looking at "Query B", the inner/sub-query runs independently of the outer query. Since we removed all indices we know a full table scan will be done (i.e., 598,951 rows), but it should only be done once. Using the numbers from the preceding paragraph (i.e., usage activity of 1,000 rows per doctor/code combination), the resulting intermediate table (before the grouping and summing) will still contain 90,000 rows; however, after it is grouped and summed it will only contain 90 rows. This resulting table (named "table s" in the query above) with 90 rows is then joined with the 6 doctors * 15 procedures (also 90 rows). This is a lot less joining than "Query A" and is where part of the speed comes from (the other part being that we know the Activity table is only scanned once). Please note: this is the worst case. In actuality each doctor does not use each code, so we're looking at a sparse matrix here. This decreases the number of rows to be grouped and summed in the intermediate table and also decreases the number of resulting rows in "table s", again meaning fewer rows to join and adding to the performance increase.

So, while advances in technology are good and tremendous strides have been made, you may still want to look closely at your code and how the system reacts to it. Someday we may not need to be concerned about efficiency, but that day is not here yet. And, even when that day is here, someone will still have to write the optimizers….
