On the issue of SQL tuning
What flaw in the modern DBMS world should be corrected first?
I think it is weird that we still need to hand-tune SQL queries.
SQL was designed to resemble ordinary human language, and human language is inherently redundant. So why do modern DBMS optimisers fail to cope with redundancy in the description of the data being retrieved?
Let me give you an example. Let's take the public Global Power Plant Database (https://datasets.wri.org/dataset/globalpowerplantdatabase) - power generation is a popular topic nowadays - and PostgreSQL 16.
Let's see how accurately the DBMS predicts the number of records returned if we ask it to list all power plants in the Russian Federation:
EXPLAIN ANALYZE
SELECT source FROM power_plants
WHERE country = 'RUS';
Seq Scan on power_plants (rows=544) (actual rows=545)
Filter: ((country)::text = 'RUS'::text)
The actual number of rows is 545, and the predicted number is 544. The optimiser coped with great accuracy.
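Under the hood, this estimate comes from the per-column statistics collected by ANALYZE: if 'RUS' appears in the column's most-common-values (MCV) list, the estimate is simply its sampled frequency times the table size. A minimal sketch of that arithmetic (the table size and frequency here are illustrative assumptions, not values read from the database):

```python
# Hypothetical single-column estimate, mimicking how the planner uses
# the most-common-values (MCV) list from pg_stats.
table_rows = 34936                  # assumed table size (illustrative)
mcv_freqs = {"RUS": 544 / 34936}    # sampled frequency of 'RUS' (illustrative)

def estimate_rows(value: str) -> int:
    """Estimated row count for: WHERE country = <value>."""
    return round(table_rows * mcv_freqs.get(value, 0.0))

print(estimate_rows("RUS"))  # -> 544
```

For values not in the MCV list the real planner falls back to histograms and n_distinct; this sketch just returns zero.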
Now let's ask it to do the same thing, but add some redundancy to the query:
EXPLAIN ANALYZE
SELECT * FROM power_plants
WHERE country = 'RUS' AND country_long = 'Russia';
Seq Scan on power_plants (rows=8) (actual rows=545)
Filter: (((country)::text = 'RUS'::text) AND (country_long = 'Russia'::text))
The actual number of rows returned stays the same, since the three-letter code uniquely identifies the country. The predicted number, however, has collapsed dramatically to 8.
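The collapse is easy to reproduce on paper: lacking any multivariate statistics, the planner assumes the two predicates are independent and multiplies their selectivities. A sketch of that arithmetic, assuming a table of roughly 35,000 rows (an illustrative figure, not read from the database):

```python
# Why the estimate collapses: without multivariate statistics the planner
# treats the two (perfectly correlated) predicates as independent and
# multiplies their selectivities. Numbers are illustrative.
table_rows = 34936
sel_code = 545 / table_rows     # selectivity of country = 'RUS'
sel_name = 545 / table_rows     # selectivity of country_long = 'Russia'

estimate = table_rows * sel_code * sel_name   # independence assumption
print(round(estimate, 1))  # -> 8.5, near the planner's rows=8, far from 545
```

The correct answer would be to apply either selectivity once, since the predicates are redundant; multiplying them squares an already small fraction.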
OK, let's create some extended statistics and fix the situation:
CREATE STATISTICS extstat (ndistinct, dependencies, mcv)
  ON country, country_long FROM power_plants;
With these statistics in place, the estimate is accurate again:
Seq Scan on power_plants (rows=543) (actual rows=545)
Filter: (((country)::text = 'RUS'::text) AND (country_long = 'Russia'::text))
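What changed is that the planner no longer multiplies the two selectivities: the multivariate MCV list lets it look up the joint frequency of the pair ('RUS', 'Russia') directly. Schematically (again with illustrative numbers):

```python
# Joint-frequency lookup, mimicking the multivariate MCV list created by
# CREATE STATISTICS ... (mcv). Numbers are illustrative.
table_rows = 34936
joint_mcv = {("RUS", "Russia"): 543 / 34936}   # sampled joint frequency

def estimate_pair(code: str, name: str) -> int:
    """Estimated rows for: WHERE country = <code> AND country_long = <name>."""
    return round(table_rows * joint_mcv.get((code, name), 0.0))

print(estimate_pair("RUS", "Russia"))  # -> 543, close to the 545 actual rows
```

One lookup of the combination replaces the faulty product of two per-column fractions.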
We have done it manually. However, let's put ourselves in the position of a startup or online-store developer. Such a developer focuses on the application and works with the database indirectly, often through a framework. They hardly know what happens inside the database or which specific queries are executed there. So why do DBMSs still lack the functionality to analyse and identify relationships in the data they store?
Well, let's say the previous query was too artificial. Let's formulate a more plausible one, gradually adding clarifying predicates to the WHERE clause, and see how the accuracy of the row-count estimate changes:
EXPLAIN ANALYZE
SELECT * FROM power_plants
WHERE ...
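The trend is easy to model even without running the full experiment: under the independence assumption, each additional (redundant) predicate multiplies the estimate by its own selectivity, so the estimated row count decays geometrically. A sketch with illustrative numbers:

```python
# Each redundant predicate has the same true selectivity (~545/35000),
# but the planner multiplies them as if they were independent.
table_rows = 34936
sel = 545 / table_rows

for n_predicates in range(1, 5):
    estimate = table_rows * sel ** n_predicates
    print(n_predicates, round(estimate, 2))  # 545.0, 8.5, 0.13, 0.0 ...
```

Three redundant predicates are already enough to push the estimate below one row, while the true answer is still 545.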
As you can see, out of the box the DBMS has no information about the joint distribution of column values, and the estimated number of returned rows quickly falls to zero. In frameworks that generate complex queries with many JOIN and GROUP BY operators, such an error at the basic level of estimating a simple fetch from a single table can be fatal for the choice of the optimal query execution plan - especially if you remember the optimiser's habit of using a Nested Loop Join on small inputs :). What to do? Let's try to find a solution next time ...