On the issue of SQL tuning
What flaw in the modern DBMS world should be corrected first?
I think it is weird that we still need to hand-tune SQL queries.
SQL was designed to resemble ordinary human language, and human language is inherently redundant. So why do modern DBMS optimisers fail to cope with redundancy in the description of the data being retrieved?
Let me give you an example. Let's take the public Global Power Plant Database (https://datasets.wri.org/dataset/globalpowerplantdatabase) - power generation is a popular topic nowadays - and PostgreSQL 16.
Let's see how accurately the DBMS predicts the number of records returned if we ask it to list all power plants in the Russian Federation:
EXPLAIN ANALYZE
SELECT source FROM power_plants
WHERE country = 'RUS';
Seq Scan on power_plants (rows=544) (actual rows=545)
Filter: ((country)::text = 'RUS'::text)
The actual number of rows is 545, and the predicted number is 544. The optimiser coped with great accuracy.
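Under the hood, this estimate comes from the per-column statistics collected by ANALYZE: if 'RUS' appears in the column's most-common-values (MCV) list, the estimate is simply its sampled frequency times the table size. A minimal sketch of that arithmetic (the table size and frequency here are illustrative assumptions, not values read from the database):

```python
# Hypothetical single-column estimate, mimicking how the planner uses
# the most-common-values (MCV) list from pg_stats.
table_rows = 34936                  # assumed table size (illustrative)
mcv_freqs = {"RUS": 544 / 34936}    # sampled frequency of 'RUS' (illustrative)

def estimate_rows(value: str) -> int:
    """Estimated row count for: WHERE country = <value>."""
    return round(table_rows * mcv_freqs.get(value, 0.0))

print(estimate_rows("RUS"))  # -> 544
```

For values not in the MCV list the real planner falls back to histograms and n_distinct; this sketch just returns zero.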
Now let's ask it to do the same thing, but add some redundancy to the query:
EXPLAIN ANALYZE
SELECT * FROM power_plants
WHERE country = 'RUS' AND country_long = 'Russia';
Seq Scan on power_plants (rows=8) (actual rows=545)
Filter: (((country)::text = 'RUS'::text) AND (country_long = 'Russia'::text))
The actual number of rows returned stays the same, since the three-letter code uniquely identifies the country. The predicted number, however, has collapsed dramatically to 8.
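The collapse is easy to reproduce on paper: lacking any multivariate statistics, the planner assumes the two predicates are independent and multiplies their selectivities. A sketch of that arithmetic, assuming a table of roughly 35,000 rows (an illustrative figure, not read from the database):

```python
# Why the estimate collapses: without multivariate statistics the planner
# treats the two (perfectly correlated) predicates as independent and
# multiplies their selectivities. Numbers are illustrative.
table_rows = 34936
sel_code = 545 / table_rows     # selectivity of country = 'RUS'
sel_name = 545 / table_rows     # selectivity of country_long = 'Russia'

estimate = table_rows * sel_code * sel_name   # independence assumption
print(round(estimate, 1))  # -> 8.5, near the planner's rows=8, far from 545
```

The correct answer would be to apply either selectivity once, since the predicates are redundant; multiplying them squares an already small fraction.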
OK, let's create some extended statistics and fix the situation:
CREATE STATISTICS extstat (ndistinct, dependencies, mcv)
  ON country, country_long FROM power_plants;
With these statistics in place, the estimate is accurate again:
Seq Scan on power_plants (rows=543) (actual rows=545)
Filter: (((country)::text = 'RUS'::text) AND (country_long = 'Russia'::text))
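What changed is that the planner no longer multiplies the two selectivities: the multivariate MCV list lets it look up the joint frequency of the pair ('RUS', 'Russia') directly. Schematically (again with illustrative numbers):

```python
# Joint-frequency lookup, mimicking the multivariate MCV list created by
# CREATE STATISTICS ... (mcv). Numbers are illustrative.
table_rows = 34936
joint_mcv = {("RUS", "Russia"): 543 / 34936}   # sampled joint frequency

def estimate_pair(code: str, name: str) -> int:
    """Estimated rows for: WHERE country = <code> AND country_long = <name>."""
    return round(table_rows * joint_mcv.get((code, name), 0.0))

print(estimate_pair("RUS", "Russia"))  # -> 543, close to the 545 actual rows
```

One lookup of the combination replaces the faulty product of two per-column fractions.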
We have done it manually. However, let's put ourselves in the position of a startup or online-store developer. Such a developer focuses on the application and works with the database indirectly, often through a framework. They hardly know what happens inside the database or which specific queries are executed there. So why do DBMSs still lack the functionality to analyse and identify relationships in the data they store?
Well, let's say the previous query was too artificial. Let's formulate a more plausible one, gradually adding clarifying predicates to the WHERE clause, and see how the accuracy of the row-count estimate changes:
EXPLAIN ANALYZE
SELECT * FROM power_plants
WHERE ...
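The trend is easy to model even without running the full experiment: under the independence assumption, each additional (redundant) predicate multiplies the estimate by its own selectivity, so the estimated row count decays geometrically. A sketch with illustrative numbers:

```python
# Each redundant predicate has the same true selectivity (~545/35000),
# but the planner multiplies them as if they were independent.
table_rows = 34936
sel = 545 / table_rows

for n_predicates in range(1, 5):
    estimate = table_rows * sel ** n_predicates
    print(n_predicates, round(estimate, 2))  # 545.0, 8.5, 0.13, 0.0 ...
```

Three redundant predicates are already enough to push the estimate below one row, while the true answer is still 545.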
As you can see, out of the box the DBMS has no information about the joint distribution of column values, and the estimated number of returned rows quickly falls to zero. In frameworks that generate complex queries with many JOIN and GROUP BY operators, such an error at the basic level of estimating a simple fetch from a single table can be fatal for the choice of the optimal query execution plan - especially if you remember the optimiser's habit of using a Nested Loop Join on small inputs :). What to do? Let's try to find a solution next time ...