DuckDB lambda functions (HOFs)
Modern relational database management systems (RDBMS) support complex data types such as arrays, maps, and structs, and file formats like Parquet and ORC handle these types efficiently. Leveraging these features opens up new possibilities for modeling and processing data. Many vendors also support higher-order functions (HOFs), also known as lambda functions, which allow efficient data processing within array/list column types. This trend reflects the growing demand for flexible data modeling and advanced processing capabilities in relational systems (among them: DuckDB, Snowflake, ksqlDB, and Apache Spark).
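As a quick taste of the syntax, here is a minimal lambda applied to a list literal (the arrow notation x -> ... is DuckDB's lambda syntax):
-- double every element of a list with a lambda
select list_transform([1, 2, 3], x -> x * 2) as doubled;
-- returns [2, 4, 6]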
To illustrate the expressive power of these SQL extensions, I will use DuckDB (you can run the examples below in either the DuckDB WASM shell or the DuckDB CLI). Data engineering pipelines often contain data conversion and validation steps. In the example below, we have a pipeline that processes CSV data. The file contains three columns: t_key, city_names, and populations. The latter two columns hold city names and their population counts, encoded as comma-separated string values. The SQL view below represents the CSV input:
-- inline view simulating the parsed CSV input
create or replace view t_cities as
select t_key, city_names, populations
from (values ('t_1', 'Marburg,New York,Amsterdam,Heidelberg', 'P78.203,8804190,921402,162273'))
as stats_t(t_key, city_names, populations);
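To sanity-check the input, you can query the view directly:
select * from t_cities;
-- t_1 | Marburg,New York,Amsterdam,Heidelberg | P78.203,8804190,921402,162273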
Note that the populations string contains a value with an unexpected format: P78.203.
The resulting table should contain only the valid, converted data in the following schema:
┌─────────┬────────────┬────────────┬─────────┐
│ t_key │ city_name │ population │ ct_cat │
│ varchar │ varchar │ int32 │ varchar │
├─────────┼────────────┼────────────┼─────────┤
│ t_1 │ New York │ 8804190 │ city │
│ t_1 │ Amsterdam │ 921402 │ city │
│ t_1 │ Heidelberg │ 162273 │ town │
└─────────┴────────────┴────────────┴─────────┘
The last column represents the category: ‘town’ if the population is at most 200,000 and ‘city’ otherwise.
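The categorization itself is a plain CASE expression; for example:
-- 162273 falls at or below the 200,000 threshold
select case when 162273 <= 200000 then 'town' else 'city' end as ct_cat;
-- returns 'town'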
First, we split each comma-separated string and apply the list_zip function (with the functional signature [a] -> [b] -> [(a, b)]) to merge both lists.
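In isolation, list_zip pairs up the elements of two lists (a minimal sketch):
-- zip two list literals into a list of pairs
select list_zip([1, 2], ['a', 'b']) as zipped;
-- returns [(1, a), (2, b)]
Applied to our view, the resulting list looks like this: [(Marburg, P78.203), (New York, 8804190), ...].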
select
    t_key,
    list_zip(string_split(city_names, ','), string_split(populations, ',')) as merged_cities
from t_cities;
Result record: "t_1, [(Marburg, P78.203), (New York, 8804190), (Amsterdam, 921402), (Heidelberg, 162273)]"
Now, let’s modify the query above by adding the list_transform function for format conversion. We’ll use the try_cast function, which returns NULL instead of raising an error when the conversion fails.
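In isolation (a minimal sketch):
-- a cast that cannot succeed yields NULL rather than an error
select try_cast('P78.203' as int) as p;
-- returns NULL
Wired into the pipeline: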
select
    t_key,
    list_zip(
        string_split(city_names, ','),
        list_transform(string_split(populations, ','), x -> try_cast(x as int))
    ) as merged_cities
from t_cities;
Result record: "t_1, [(Marburg, NULL), (New York, 8804190), (Amsterdam, 921402), (Heidelberg, 162273)]"
Next, we can add another transform (map) step to create a named struct with three fields: c_n for name, c_p for population, and c_c for category.
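DuckDB’s struct literal syntax {key: value, ...} makes this concise; in isolation (a minimal sketch with illustrative values):
-- build a named struct from literals
select {c_n: 'Berlin', c_p: 3664088, c_c: 'city'} as city_struct;
-- returns {'c_n': Berlin, 'c_p': 3664088, 'c_c': city}
In the pipeline, each zipped pair becomes such a struct: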
select
    t_key,
    list_transform(
        list_zip(
            string_split(city_names, ','),
            list_transform(string_split(populations, ','), x -> try_cast(x as int))
        ),
        x -> {
            c_n: x[1],
            c_p: x[2],
            c_c: case when x[2] <= 200000 then 'town' else 'city' end
        }
    ) as merged_cities
from t_cities;
Result record: "t_1, [{'c_n': Marburg, 'c_p': NULL, 'c_c': city}, {'c_n': New York, 'c_p': 8804190, 'c_c': city}, {'c_n': Amsterd… "
However, the entry for Marburg still appears, with a NULL population and the wrong category: NULL <= 200000 evaluates to NULL, so the CASE expression falls through to the else branch and yields ‘city’. To address this, we can add the list_filter function to drop any pair with a “malformed” (NULL) population before the structs are built.
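The NULL fall-through is easy to verify in isolation (a minimal sketch):
-- NULL comparisons yield NULL, so the CASE takes the else branch
select case when NULL <= 200000 then 'town' else 'city' end as ct_cat;
-- returns 'city'
With the filter in place: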
select
    t_key,
    list_transform(
        list_filter(
            list_zip(
                string_split(city_names, ','),
                list_transform(string_split(populations, ','), x -> try_cast(x as int))
            ),
            x -> x[2] is not null  -- keep only pairs whose population parsed successfully
        ),
        x -> {
            c_n: x[1],
            c_p: x[2],
            c_c: case when x[2] <= 200000 then 'town' else 'city' end
        }
    ) as merged_cities
from t_cities;
We obtain the final result by applying unnest to “flatten” (explode) the list into one row per struct and then accessing the struct fields.
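In isolation, unnest turns a list into rows (a minimal sketch):
-- explode a list into one row per element
select unnest([10, 20, 30]) as n;
-- returns three rows: 10, 20, 30
The following query puts everything together: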
select t_key,
       city_struct.c_n as city_name,
       city_struct.c_p as population,
       city_struct.c_c as ct_cat
from (
    select
        t_key,
        unnest(  -- explode the list: one row per struct
            list_transform(
                list_filter(
                    list_zip(
                        string_split(city_names, ','),
                        list_transform(string_split(populations, ','), x -> try_cast(x as int))
                    ),
                    x -> x[2] is not null  -- drop pairs with malformed populations
                ),
                x -> {
                    c_n: x[1],
                    c_p: x[2],
                    c_c: case when x[2] <= 200000 then 'town' else 'city' end
                }
            )
        ) as city_struct
    from t_cities
);
The resulting table looks like this:
┌─────────┬────────────┬────────────┬─────────┐
│ t_key │ city_name │ population │ ct_cat │
│ varchar │ varchar │ int32 │ varchar │
├─────────┼────────────┼────────────┼─────────┤
│ t_1 │ New York │ 8804190 │ city │
│ t_1 │ Amsterdam │ 921402 │ city │
│ t_1 │ Heidelberg │ 162273 │ town │
└─────────┴────────────┴────────────┴─────────┘
Because DuckDB runs embedded in your host language, you can template and parameterize SQL pipelines like this one directly in application code. Give it a try!
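If you prefer to stay in SQL, DuckDB’s prepared statements offer a similar parameterization mechanism (a minimal sketch; the statement name categorize and the threshold value are illustrative):
-- parameterize the town/city threshold with a prepared statement
prepare categorize as
    select case when $1 <= 200000 then 'town' else 'city' end as ct_cat;
execute categorize(162273);
-- returns 'town'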