DuckDB lambda functions (HOFs)
Modern relational database management systems (RDBMS) support complex data types such as arrays, maps, and structs, and file formats like Parquet and ORC handle these types efficiently. Leveraging these features opens up new possibilities for modeling and processing data. Many vendors also support higher-order functions (HOFs), also known as lambda functions, which allow efficient data processing within array/list column types. This trend reflects the growing demand for flexible data modeling and advanced processing capabilities in relational systems (among them: DuckDB, Snowflake, ksqlDB, and Apache Spark).
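As a quick taste of the syntax, here is a minimal lambda applied to a list literal (the arrow notation x -> ... is DuckDB's lambda syntax):
-- double every element of a list with a lambda
select list_transform([1, 2, 3], x -> x * 2) as doubled;
-- returns [2, 4, 6]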
To illustrate the expressive power of these SQL extensions, I will use DuckDB (you can run the examples below in either the DuckDB WASM shell or the DuckDB CLI). Data engineering pipelines often contain data conversion and validation steps. In the example below, we have a pipeline that processes CSV data. The file contains three columns: t_key, city_names, and populations. The latter two columns hold city names and their population counts, encoded as comma-separated string values. The SQL view below represents the CSV input:
-- inline view simulating the parsed CSV input
create or replace view t_cities as
select t_key, city_names, populations
from (values ('t_1', 'Marburg,New York,Amsterdam,Heidelberg', 'P78.203,8804190,921402,162273'))
as stats_t(t_key, city_names, populations);
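To sanity-check the input, you can query the view directly:
select * from t_cities;
-- t_1 | Marburg,New York,Amsterdam,Heidelberg | P78.203,8804190,921402,162273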
Note that the populations string contains a value with an unexpected format: P78.203.
The resulting table should contain only the valid, converted data in the following schema:
┌─────────┬────────────┬────────────┬─────────┐
│ t_key │ city_name │ population │ ct_cat │
│ varchar │ varchar │ int32 │ varchar │
├─────────┼────────────┼────────────┼─────────┤
│ t_1 │ New York │ 8804190 │ city │
│ t_1 │ Amsterdam │ 921402 │ city │
│ t_1 │ Heidelberg │ 162273 │ town │
└─────────┴────────────┴────────────┴─────────┘
The last column represents the category: ‘town’ if the population is at most 200,000 and ‘city’ otherwise.
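The categorization itself is a plain CASE expression; for example:
-- 162273 falls at or below the 200,000 threshold
select case when 162273 <= 200000 then 'town' else 'city' end as ct_cat;
-- returns 'town'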
First, we split each comma-separated string and apply the list_zip function (with the functional signature [a] -> [b] -> [(a, b)]) to merge both lists.
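In isolation, list_zip pairs up the elements of two lists (a minimal sketch):
-- zip two list literals into a list of pairs
select list_zip([1, 2], ['a', 'b']) as zipped;
-- returns [(1, a), (2, b)]
Applied to our view, the resulting list looks like this: [(Marburg, P78.203), (New York, 8804190), ...].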
select
    t_key,
    list_zip(string_split(city_names, ','), string_split(populations, ',')) as merged_cities
from t_cities;
Result record: "t_1, [(Marburg, P78.203), (New York, 8804190), (Amsterdam, 921402), (Heidelberg, 162273)]"
Now, let’s modify the query above by adding the list_transform function for format conversion. We’ll use the try_cast function, which returns NULL instead of raising an error when the conversion fails.
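In isolation (a minimal sketch):
-- a cast that cannot succeed yields NULL rather than an error
select try_cast('P78.203' as int) as p;
-- returns NULL
Wired into the pipeline: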
select
    t_key,
    list_zip(
        string_split(city_names, ','),
        list_transform(string_split(populations, ','), x -> try_cast(x as int))
    ) as merged_cities
from t_cities;
Result record: "t_1, [(Marburg, NULL), (New York, 8804190), (Amsterdam, 921402), (Heidelberg, 162273)]"
Next, we can add another transform (map) step to create a named struct with three fields: c_n for name, c_p for population, and c_c for category.
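DuckDB’s struct literal syntax {key: value, ...} makes this concise; in isolation (a minimal sketch with illustrative values):
-- build a named struct from literals
select {c_n: 'Berlin', c_p: 3664088, c_c: 'city'} as city_struct;
-- returns {'c_n': Berlin, 'c_p': 3664088, 'c_c': city}
In the pipeline, each zipped pair becomes such a struct: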
select
    t_key,
    list_transform(
        list_zip(
            string_split(city_names, ','),
            list_transform(string_split(populations, ','), x -> try_cast(x as int))
        ),
        x -> {
            c_n: x[1],
            c_p: x[2],
            c_c: case when x[2] <= 200000 then 'town' else 'city' end
        }
    ) as merged_cities
from t_cities;
Result record: "t_1, [{'c_n': Marburg, 'c_p': NULL, 'c_c': city}, {'c_n': New York, 'c_p': 8804190, 'c_c': city}, {'c_n': Amsterd… "
However, the entry for Marburg still appears, with a NULL population and the wrong category: NULL <= 200000 evaluates to NULL, so the CASE expression falls through to the else branch and yields ‘city’. To address this, we can add the list_filter function to drop any pair with a “malformed” (NULL) population before the structs are built.
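The NULL fall-through is easy to verify in isolation (a minimal sketch):
-- NULL comparisons yield NULL, so the CASE takes the else branch
select case when NULL <= 200000 then 'town' else 'city' end as ct_cat;
-- returns 'city'
With the filter in place: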
select
    t_key,
    list_transform(
        list_filter(
            list_zip(
                string_split(city_names, ','),
                list_transform(string_split(populations, ','), x -> try_cast(x as int))
            ),
            x -> x[2] is not null  -- keep only pairs whose population parsed successfully
        ),
        x -> {
            c_n: x[1],
            c_p: x[2],
            c_c: case when x[2] <= 200000 then 'town' else 'city' end
        }
    ) as merged_cities
from t_cities;
We obtain the final result by applying unnest to “flatten” (explode) the list into one row per struct and then accessing the struct fields.
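In isolation, unnest turns a list into rows (a minimal sketch):
-- explode a list into one row per element
select unnest([10, 20, 30]) as n;
-- returns three rows: 10, 20, 30
The following query puts everything together: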
select t_key,
       city_struct.c_n as city_name,
       city_struct.c_p as population,
       city_struct.c_c as ct_cat
from (
    select
        t_key,
        unnest(  -- explode the list: one row per struct
            list_transform(
                list_filter(
                    list_zip(
                        string_split(city_names, ','),
                        list_transform(string_split(populations, ','), x -> try_cast(x as int))
                    ),
                    x -> x[2] is not null  -- drop pairs with malformed populations
                ),
                x -> {
                    c_n: x[1],
                    c_p: x[2],
                    c_c: case when x[2] <= 200000 then 'town' else 'city' end
                }
            )
        ) as city_struct
    from t_cities
);
The resulting table looks like this:
┌─────────┬────────────┬────────────┬─────────┐
│ t_key │ city_name │ population │ ct_cat │
│ varchar │ varchar │ int32 │ varchar │
├─────────┼────────────┼────────────┼─────────┤
│ t_1 │ New York │ 8804190 │ city │
│ t_1 │ Amsterdam │ 921402 │ city │
│ t_1 │ Heidelberg │ 162273 │ town │
└─────────┴────────────┴────────────┴─────────┘
Because DuckDB runs embedded in your host language, you can template and parameterize SQL pipelines like this one directly in application code. Give it a try!
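If you prefer to stay in SQL, DuckDB’s prepared statements offer a similar parameterization mechanism (a minimal sketch; the statement name categorize and the threshold value are illustrative):
-- parameterize the town/city threshold with a prepared statement
prepare categorize as
    select case when $1 <= 200000 then 'town' else 'city' end as ct_cat;
execute categorize(162273);
-- returns 'town'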