Rust for Data Engineering
Capturing data is a fundamental part of today's data engineering. We do it with programs and pipelines deployed on technologies ranging from #DataFactory to more complex data warehouse analytics platforms and lakehouses, such as #Databricks.
Today we call this - data ingestion.
Layers and layers of ingested data to be filtered, organized and saved for analysis. As more data becomes available, we need faster and more scalable ingestion processes to keep up. Pipelines get slower, take longer to run and cost more. We quickly turn to coding to fix these things, only to end up with more complex pipelines and less accurate data. What's wrong?
Capturing the RIGHT data is complex and requires more thinking than fetching and building big lakes of data. Since 2007, when I started Data Recorders and crossed paths with Prof. Neil Gunther, this has been one of my goals: find ways to filter and capture the right data to solve specific problems. Data recorders have ingestion capabilities built in: they can capture, filter and prepare the output for various uses. But that's not enough. You need more.
You need to think, plan and build a strategy for getting the right data. Three main things you need:
► Plan and write down WHAT needs to be done
► Think and develop the logic or the algorithm HOW to do that
► Then DO IT, using the right programming language
Enter #Rust.
Today's data engineering world is built on Python and Java/Scala. Everyone is consuming these. But as soon as your data pipelines need to process more data and use more computing resources, you will need to rethink. You will need to start fresh from the logic level: build and write a technical spec of your pipeline, then select the best programming language to scale up.
That's how Rust meets data engineering. Rust is a well-known systems programming language, popular for its performance, safety, and concurrency features. On these merits, Rust is the right language for many data engineering tasks, such as data ingestion and transformation.
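To make the ingestion-and-transformation point concrete, here is a minimal sketch in plain Rust (no external crates; the record layout and the temperature filter are made-up examples, not anything from a real pipeline). It shows the "capture the right data" idea: validate and filter records in a single pass while ingesting, instead of landing everything and cleaning up later.

```rust
// Minimal sketch of a filter-while-ingesting step in plain Rust.
// The "id,temperature" record layout is a hypothetical example.

#[derive(Debug, PartialEq)]
struct Reading {
    id: u32,
    temperature: f64,
}

/// Parse one "id,temperature" line; malformed lines are skipped (None).
fn parse_line(line: &str) -> Option<Reading> {
    let mut parts = line.split(',');
    let id = parts.next()?.trim().parse().ok()?;
    let temperature = parts.next()?.trim().parse().ok()?;
    Some(Reading { id, temperature })
}

fn main() {
    // Stand-in for a real source (file, socket, message queue).
    let raw = "1, 21.5\n2, broken\n3, 35.2\n4, 19.9";

    // Ingest, validate and filter in one pass: keep readings above 20.0.
    let hot: Vec<Reading> = raw
        .lines()
        .filter_map(parse_line)           // drop malformed rows instead of failing
        .filter(|r| r.temperature > 20.0) // capture only the RIGHT data
        .collect();

    for r in &hot {
        println!("{},{}", r.id, r.temperature);
    }
}
```

Because iterators are lazy and allocate nothing extra per step, the same shape scales from a test string to a streamed multi-gigabyte source.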
You don't need to change your cluster type on the Databricks platform to handle more data, and pay more; you need a smarter, more efficient ingestion pipeline, written in Rust, to achieve that.
Why Rust for Data Engineering?
Rust has excellent capabilities for processing large amounts of data thanks to its efficient memory management. Combined with its performance and concurrency features, that makes Rust a very good candidate for writing fast, reliable ingestion and transformation pipelines.
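As a small illustration of the concurrency claim, the sketch below (std-only; the sum-of-squares transformation and the four-worker split are assumptions for the example) fans work out across threads with `std::thread` and collects partial results over an `mpsc` channel. Ownership rules guarantee at compile time that the workers share no mutable state.

```rust
use std::sync::mpsc;
use std::thread;

/// Sum of squares for one chunk -- a stand-in for a real transformation.
fn transform(chunk: &[i64]) -> i64 {
    chunk.iter().map(|x| x * x).sum()
}

fn main() {
    let data: Vec<i64> = (1..=1000).collect();
    let n_workers = 4;
    let chunk_size = (data.len() + n_workers - 1) / n_workers;

    let (tx, rx) = mpsc::channel();

    // Fan out: each worker owns its chunk, so no locks are needed.
    let mut handles = Vec::new();
    for chunk in data.chunks(chunk_size) {
        let tx = tx.clone();
        let owned: Vec<i64> = chunk.to_vec();
        handles.push(thread::spawn(move || {
            tx.send(transform(&owned)).expect("receiver alive");
        }));
    }
    drop(tx); // close the sending side so the receiving loop terminates

    // Fan in: aggregate partial results as they arrive.
    let total: i64 = rx.iter().sum();
    for h in handles {
        h.join().expect("worker panicked");
    }

    println!("sum of squares = {total}");
}
```

The same fan-out/fan-in shape applies to real ingestion work: replace `transform` with parsing, filtering, or encoding, and the compiler still rules out data races.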
There is already a number of ready-made crates (libraries) for data engineering, such as Polars, DataFusion, Delta Lake, and Parquet.
Rust is here to stay for data engineering tasks, alongside Python and Java. I look forward to seeing how quickly companies like #Databricks and others will support Rust on their platforms.
For the rest of us: start with your technical spec, write down your logic and algorithm before coding anything. And when you're ready, give #Rust a try.
Merry Xmas
Espoo 2025 Dec 20