Rust for Data Engineering


Capturing data is a fundamental part of today's data engineering. We do it with programs and pipelines deployed on technologies ranging from #DataFactory to more complex data warehouse analytics platforms and lakehouses, such as #Databricks.

Today we call this data ingestion.

Layers and layers of ingested data to be filtered, organized and saved for analysis. As more data becomes available, we need faster and more scalable ingestion processes to keep up. Pipelines get slower, take longer to run and cost more. We quickly turn to coding to fix these things, only to end up with more complex pipelines and less accurate data. What's wrong?

Capturing the RIGHT data is complex and requires more thinking than fetching and building big lakes of data. Since 2007, when I started Data Recorders and crossed paths with Prof. Neil Gunther, this has been one of my goals: finding ways to filter and capture the right data to solve specific problems. Data recorders have ingestion capabilities built in: they can capture, filter and prepare the output for various purposes. But that's not enough. You need more.

You need to think, plan and build a strategy for getting the right data. Three main things you need:

► Plan and write down WHAT needs to be done

► Think and develop the logic or algorithm for HOW to do it

► Then DO IT, using the right programming language

Enter #Rust.

Today's Data Engineering world is built on Python and Java/Scala. Everyone is consuming these. But as soon as your data pipelines need to process more data and consume more computing resources, you will need to rethink. You will need to start fresh from the logic level, write a technical spec of your pipeline, and select the best programming language to scale up.

That's how Rust meets Data Engineering. Rust is a well-known systems programming language, popular for its performance, safety, and concurrency features. On these merits, Rust is the right programming language for many data engineering tasks, such as data ingestion and transformation.

You don't need to change your cluster type on the Databricks platform to handle more data, and pay more; you need a smarter, more efficient ingestion pipeline, written in Rust, to achieve that.

Rust advantages

  • Memory Safety: Rust’s ownership system ensures that memory errors are caught at compile time, eliminating whole classes of runtime crashes. There is no need for a garbage collector.
  • Concurrency: With its lightweight concurrency model and strict compile-time checks, Rust makes it easier to write concurrent programs that are both safe and efficient, a critical need in data-intensive applications.
  • Performance: Rust’s performance is comparable to C’s, making it suitable for high-throughput data processing tasks while using fewer computing resources.
  • Excellent Build and Package Manager: Cargo is Rust’s integrated package manager and build system, handling dependency management, compilation, testing, documentation generation, and package publishing. It provides everything you need for the complete software development cycle.
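The safety and concurrency points above can be sketched in a few lines of plain std Rust. This is a minimal illustration, not production pipeline code: scoped threads each borrow a disjoint chunk of the data, and the borrow checker proves at compile time that no chunk can be freed or mutated while another thread reads it.

```rust
use std::thread;

// Sum a slice in parallel. Each scoped thread borrows its own chunk;
// the compiler guarantees the slice outlives every thread, so there
// is no data race and no garbage collector involved.
fn parallel_sum(data: &[i64], n_threads: usize) -> i64 {
    let chunk_len = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<i64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<i64> = (1..=1_000).collect();
    println!("sum = {}", parallel_sum(&data, 4)); // prints sum = 500500
}
```

If a thread tried to hold on to its chunk past the scope, or two threads tried to mutate the same chunk, the program would simply not compile; that is the compile-time safety the bullet points describe.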

Why Rust for Data Engineering?

Rust has excellent capabilities for processing large amounts of data thanks to its efficient memory management. Combined with its performance and concurrency features, this makes Rust a very good candidate for writing:

  • Real-time data pipelines
  • Complex transformation and business logic pipelines
  • Output capabilities for data lakes or data lakehouses
  • Very efficient pipelines that consume less CPU and memory
  • Easily maintainable data pipelines, managed with Cargo

There are a number of ready-made crates (libraries) for data engineering, such as Polars, DataFusion, Delta Lake, and Parquet.
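Pulling these crates into a project is a one-line Cargo operation each. A sketch of a Cargo.toml for such a pipeline might look like the following; the version numbers and feature flags are illustrative placeholders, so check crates.io for the current releases before copying:

```toml
[package]
name = "ingestion-pipeline"   # hypothetical project name
version = "0.1.0"
edition = "2021"

[dependencies]
# DataFrame engine with a lazy query API
polars = { version = "0.41", features = ["lazy", "csv"] }
# SQL query engine
datafusion = "42"
# Delta Lake table read/write
deltalake = "0.21"
# Low-level Parquet file support
parquet = "53"
```

This is exactly the Cargo advantage from the list above: dependencies, builds, tests, and docs all flow from this one file.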

Rust is here to stay for data engineering tasks, alongside Python and Java. I look forward to seeing how quickly companies like #Databricks and others will support Rust on their platforms.

For the rest of us: start with your technical specs, and write down your logic and algorithm before coding anything. When ready, give #Rust a try.

Merry Xmas

Espoo 2025 Dec 20
