Testing Spark processing software
For three years, I have worked on a team building Spark software in Scala. Our processes essentially take Parquet files as input and produce Parquet files as output. Some of the data are not tabular (they contain arrays of objects, for instance).
As conscientious software engineers (and data engineers, for some members of the team), we have written a lot of tests for our Spark processes.
But how should we build test datasets? Let's enumerate the ways to handle data for tests.
How to create test datasets
Please note that, as our data are not tabular, we have excluded loading from CSV files.
Format of input files
We use the same file format as the input of our process. If our process takes Parquet files as input, we load Parquet test files and run our process against them. We use Parquet, but Avro and ORC files fall into the same case.
It seems natural to use the same file format as our production input: we can test the whole workflow. We can create test files with the Spark shell (or Zeppelin) by extracting samples from existing input files.
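For illustration, here is a minimal sketch of this workflow, assuming an existing SparkSession named spark; the paths, the sample size and the `MyProcess.run` function are hypothetical:

```scala
// In the Spark shell (or Zeppelin): extract a small sample from a production file.
spark.read.parquet("/data/prod/events.parquet")
  .limit(100)
  .write
  .parquet("src/test/resources/events_sample.parquet")

// In a test: load the sample and run the process under test on it.
val input = spark.read.parquet("src/test/resources/events_sample.parquet")
val output = MyProcess.run(input) // hypothetical function under test
assert(output.count() > 0)
```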
However, optimized file formats are really not designed for readability, so our tests are not readable either. It is also a bit tricky to create files for intermediate functions.
As test data and test code are separated, we also have to keep the matching between files and tests in sync.
Pros:
- We can test the software exactly as it is used in production.
- We can extract samples of input files to use as test files.
Cons:
- Parquet, Avro and ORC files are not human readable.
- It is hard to create datasets for intermediate processing steps.
- Data and tests are kept separate.
- Not really Git friendly (i.e. harder to track differences).
JSON
JSON (JavaScript Object Notation) is a human-readable format well known to software engineers. We use the Spark API to load the test files. The test files are readable, so our tests are more readable too. We can create specific files for all our needs.
We have experienced some strange conversions when loading JSON files: a field we think of as an integer (5, for instance) is typed by Spark as a bigint, even when the schema declares an integer. We had to create a specific UDF to convert these bigints into integers.
As test data and test code are separated, we also have to keep the matching between files and tests in sync.
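As an illustration, here is a sketch of loading such a file and working around the integer issue, assuming an existing SparkSession named spark; the file, its content and the column names are hypothetical:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Hypothetical test file src/test/resources/case1.json:
// {"id": 5, "items": [{"name": "a"}, {"name": "b"}]}
val raw = spark.read.json("src/test/resources/case1.json")
// Spark infers "id" as a bigint (LongType) even though our production schema expects an integer,
// so we cast it back before comparing with the expected output.
val fixed = raw.withColumn("id", col("id").cast(IntegerType))
```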
Pros:
- Readable files.
- Compatible with Git.
- Flexible format for all our tests.
Cons:
- Data and tests are kept separate.
Warnings:
- Some strange type conversions.
Code
We can also create test datasets using the Spark API. Spark RDDs and DataFrames can be created from rows, but we have to define the schemas ourselves. Spark Datasets can be created from lists of objects. Complex structures can be hard to create, and unfortunately we have to write a lot of code: it is readable but also time consuming.
Please note that with Spark Datasets, we create plain objects and, unlike with all the other solutions, type errors are detected at compile time.
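Here is a sketch of both approaches, assuming an existing SparkSession named spark; the schema and the `User` case class are hypothetical:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// DataFrame: rows plus an explicit schema; type errors only show up at runtime.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, "alice"), Row(2, "bob"))),
  schema
)

// Dataset: typed objects; type errors are caught at compile time.
case class User(id: Int, name: String)
import spark.implicits._
val ds = Seq(User(1, "alice"), User(2, "bob")).toDS()
```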
Pros:
- Readable tests.
- Compatible with Git.
- Type errors are detected at compile time (Spark Datasets only).
Cons:
- We have to write a lot of code.
- With Spark RDDs and DataFrames, it is hard to create complex structures.
What have we done?
We have three ways to create test datasets. Fortunately, we also have different types of tests:
- unit tests: numerous, small, fast and isolated tests;
- process tests: tests of a whole processing task;
- end-to-end tests: rare, big tests of linked processes.
Nowadays, our processes use a lot of DataFrames. We have written a lot of unit tests using JSON (and helper functions to handle the type problems). The more we refactor (and use Spark Datasets), the more we will create test datasets in code.
What should we do?
Based on what I have experimented with and learned, I suggest using:
- JSON files for the largest part of unit tests and process tests.
- Code for unit tests and process tests of functions using Spark Datasets.
- The format of the input files for some process tests and for end-to-end tests.
I hope you enjoyed reading this and picked up some takeaways.