Testing Spark processing software
For three years, I have worked on a team building Spark software in Scala. Our processes essentially take Parquet files as input and produce Parquet files as output. Some of the data are not tabular (they contain arrays of objects, for instance).
As conscientious software engineers (and data engineers, for some members of the team), we have written a lot of tests for our Spark processes.
But how should we build test datasets? Let's enumerate the ways to handle data for tests.
How to create test datasets
Please note that, as our data are not tabular, we have excluded loading from CSV files.
Format of input files
We use the same file format as the input of our process. If our process takes Parquet files as input, we load Parquet test files and run our process against them. We use Parquet, but Avro and ORC files fall into the same case.
It seems natural to use the same file format as our production input: we can test the whole workflow. We can create test files with the Spark shell (or Zeppelin) by extracting samples from existing input files.
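For illustration, here is a minimal sketch of this workflow, assuming an existing SparkSession named spark; the paths, the sample size and the `MyProcess.run` function are hypothetical:

```scala
// In the Spark shell (or Zeppelin): extract a small sample from a production file.
spark.read.parquet("/data/prod/events.parquet")
  .limit(100)
  .write
  .parquet("src/test/resources/events_sample.parquet")

// In a test: load the sample and run the process under test on it.
val input = spark.read.parquet("src/test/resources/events_sample.parquet")
val output = MyProcess.run(input) // hypothetical function under test
assert(output.count() > 0)
```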
However, optimized file formats are really not designed for readability, so our tests are not readable either. It is also a bit tricky to create files for intermediate functions.
As test data and test code are separated, we also have to keep the matching between files and tests in sync.
Pros:
- We can test the software exactly as it is used in production.
- We can extract samples of input files to use as test files.
Cons:
- Parquet, Avro and ORC files are not human readable.
- It is hard to create datasets for intermediate processing steps.
- Data and tests are kept separate.
- Not really Git friendly (i.e. harder to track differences).
JSON
JSON (JavaScript Object Notation) is a human-readable format well known to software engineers. We use the Spark API to load the test files. The test files are readable, so our tests are more readable too. We can create specific files for all our needs.
We have experienced some strange conversions when loading JSON files: a field we think of as an integer (5, for instance) is typed by Spark as a bigint, even when the schema declares an integer. We had to create a specific UDF to convert these bigints into integers.
As test data and test code are separated, we also have to keep the matching between files and tests in sync.
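As an illustration, here is a sketch of loading such a file and working around the integer issue, assuming an existing SparkSession named spark; the file, its content and the column names are hypothetical:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Hypothetical test file src/test/resources/case1.json:
// {"id": 5, "items": [{"name": "a"}, {"name": "b"}]}
val raw = spark.read.json("src/test/resources/case1.json")
// Spark infers "id" as a bigint (LongType) even though our production schema expects an integer,
// so we cast it back before comparing with the expected output.
val fixed = raw.withColumn("id", col("id").cast(IntegerType))
```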
Pros:
- Readable files.
- Compatible with Git.
- Flexible format for all our tests.
Cons:
- Data and tests are kept separate.
Warnings:
- Some strange type conversions.
Code
We can also create test datasets using the Spark API. Spark RDDs and DataFrames can be created from rows, but we have to define the schemas ourselves. Spark Datasets can be created from lists of objects. Complex structures can be hard to create, and unfortunately we have to write a lot of code: it is readable but also time consuming.
Please note that with Spark Datasets, we create plain objects and, unlike with all the other solutions, type errors are detected at compile time.
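Here is a sketch of both approaches, assuming an existing SparkSession named spark; the schema and the `User` case class are hypothetical:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// DataFrame: rows plus an explicit schema; type errors only show up at runtime.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, "alice"), Row(2, "bob"))),
  schema
)

// Dataset: typed objects; type errors are caught at compile time.
case class User(id: Int, name: String)
import spark.implicits._
val ds = Seq(User(1, "alice"), User(2, "bob")).toDS()
```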
Pros:
- Readable tests.
- Compatible with Git.
- Type errors are detected at compile time (Spark Datasets only).
Cons:
- We have to write a lot of code.
- With Spark RDDs and DataFrames, it is hard to create complex structures.
What have we done?
We have three ways to create test datasets. Fortunately, we also have different types of tests:
- unit tests: numerous, small, fast and isolated tests;
- process tests: tests of a whole processing task;
- end-to-end tests: rare, big tests of linked processes.
Nowadays, our processes use a lot of DataFrames. We have written a lot of unit tests using JSON (and helper functions to handle the type problems). The more we refactor (and use Spark Datasets), the more we will create test datasets in code.
What should we do?
Based on what I have experimented with and learned, I suggest using:
- JSON files for the largest part of unit tests and process tests.
- Code for unit tests and process tests of functions using Spark Datasets.
- The format of the input files for some process tests and for end-to-end tests.
I hope you enjoyed reading this and picked up some takeaways.