“Big Data” Path to Production

Quality Assurance & Ops Strategy for Spark-based big data pipelines

Overview

This article covers the phase of Big Data adoption that bridges the gap between a big data analytics prototype and a quality-assured, production-grade solution that consistently fulfils real business needs. This phase is commonly called "QA or Testing".

To make exponentially growing data usable, Big Data ELT pipelines are becoming a basic need in all major organizations. To make use of the data within, large enterprises such as banks and financial institutions are adopting newer technologies like Hadoop and Spark.

The journey starts with a bunch of POCs to pick the right technology, e.g. Hadoop + Spark or a NoSQL solution. Most organizations do well in the prototype phase and are able to showcase a working prototype in a prototype environment, mostly on VMs.

Some organizations also bring in expertise from companies like Cloudera, Hortonworks or MapR during this phase.

The real story starts when organizations start to think PRODUCTION. Many concerns begin to haunt the architects, developers and stakeholders on this journey (apart from up-skilling people), e.g.

  1.    Security - how the information held in these stores can be kept protected.
  2.    Production stability - fault tolerance and DR strategies, production support, monitoring, etc.
  3.    Capacity planning - although the whole concept of horizontally scalable stores lets you start small and scale gradually.
  4.    The Quality Assurance strategy - which includes unit, functional, integration and regression testing for big data pipelines.

The list is long; the items above are some of the top priorities on the so-called "Path to Production" for a Spark + Hadoop data pipeline project. This article focuses on the testing strategy for big data pipelines.

Spark data pipeline and testing process

Generally, the big data pipeline test process is an amalgamation of Agile ideas, Continuous Integration practices, automation and a testing toolset. The diagram below depicts the entire process.

Self-Service "Great powers brings greater responsibilities" - Data Wrangling Workbench

In large enterprises, the Business Analysts and Data Analysts are the domain experts who understand how to bring real value out of the immense data within their organizations. A normal SDLC loop of Requirements -> Dev Build -> Test -> Deploy does not work that well here:

  • Because of the variety of data and its processing
  • Because of the variety of interpretations of the data within a large organization, as the same data adds different value to different departments
  • Because the entire process is too slow and leads to an impedance mismatch between Business/Data Analysts (domain experts), developers (technologists) and QA.

The need of the hour is to give domain experts more power to play with the data and create data wrangling pipelines by themselves. This helps them massage the data and make it ready for multiple use cases like prediction calculations, analytic reporting, building Enterprise Data Models, etc. As these domain experts are not experts in technology, especially Big Data technology, this has led to the emergence of many platforms providing "Self-Service Data Pipeline" capabilities.

Publicis.Sapient has built multiple accelerators to provide self-service access to powerful frameworks like Apache Spark, improving the usability of the technology for non-technical domain experts.

However, one has to note that being self-service reduces intervention by developers (unit testing) and QAs. That is good in terms of turnaround time, but it makes the quality of the outcome a bit questionable, as the solution does not go through the usual multi-eye checks of developers and QAs.

Self-service platforms like the ones developed at Publicis.Sapient give domain experts a lot of power to implement a pipeline without needing a developer. Nevertheless, since "with great power comes great responsibility", the domain experts have to be responsible enough to work with QAs to create enough functional test scenarios so that the pipeline can be tested automatically.

Having a self-service tool is not enough; building a Quality Assurance process around it is necessary to make it a production-grade product. Moreover, the data itself helps you test the data wrangling, and hence we call this "Data Driven Quality Assurance".

The diagram below depicts the lifecycle of a Self-Service Data Wrangling Platform from inception to production. Note: the tools below are named just to make things more concrete; feel free to replace them with the standards followed in your organization.

CI + Testing Life Cycle of Spark Transformation Data Pipeline

Approach & Considerations - Data Driven Quality Assurance

For a self-service data wrangling platform to become a useful production-grade solution, below are a few key considerations one must keep in mind to ensure the quality of the solution:

  • Continuous integration is a must for any such pipeline.
  • Developers writing the core framework must write unit tests for it, enforced via build breakers.
  • There must be a capability to mock Big Data ecosystem tools such as HBase, HDFS and Elasticsearch for local testing (see the sketch after this list).
  • DO NOT over-engineer and make the developer's environment heavy.
  • To make it domain-expert friendly, the implementation of tests has to be non-technical and English-like.
  • Reporting has to be very crisp and frequent, providing early feedback.
  • There must be a capability to test things in distributed mode to surface serialisation and parallelism issues.
  • As most of these projects happen to be migration projects or enhancements to an existing system, reconciliation and break reporting have to be a key aspect of the overall strategy.
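
As an illustration of the mocking consideration above, below is a minimal sketch (in Java, assuming the hadoop-hdfs test artifact is on the classpath) of spinning up an in-process HDFS with MiniDFSCluster for local testing; the /raw/trades path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class LocalHdfsTestHarness {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start a single-node, in-process HDFS for the duration of the test.
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
                .numDataNodes(1)
                .build();
        try {
            FileSystem fs = cluster.getFileSystem();
            fs.mkdirs(new Path("/raw/trades")); // hypothetical raw-zone path
            // Point the pipeline under test at this file system instead of the
            // real cluster, run it, then assert on the files it produces.
            System.out.println("Mini HDFS running at " + fs.getUri());
        } finally {
            cluster.shutdown();
        }
    }
}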

With all these considerations, it is evident that we need to rely on a Data Driven Testing approach, which lies more in the functional testing arena (see Wikipedia: Functional Testing).

The idea is to make the test data and expected output as rich as possible to cater for complex or edge conditions, and to build a testing suite which can take any test data and match it against any expected data.
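
A minimal sketch of that matching idea, assuming Spark SQL (2.4+ for exceptAll) and illustrative CSV paths for the pipeline output and the BA/QA-curated expected data:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataDrivenCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("data-driven-check")
                .master("local[*]")
                .getOrCreate();

        // Output produced by the pipeline under test (illustrative path).
        Dataset<Row> actual = spark.read().option("header", "true")
                .csv("src/test/resources/output/trades.csv");
        // Expected output curated by BA/QA (illustrative path).
        Dataset<Row> expected = spark.read().option("header", "true")
                .csv("src/test/resources/expected/trades.csv");

        // Rows present on one side but missing on the other are the "breaks".
        long missing = expected.exceptAll(actual).count();
        long unexpected = actual.exceptAll(expected).count();

        if (missing == 0 && unexpected == 0) {
            System.out.println("Datasets reconcile: PASS");
        } else {
            System.out.printf("Breaks found: %d missing, %d unexpected%n", missing, unexpected);
        }
        spark.stop();
    }
}

The same suite can be pointed at any pair of test-data and expected-data files, which is what makes the approach data driven.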

Tools, Technologies & Responsibilities

This section gives an idea of the tools and technologies one can use in this arena. However, as there are multiple tool options available in the market which solve similar problems across the process, feel free to plug and play similar tools/technologies as approved in your organization.

1.    Jenkins sits at the backbone of the CI build pipeline, orchestrating the entire build and test lifecycle.

2.    JUnit is something developers religiously use to write test cases for the framework code. 

3.    To mock parts of the Big Data ecosystem, one can make use of frameworks like MiniDFSCluster, embedded Kafka and embedded Elasticsearch.

4.    Cucumber provides an easy way for Domain Experts and QA to write functional test cases - the heart of testing here. It uses "Gherkin" as its language, giving the capability to define scenarios in something close to plain English (see the sketch after this list).

5.    BA and QA work extensively to write test scenarios in Cucumber. More details here: Cucumber HowTo, written by one of my team mates.

6.    BA and QA have to be very diligent in creating Data Driven Tests.

7.    Maven provides a great integration of Java code and Cucumber tests.

8.    All reports are produced using the Maven Cucumber plugin and published to SonarQube for central reporting.

9.    Ansible is the deployment tool used to deploy to all environments, from integration test through to production.

10. The integration test suite is built on top of Cucumber and Spark SQL.

11. Excel-based automation generates mock test data.

12. Integration tests run on a full-scale environment, multiple times a day, in a fully automated fashion.

13. Ansible auto-deploys code to the integration test environment and triggers the integration test suite frequently throughout the day.
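
To make the Cucumber point above concrete, below is a minimal sketch of a step-definition class using cucumber-java and JUnit assertions; the scenario wording, file name and record count are illustrative assumptions rather than the project's actual feature files.

// Backed by a Gherkin scenario such as:
//   Scenario: Curate ingested trades
//     Given the raw zone contains the file "trades_2019.csv"
//     When the curation pipeline runs
//     Then the curated zone should contain 1000 trade records
import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;
import static org.junit.Assert.assertEquals;

public class CurationSteps {
    private long curatedCount;

    @Given("the raw zone contains the file {string}")
    public void rawZoneContainsFile(String fileName) {
        // Copy the named test file into the (mocked or real) raw zone.
    }

    @When("the curation pipeline runs")
    public void runCurationPipeline() {
        // Invoke the Spark job under test and capture the curated record count.
        curatedCount = 1000; // placeholder for the real pipeline result
    }

    @Then("the curated zone should contain {int} trade records")
    public void curatedZoneShouldContain(Integer expectedCount) {
        assertEquals(expectedCount.longValue(), curatedCount);
    }
}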

Data Driven QA Endpoints

The diagram below represents the various end-points where testing & reconciliation are performed.

Testing End-points

Testing Ingestion - Layering with Raw and Curated Zone

Developers used embedded Kafka and embedded Elasticsearch to write unit test cases for the ingestion of data, providing acceptable code coverage.

The data was segregated into Raw and Curated zones; all ingested data lands AS-IS in the Raw zone before it is picked up, transformed and pushed into the Curated zone.

Automated integration tests had hooks at various endpoints from the source to the Raw zone to test the arrival of messages/files. Various checks were performed, such as count, size and CRC checks, to validate the messages/files ingested (see the sketch below).
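
A minimal sketch of such a count/size/CRC validation on a landed file, using plain Java and CRC32; the landing and raw-zone paths are illustrative assumptions.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.CRC32;

public class IngestionChecks {
    public static void main(String[] args) throws Exception {
        Path source = Paths.get("landing/trades_2019.csv"); // file as sent by the source system
        Path rawCopy = Paths.get("raw/trades_2019.csv");    // file as landed in the Raw zone

        boolean sizeMatches = Files.size(source) == Files.size(rawCopy);
        boolean crcMatches = crc(source) == crc(rawCopy);

        System.out.println("size ok=" + sizeMatches + ", crc ok=" + crcMatches);
    }

    private static long crc(Path path) throws Exception {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(path)); // fine for test-sized files
        return crc.getValue();
    }
}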


Data Driven Testing - Sapient Big Data Reconciliation Accelerator @ Core

Most big data projects in large enterprises start as migrations or enhancements of an existing process. This inherently requires validating the data before and after the release, which can only be achieved by reconciling the data before and after, or legacy versus strategic.

The reconciliation process includes not only comparison of the data, but also lets you easily analyse the breaks and resolve them through flexible data reporting and querying capabilities.

Normally we end up performing sampling-based tests on a sample of the data, because there are not many tools in the market which can do large-volume reconciliation efficiently.

Using Sapient's Big Data Recon Accelerator, we were able to perform full-blown data testing. As the Recon Accelerator is built on scalable technology like Apache Spark itself, we no longer rely on sample testing and can be 100% sure of the quality of the data.

1.    Data Profile Reconciliation: This is a summary reconciliation test, where one tests the completeness of the data rather than its accuracy. This is done by reconciling and asking the following questions:

1.    Is the number of records in the input and output the same?

2.    Or, say, is the number of trades in the input the same as the outputs + errors?

3.    For ingestion, reconcile the CRC or the file size, etc.

4.    Or reconcile that the total Position amount by Book is the same in the input and output (see the roll-up sketch at the end of this section).

2.    Full Data Reconciliation: This is more of an accuracy and fine-grained test, and it requires much more horsepower. This level of reconciliation is normally done in two scenarios:

1.    Data Migration Test: When migrating from an old data warehouse to a new Big Data platform, the project reconciles every row and attribute of data between the legacy and the new platform.

2.    Inter-Release Test: Deploying a new release into production requires running multiple cycles of extracting existing data from production and reconciling it against the new release output, comparing every cell of data produced and providing a crisp report on what reconciled well, what did not, and why.

3.    The Sapient Big Data Reconciliation Accelerator, built on the scalable Apache Spark technology, provides connectors to multiple legacy databases and multiple file formats in Hadoop, with flexible, configurable reconciliation and reporting against each attribute. It also provides configurable lookups and roll-up reconciliation and reporting.
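
As an illustration of the roll-up style of profile reconciliation mentioned above (total Position by Book), here is a minimal Spark SQL sketch; the Parquet paths and column names are assumptions, and this is not the accelerator's actual API.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProfileReconciliation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("profile-recon")
                .master("local[*]")
                .getOrCreate();

        // Legacy and strategic position data sets (illustrative paths and columns).
        Dataset<Row> legacy = spark.read().parquet("recon/legacy/positions");
        Dataset<Row> strategic = spark.read().parquet("recon/strategic/positions");

        Dataset<Row> legacyByBook = legacy.groupBy("book")
                .agg(sum("position_amount").alias("legacy_total"));
        Dataset<Row> strategicByBook = strategic.groupBy("book")
                .agg(sum("position_amount").alias("strategic_total"));

        // A full outer join keeps books that exist on only one side; any null
        // or differing total is a break to investigate.
        Dataset<Row> breaks = legacyByBook
                .join(strategicByBook,
                      legacyByBook.col("book").equalTo(strategicByBook.col("book")),
                      "full_outer")
                .filter(col("legacy_total").isNull()
                        .or(col("strategic_total").isNull())
                        .or(col("legacy_total").notEqual(col("strategic_total"))));

        breaks.show(false);
        spark.stop();
    }
}

The same pattern extends naturally to full data reconciliation by joining on the business key and comparing every attribute cell by cell.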
