DevOps in the Data World Series #1
For the past 10 years, DevOps has been a buzzword, and its popularity is still growing. A Google Trends analysis of the term "DevOps" from 2010 to date highlights this rise. So why all the hype? How relevant is it in the world of data? If I'm a "data professional", should I even bother? If these are some of the questions you have asked yourself, you're in the right company. They are certainly questions I asked myself, and this article is my attempt to declutter the DevOps ecosystem from a data delivery perspective.
Let us start with the term "DevOps". Wikipedia defines DevOps as "a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality". Let us consider a small sample workflow with just 4 stages between dev and prod, as shown below.
To understand how DevOps principles can be applied to these stages, we need to discuss two key concepts that lay the foundation for the entire DevOps process:
- CI/CD Pipeline - The string that holds all the different stages of the lifecycle together is the Continuous Integration (CI) / Continuous Deployment (CD) pipeline. Continuous Integration is the process of automating the build and testing of code every time changes are committed to version control. Continuous Deployment, as the name suggests, is the process of automating the configuration and deployment of those changes into production. The whole CI/CD process can be facilitated by tools like Jenkins, CircleCI, TeamCity, Travis, etc. These tools can detect changes in version control (such as code commits or pull requests) and flow them through the entire pipeline, from build right through to the final deployment, in an automated fashion. A conceptual sketch of such a pipeline follows this list.
- Testing Strategy - Testing strategies like Test Driven Development (TDD) are the essential starting point that kick-starts the whole process. A full explanation of these models of development is a topic in itself, but the idea is to start with test cases before writing any code, then write the minimum amount of code needed to get the tests to pass. The code is then continuously refactored in an agile fashion, adding further tests and improving its quality with each iteration. A minimal test-first example also follows this list.
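To make the CI/CD idea concrete, here is a deliberately simplified sketch, in Python, of what the CI/CD tool automates on every detected change: build, test and deploy run in sequence, and the first failure stops the pipeline before anything reaches production. The commands and script names are hypothetical; in practice the stages live in the tool's own pipeline definition (a Jenkinsfile, for example).

```python
import subprocess
import sys

# Hypothetical stage commands; each would normally be a step in the
# CI/CD tool's own pipeline definition rather than a Python script.
STAGES = [
    ("build", ["mvn", "package"]),        # compile and package the code
    ("test", ["pytest", "tests/"]),       # run the automated test suite
    ("deploy", ["./deploy.sh", "prod"]),  # hypothetical deployment wrapper
]

def run_pipeline():
    for name, command in STAGES:
        print(f"Running stage: {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fail fast: a broken stage stops the pipeline before production.
            sys.exit(f"Stage '{name}' failed; aborting pipeline.")
    print("Pipeline completed: change deployed.")

if __name__ == "__main__":
    run_pipeline()
```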
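And as a minimal test-first illustration, the (hypothetical) tests below are written before the function they exercise exists; the implementation is then the least code needed to make them pass, and each subsequent iteration adds further tests and refactors while keeping the suite green.

```python
# test_cleansing.py -- written first, before cleanse_country_code exists
from cleansing import cleanse_country_code

def test_country_code_is_upper_cased_and_trimmed():
    assert cleanse_country_code("  gb ") == "GB"

def test_empty_country_code_maps_to_default():
    assert cleanse_country_code("") == "UNKNOWN"
```

```python
# cleansing.py -- the minimum implementation that makes the tests pass
def cleanse_country_code(raw: str) -> str:
    code = raw.strip().upper()
    return code if code else "UNKNOWN"
```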
We can now examine each of the stages in further detail and look at which tools can be used to support the transition of the manual stages into a DevOps pipeline. The tools listed are by no means comprehensive and are intended only as examples. To bring some focus into the discussion, we will narrow down the huge realm of Big Data tools and languages and handpick a few. Let's consider a sample estate that has different teams, each with its own approach to coding, say in Java, HiveQL, Scala and Python.
- Build - This phase can be skipped for the interpreted languages, Python and HiveQL. For the compiled languages (Java and Scala), however, the build process creates the binaries necessary for executing the tests. This is usually the first step in the pipeline, where the CI/CD tool interacts with build tools like Maven, sbt, Ant or Gradle to create the binaries for the next phase.
- Test Execution - Remember we started with Test Driven Development? That means our test cases are ready for execution as soon as the code or binaries are available. For unit testing frameworks, the obvious choices are JUnit for Java, ScalaTest for Scala and pytest-spark for Python. Then we come to HiveQL. This is usually where data professionals, especially those coming from the world of databases, stutter. But don't worry, there are tools to help us here as well. Rather than a specific tool, I'll point you to the Hive Confluence page on Unit Testing Hive SQL, which should help you out. Now that we've crossed the Hive hurdle, we can briefly discuss performance testing. The approach to performance testing depends on how it is defined within the specific context. For instance, if we are testing the performance of batch jobs on the cluster against volume data, the approach could be to assert on cluster metrics as test cases within any of the unit testing frameworks. In other cases, which need simulation of user requests, we could use specialised tools like Gatling to support the process. A sketch of a Spark unit test in Python follows this list.
- Peer Review - The review could be done either by a peer who validates the code logic, quality and formatting, or by the Ops team, who verify the performance test results before giving the final go-ahead to production. However, when the intention is to automate the entire pipeline, the best approach is to transition the review stage into a monitoring stage, so that it is no longer a bottleneck but still ensures quality within the pipeline. The stage could thus be transformed into a review of the unit or performance tests for coverage. Tools like JCov or scoverage can help accelerate this review. The test results from the execution phase can also be posted to the pull request to demonstrate that it is clean; a sketch of this follows the list.
- Production Deployment - The CI/CD tool can interact with various tools to enable each task within the production deployment. For instance, the CI/CD tool could fetch the deployment components from either version control or an artefact repository (e.g. a Docker registry). Any DDL deployment could be automated using shell script wrappers triggered directly from the CI/CD tool. For scheduling, the CI/CD tool can interact with most schedulers, such as Airflow or Oozie, to set the schedule up; a minimal scheduling example follows the list.
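For the Python/Spark code in our sample estate, a unit test can spin up a local SparkSession on the CI agent and assert on the output of a transformation. The sketch below uses plain pytest and PySpark (pytest-spark essentially provides ready-made fixtures for the same thing); the transformation `add_full_name` is an assumed example, not from any particular codebase.

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical transformation under test: concatenate first and last names.
def add_full_name(df):
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so the tests run on the CI agent without a cluster.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_add_full_name(spark):
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    result = add_full_name(df).select("full_name").collect()
    assert result[0]["full_name"] == "Ada Lovelace"
```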
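To post results back to the pull request, the pipeline can call the code-hosting platform's API once the tests have run. The snippet below is a hypothetical example against the GitHub REST API; the repository, pull request number, token variable and summary text are all placeholders, and platforms like GitLab or Bitbucket have equivalent endpoints.

```python
import os
import requests

# All values below are placeholders for illustration only.
REPO = "my-org/my-data-repo"
PR_NUMBER = 42
SUMMARY = "Unit tests passed. Coverage report attached to the build."

response = requests.post(
    f"https://api.github.com/repos/{REPO}/issues/{PR_NUMBER}/comments",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    json={"body": SUMMARY},
    timeout=30,
)
response.raise_for_status()  # fail this pipeline stage if the comment could not be posted
```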
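For scheduling, one option is for the deployment stage to push a DAG definition into the scheduler. The Airflow (2.x) sketch below illustrates the idea; the DAG name, schedule and spark-submit command are assumptions for illustration only.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical nightly batch job; the command and paths are placeholders.
with DAG(
    dag_id="nightly_customer_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # every day at 02:00
    catchup=False,
) as dag:
    run_batch = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /opt/jobs/customer_load.py",
    )
```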
What do all these improvements and this pipelining do to our multistage workflow? They optimise the stages as the workflow progresses through each iteration: the leaner it gets, the greater the value of the improvement.
Historically, DevOps has evolved from the world of web and mobile application development, and data professionals often struggle to correlate and translate these topics into the world of MI and analytics. As you read through, you will have realised that DevOps principles can be applied to any form of software delivery, and they are certainly very relevant in the world of data and analytics.
Where do we go next? The next article in the series will look into a practical implementation of the sample workflow as a CI/CD pipeline using Jenkins. Watch this space! :)