DataOps: A Practical Understanding and Implementation
Not long ago I wrote a series of articles on DevOps for Business Intelligence, back when meaningful search results for 'DataOps' hadn't really taken off.
In 2022, however, DataOps seems to be grabbing more of the limelight and is touted as an industry-leading practice by Gartner, McKinsey and others.
This article provides a quick but comprehensive overview of DataOps - all you need to know to get going - with links to further information and a few practical examples.
For a better understanding of what DataOps is, and what it's not, check out my previous articles and keep watch for my upcoming article: 10 Myths and Facts about DataOps.
You can also educate yourself on the DataOps Methodology through a wealth of resources, such as IBMDeveloperSkillsNetwork: DE0205EN.
The course is designed for anyone interested in approaching analytics with a little more rigour than a handy notebook for queries. Without going knee-deep into the technology, it provides the right mindset, concepts and a common vocabulary for Data Operations, walking through each phase of the lifecycle as below.
For a more hands-on, technology- and architecture-oriented knowledge base, Microsoft provides a wiki guide in its documentation pages for managing cloud-scale analytics, with DataOps essentially piggy-backing on the DevOps application lifecycle, as below.
Why is DataOps important?
At its core, DataOps is a methodology that allows your BI solutions to persist, both in terms of robust design and business relevance. It is also a prerequisite to AI/ML (Artificial Intelligence and Machine Learning) capabilities and business automation, since it provides a reliable stream of quality data that continuously evolves and improves to meet the needs of the business.
The product of a marriage between IT and the business specifically for analytics and intelligence, DataOps enables quality insights through carefully curated datasets and self-serve tools, helping the organisation travel up the maturity curve for business analytics.
McKinsey encourages businesses to invest in DataOps to accelerate the design, development and deployment of new components in the data architecture, so teams can rapidly implement and frequently update solutions based on feedback.
What pain points does DataOps overcome?
Typically, there is a disconnect between the heaps of data generated by the business and what business analysts and data scientists really need for generating business insights.
This means business teams go through a significant amount of effort to fetch and prepare data on a best-effort basis. Yet even two teams that sit on the same floor can end up with 'different versions of the truth' - neither of which is guaranteed to be complete, accurate or validated.
The success of DataOps comes from its ability to remove the barriers the business faces in consuming data. No longer are IT and the business distant cousins that speak different languages and pursue separate missions.
DataOps provides a streamlined repeatable process to generate reliable data that the business needs and trusts on an ongoing basis.
In any case, strong foundational capabilities in data, analytics practices and people enable breakaway companies to scale, as shown below:
How to Establish DataOps?
Before reading ahead, it may be worthwhile to self-assess whether any of the ten red flags that signal an analytics program is in danger of failure could put the adoption of DataOps at risk.
DataOps cannot operate as a silo, the way analytics often has so far. To operate at an enterprise level, executive buy-in is crucial (even if it means starting small), along with an overwhelming sense of purpose in becoming an analytics-driven organisation. This is an essential prerequisite to establishing a data strategy, teams and ways of working, and a toolchain, as outlined below:
1 - Establish Data Strategy
For any organisation that truly believes 'data is the new oil', there needs to be a data strategy (often also called a manifesto) with executive sponsorship in place. It should define the internal and external environment in which data exists - regulatory requirements, governance, risks and issues, where data lives and so on - as well as how the organisation will realise a return on investment from its data strategy.
As the course defines it, this could be an architecture and an actionable roadmap, but a number of considerations feed into that plan. It should also continue to evolve with new discoveries and pivot if needed. For example, after evaluating the current architecture, consolidating data platforms may turn out not to be feasible after all, and the organisation may look towards data virtualisation instead.
For some relevant examples, McKinsey has published seven characteristics that define a data-driven enterprise, and the Australian Department of Industry, Science and Resources has publicly outlined a Data Strategy for how it will continue to build capabilities and resources as a data-driven organisation.
2 - Establish Team and Ways of Working
DataOps is essentially a marriage of the business with IT to enable analytics and insights, so it involves a mix of people with the required business knowledge as well as IT skills (often two ends of a spectrum) who can work together in harmony towards common goals.
Lean Agile teams are the best use of organisational resources in terms of ways of working, with regular communication and visibility over requirements, outputs and performance for iterative delivery.
The course doesn't extensively cover the best-of-breed industry frameworks for DataOps, but this blog post provides a quick read on how Lean, Agile, Six Sigma, ITSM and Scrum are well suited to achieving high value, minimal waste and complexity, rapid delivery and continuous improvement. Add to that list the DAMA framework for Data Management, which is becoming an increasingly popular standard, as well as a foundation in ISACA CGEIT.
Key Roles:
Chief Data Officer (CDO) - While a Chief Information Officer (CIO) is typically responsible for the overall IT strategy and implementation of an organization's technology infrastructure, the executive in charge of data strategy would be the CDO. However, Digital Adoption argues that while the Chief Digital Officer (CDO) has led data strategy in many organizations, there are signs that the CIO is taking the helm in this regard.
Data Governance Manager - Typically a senior role, pivotal to the Data Governance Council, which includes business process owners and legal and compliance experts who create broad data policies that govern the data landscape. They may be responsible for drafting data governance policies.
Data Stewards typically work with the Data Governance Manager to create appropriate artefacts for data governance and ensure they are correctly implemented within their domain. They are subject matter experts (SMEs) who understand and communicate the meaning and use of information, Templar says, and they work with other data stewards across the organization as the governing body for most data decisions.
Data Stewards are instrumental in compiling a semantic layer over data across the organisation, and modern tools often use AI to profile the data. The role of the data steward is sometimes bundled with other responsibilities, so a role may be advertised under that specific title, or its responsibilities may simply be recognisable within a broader job description. This article from Informatica explains the role of the data steward in greater detail.
Data Stewards and Data Governance Managers work closely together in the model shown below for industry best practice:
DataOps Manager - A DataOps Manager normally orchestrates the work by managing the delivery team. They are pivotal in ensuring business priorities and governance aspects are addressed in rapid iterations of each product until business needs are fulfilled and the end 'products' are stabilised, with automation in place to reliably provision data that can be trusted and meets agreed standards.
Data Engineers and Modellers typically compose the delivery team and do the pipelining and automation required to build the data assets for business consumption through modelling, engineering and data wrangling. Depending on the size and complexity of business requirements they may be part of a development team that looks after a certain domain, or may sometimes be organised around individual products.
Additionally, data engineers and modellers are ideally supported by a Platform Team of platform analysts and administrators who build, maintain and continuously improve the environment on which products are built, as well as ensuring the elimination of tech debt.
The platform team looks after the automation of CI/CD pipelines that push upgraded source code from dev efforts into the Production environment without any disruption. Once a product version has been productionised, it is live for the business to consume and the platform team ensures its smooth running.
Data Analysts/Business Analysts/Story Tellers consume the data assets built by Data Operations, and actively play a part in iteratively improving those assets by communicating with business managers and discussing evolving requirements. It's not uncommon for these resources to sit within business teams and become Subject Matter Experts in their business domain.
Data Scientists undertake much deeper and more exploratory analyses, as well as building machine learning algorithms and artificial intelligence for complex business problems. While most data scientists still spend the majority of their time on data cleansing and engineering, DataOps allows them to focus on actual problem solving once a continuous stream of the required data is available.
The typical rollout of DataOps involves the interaction of these various roles in delivering high quality products with strong governance that the business can consume for its information needs.
DAMA International Body of Knowledge (DAMA DMBOK2) provides a look at how these typically come together in the data governance model shown below:
Key Practices:
An important concept in DataOps practice, based on ITIL delivery principles, is to establish a Minimum Viable Product (MVP) as a starting point and continuously improve it over time through iterations prioritised in each sprint, most likely using an Agile Scrum framework.
Dev teams normally use a Kanban board to define the work (known as the backlog) that goes into each sprint, progressively moving cards from the 'To Do' pile into the 'Done' pile and calling out blockers in advance.
Backlog items for each sprint are prioritised by the DataOps Manager and planned around available resources, with clear-cut responsibilities and a definition of done. The diagram below shows McKinsey's view of prioritisation based on feasibility and impact, and the sketch that follows illustrates the idea.
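As a rough illustration of ordering a backlog by feasibility and impact (a hypothetical sketch; the items, the 1-5 scores and the multiplicative scoring are my own assumptions, not McKinsey's method):

```python
# Order a backlog by feasibility and impact; items and scores are made up.
backlog = [
    {"item": "Automate customer data ingestion", "feasibility": 4, "impact": 5},
    {"item": "Consolidate legacy sales reports", "feasibility": 5, "impact": 3},
    {"item": "Real-time inventory dashboard", "feasibility": 2, "impact": 4},
]

# Rank by the product of the two scores: high feasibility and high impact first.
ranked = sorted(backlog, key=lambda t: t["feasibility"] * t["impact"], reverse=True)
for task in ranked:
    print(f"{task['item']}: score {task['feasibility'] * task['impact']}")
```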
Performance of the team may be tracked based on output, but each sprint should ideally close out all of its allocated tasks.
Key Performance Indicators typically relate to: Data Inventory KPIs (e.g. how many data points are captured from the set of Critical Data Elements), Data Governance KPIs (e.g. the quality of the data) and Data Pipeline KPIs (e.g. the level of automation), as sketched below.
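For instance (a hypothetical sketch, with made-up counts and metric names), these three KPI families might be computed along the following lines:

```python
# A hypothetical sketch of the three KPI families mentioned above.
# All counts are illustrative placeholders.
critical_data_elements = 120    # elements defined by the governance council
elements_catalogued = 96        # elements captured in the data catalogue
records_checked = 10_000
records_passing_quality = 9_450
pipeline_steps = 25
automated_steps = 21

kpis = {
    "data_inventory_coverage": elements_catalogued / critical_data_elements,
    "data_quality_pass_rate": records_passing_quality / records_checked,
    "pipeline_automation_level": automated_steps / pipeline_steps,
}

for name, value in kpis.items():
    print(f"{name}: {value:.1%}")
```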
Alongside Sprint Planning, each sprint closes with a review and retrospective to evaluate performance metrics and identify ways to improve.
Similar to software development, the environment would be designed so that new developments can take place, be tested, and eventually be promoted into a stable production environment that the business can consume without disruption from work still being built and tested.
Once products are fully productionised (typically with high levels of automation), they are made available to the business and monitored for performance against agreed service level agreements (SLAs).
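To give a flavour of how such a promotion might be guarded, below is a minimal, hypothetical sketch of a data quality gate that a CI/CD pipeline could run before releasing a build to production. The checks, table and column names are illustrative assumptions, not part of any specific toolchain:

```python
# A hypothetical CI/CD quality gate: the pipeline runs these checks against
# the output of a candidate build and blocks the release if any fail.
import pandas as pd

def run_quality_gate(df: pd.DataFrame) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if df.empty:
        failures.append("output dataset is empty")
    if df["customer_id"].isna().any():
        failures.append("null customer_id values found")
    if df["customer_id"].duplicated().any():
        failures.append("duplicate customer_id values found")
    return failures

if __name__ == "__main__":
    # In a real pipeline this would read the candidate build's output.
    candidate = pd.DataFrame({"customer_id": [1, 2, 3], "spend": [10.0, 20.0, 15.0]})
    problems = run_quality_gate(candidate)
    if problems:
        raise SystemExit("Quality gate failed: " + "; ".join(problems))
    print("Quality gate passed - safe to promote to production")
```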
A useful starting point for DataOps is the compilation of a Data Catalogue, although more often than not this emerges as an organic by-product of needs-based product development.
Other starting points may be to consolidate legacy reporting into enterprise BI tools by systematically removing tech debt, achieving automation with minimum quality standards, and standardising data elements across the business into common standards set by the platform team for deployment to production.
3 - Establish Platform and Toolchain
Hybrid cloud solutions provide the ideal environment for implementing DataOps quickly and cheaply (relatively speaking), even for large corporates, government and not-for-profits with a hodge-podge of legacy systems.
McKinsey provides six foundational shifts companies are making to their data-architecture blueprints that enable more rapid delivery of new capabilities and vastly simplify existing architectural approaches, which organisations can aspire to in shaping their data landscapes:
A quick and ideal solution is a Modern Data Warehouse (MDW) in Microsoft Azure, which can bring all kinds of data together across Dev, Test and Production environments with CI/CD integration, as shown in the example architecture below for a fictional city planning office:
The platform collects data from many different sources through an API, using Azure Data Factory (ADF) for orchestration and Azure Data Lake Storage (ADLS) Gen2 for storage. The data is validated, cleansed and transformed to a known schema using Azure Databricks. Planners can then explore and report on parking use through Azure Synapse, with data visualisation tools such as Power BI, to determine whether more parking or related resources are needed.
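As a flavour of the validate-and-conform step (a minimal PySpark sketch; the storage paths, column names and Delta output are assumptions rather than the actual city-planning implementation):

```python
# A minimal PySpark sketch of conforming raw parking events to a known
# schema, roughly as a Databricks notebook might. Paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("parking-conform").getOrCreate()

schema = StructType([
    StructField("bay_id", StringType(), nullable=False),
    StructField("status", StringType(), nullable=True),
    StructField("event_time", TimestampType(), nullable=True),
])

# Read raw landed data; records not matching the schema become nulls.
raw = spark.read.schema(schema).json(
    "abfss://landing@mydatalake.dfs.core.windows.net/parking/"
)

# Cleanse: drop records missing the key, standardise the status values.
conformed = (
    raw.dropna(subset=["bay_id"])
       .withColumn("status", F.upper(F.trim(F.col("status"))))
)

# Write to the curated zone (Delta format assumes a Databricks environment).
conformed.write.mode("overwrite").format("delta").save(
    "abfss://curated@mydatalake.dfs.core.windows.net/parking/"
)
```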
For a more detailed look into an ideal-state modern analytics architecture with Azure Databricks and ADF ingestion pipelines, refer to the documentation provided by Microsoft for the architecture below:
The following talk from my trusted colleague Sandeep B. also provides a detailed look at a Microsoft Azure implementation of DataOps with some excellent insights on parallels with DevOps.
How does DataOps continue to operate in the organisation?
Depending on the level of complexity, DataOps often expands its reach to discover more data sources in order to solve business problems. For example, additional information may be required to decide on the impact of changing customer preferences which may require data from other areas of the business to be thrown into the mix for an analysis.
New data elements would then need to be classified to be usable for the purpose of the analysis. Three primary classification taxonomies the course mentions are: (i) domain - related categories or subject areas, broken down into sub-domains, with a single domain classification per data element to avoid redundancy; (ii) confidentiality - sensitivity and access requirements; and (iii) retention - how long the data may be retained for compliance.
Classification is relevant to data outcomes, and the above three taxonomies need to be justifiable within the scope of the requirements, so that the data is consumed ethically and efficiently by the business. Sophisticated data discovery and classification workflows may leverage AI capabilities in a well-maintained metadata environment.
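A minimal sketch of how these three taxonomies might be attached to a data element as catalogue metadata (the class, enumerations and example values are illustrative assumptions):

```python
# A hypothetical representation of the three classification taxonomies.
from dataclasses import dataclass
from enum import Enum

class Confidentiality(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    RESTRICTED = "restricted"

@dataclass
class DataElement:
    name: str
    domain: str                       # single domain to avoid redundancy
    confidentiality: Confidentiality  # sensitivity and access requirements
    retention_years: int              # compliance-driven retention period

customer_email = DataElement(
    name="customer_email",
    domain="Customer",
    confidentiality=Confidentiality.RESTRICTED,
    retention_years=7,
)
print(customer_email)
```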
Data that has been discovered and catalogued for consumption also needs to come with an understanding of its quality, such as its completeness or accuracy, which can greatly impact the reliability of the resulting analysis.
Selected data quality dimensions would be prioritised in assessing the data to arrive at a data profile quality score that can be added to the catalogue. Data cleansing, validation and similar enhancements can then be used to improve quality, often using advanced probability algorithms for matching and linking records.
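As a simple illustration, a data profile quality score could be a weighted blend of the prioritised dimensions (the dimensions and weights below are assumptions, not a standard):

```python
# A hypothetical weighted quality score over prioritised dimensions.
# Dimension scores (0-1) and weights are made-up examples.
dimension_scores = {"completeness": 0.97, "accuracy": 0.92, "timeliness": 0.85}
weights = {"completeness": 0.5, "accuracy": 0.3, "timeliness": 0.2}

quality_score = sum(dimension_scores[d] * weights[d] for d in weights)
print(f"Data profile quality score: {quality_score:.2f}")  # 0.93 here
```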
As the data catalogue continues to be refined, data policies will need to be updated and maintained to reflect protection, accessibility and security that can be enforced, monitored and audited, often due to regulatory requirements.
Source data often needs to be combined and manipulated to be usable by the business. For example, several tables may be combined into a view, and data lineage within the data catalogue is used to describe the data's origins and transformations.
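A toy sketch of combining two source tables into a derived view and recording its lineage for the catalogue (a dedicated catalogue tool would capture this automatically; the table names and structure are assumptions):

```python
# Combine two illustrative source tables and record the derived view's lineage.
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11], "amount": [50.0, 75.0]})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["North", "South"]})

customer_orders = orders.merge(customers, on="customer_id", how="left")

lineage = {
    "dataset": "customer_orders_view",
    "sources": ["sales.orders", "crm.customers"],
    "transformation": "left join on customer_id",
}
print(lineage)
```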
A number of considerations need to be made for data movement and integration tasks from source to target systems, such as conversion from the source data model to the target, the number of sources, repositories and so on.
There may be one-time migration tasks as well as recurring transformations, validation rules, data compression and the like, all of which need to be managed with the appropriate tooling to meet service level agreements (SLAs), disaster recovery requirements and scalability needs.
DataOps enables the business to self-serve and needs to ensure that the business is able to consume the data assets as intended. It also needs to monitor usage for changing requirements and translate these into ongoing work. For example, data that is not being used may need to be decommissioned or there may be accessibility issues in consuming the data.
Finally, DataOps will seek to continuously improve on performance indicators, especially when it is falling short in specific areas, while creating relevant artefacts such as documentation of automation rules, classification rules, data lineage etc.
To scale successfully, McKinsey provides nine drivers for scaling analytics. Data Operations would need to continuously assess how it can be a key enabler for these in achieving all the stages of the Capability Maturity Model as applied to analytics.
Since the term DataOps was first coined, significant changes in industry frameworks, tools and best practices have been tried and tested.
Each organisation is unique and there is no silver bullet for a successful implementation. True to its inherent philosophy, DataOps itself will continue to evolve, taking on its own unique character that differentiates it from the DevOps/ITSM application software delivery lifecycle.
However, as this recent article from McKinsey spells out, DataOps is key to unlocking the next frontier for driving value from data: managing it like a product. It's a very good read.