Why Are Data Scientists Writing Code?
Turn to Gartner, Forrester, Harvard Business Review, and any other self-respecting industry analyst, and the answer seems to be clear: Because business is about data.
The value of data has become indispensable in the digital business world. Enterprises must detect insights quickly, immediately respond to the needs and opportunities they reveal, and enrich and monetize their relationships with customers and partners.
Turn to the business analysts, data analysts, data scientists, and data engineers, and they tell you they were hired to find and extract such valuable insights from data, with incentives to understand the business and generate actionable intelligence that shapes these interactions and decisions every day. So far, so good.
And yet, if these specialists should be focused on the data, why are they spending all their time writing software? By analogy, would you force an electrician to do the work of a plumber? The two trades look similar, since both route resources from a meter to various points in the house, but the results would probably include leaky pipes, a failed report from the county inspector, a sad electrician, and lots of collateral damage.
The answers to this question lie in (1) the nature of today's data, (2) the specifics of the work to be done, and (3) the capabilities of today’s technology landscape.
First, to understand the nature of the data, let’s recap three attributes of today’s big data in the enterprise, and contrast it with traditional transactional data:
- It is semi-structured. The data structures from apps, websites, logs, and digital services like social networks are ad hoc and constantly changing (a sketch follows this list). Today, only a fraction of enterprise data comes from transactional systems with stable, cleanly defined structures that are carefully documented in data dictionaries.
- It is far more verbose. Billions of website clicks, ad impressions, app launches, catalog browsing, and social likes quickly overwhelm traditional enterprise databases and networks. This volume dwarfs the usual purchases and customer records of traditional business systems, and holds many more hidden opportunities.
- It is far more perishable. The value of this digital data degrades quickly with time. For example, a signal indicating that a consumer is in the grocery store when they launch the store’s app is only valuable for the few minutes in which a targeted promotion can still reach them; after that, its value fades.
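To make the first of these attributes concrete, here is a minimal sketch, with entirely hypothetical event payloads and field names, of why semi-structured feeds resist fixed schemas: two events from the same app arrive with different fields, so even pulling out a user identifier takes defensive code rather than a simple column lookup.

```python
import json

# Two hypothetical clickstream events from the same app. Note that the field
# names and nesting differ between releases -- no schema is guaranteed.
events = [
    '{"user": "u123", "action": "app_launch", "geo": {"lat": 40.7, "lon": -74.0}}',
    '{"user_id": "u123", "event": "add_to_cart", "sku": "A-991", "ts": 1697040000}',
]

def extract_user(raw_event):
    """Pull a user identifier out of an event, whatever field name it arrived under."""
    record = json.loads(raw_event)
    return record.get("user") or record.get("user_id")

for raw_event in events:
    print(extract_user(raw_event))  # prints "u123" twice, but only thanks to defensive code
```

A transactional system would hand you the same identifier in the same column every time; a digital feed will not.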
Second, to produce real business value from this data, the teams have to apply four key strategies:
- Distill specific actionable insights from vast volumes of data. Finding new insights about the business is like finding needles in haystacks, requiring chained sequences of smart operations that home in on what matters (see the sketch after this list). It is fundamentally different from the aggregations and summaries produced by traditional business intelligence.
- Compress the time between new data arrival and targeted action. Fast, efficient throughput requires smart, dynamic scaling of computational infrastructure to process massive new data arrivals. This is fundamentally different from scheduled batch operations of traditional system integration and database processing.
- Refresh insights about the business more frequently. Efficient continuous updating requires applying data analysis to fresh data in minimal time to refresh each metric or insight. This is fundamentally different from data streams, which carry raw data and basic event detection, and from traditional ETL-centric integration, which typically replicates data within defined time windows without applying intelligence.
- Turn the insight into immediate action. Today’s apps are expected to instantly merge deep insights with newly arriving data to create rich, contextualized, and personalized interactions that trigger valuable intelligent action. Because the value of such actionable insights degrades quickly, they are increasingly fed directly into live AI systems, dashboards, and applications. This is fundamentally different from traditional business intelligence, which renders summarized information in common visualizations on a scheduled basis.
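As a rough illustration of the first strategy, here is a minimal sketch in pandas, with made-up column names, data, and thresholds: a chained sequence of operations that distills a raw clickstream down to one small, perishable, actionable result, namely the users who added an item to their cart and abandoned it within the last ten minutes.

```python
import pandas as pd

# Hypothetical raw clickstream -- in practice this would be billions of rows.
clicks = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u3", "u3", "u3"],
    "action":      ["view", "add_to_cart", "view", "view", "add_to_cart", "abandon"],
    "minutes_ago": [3, 2, 45, 9, 6, 1],
})

# A chained sequence of operations that homes in on one actionable insight:
# users who added to cart and then abandoned it within the last 10 minutes.
recent = clicks[clicks["minutes_ago"] <= 10]
carted = set(recent.loc[recent["action"] == "add_to_cart", "user_id"])
abandoned = set(recent.loc[recent["action"] == "abandon", "user_id"])
target_users = carted & abandoned  # candidates for an immediate targeted promotion

print(target_users)  # {'u3'}
```

Note that the output is the opposite of a summary report: a tiny, time-sensitive set of records meant to trigger an action, not a chart.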
The pressure of applying these demanding strategies to swelling feeds of semi-structured big data recently gave rise to a software pattern called “data pipelines.” Interpreted in various ways depending on the background and experience of each data team, the pattern has become the dominant form of processing business data today.
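In its simplest form, the pattern reduces to three stages. The sketch below uses hypothetical function names and an in-memory batch of events to show the shape of what teams end up hand-rolling, stripped of all the scheduling, scaling, error handling, and monitoring a production pipeline also needs.

```python
import json

# A hypothetical batch of newly arrived raw events (e.g., JSON lines landed in a data lake).
RAW_BATCH = [
    '{"user": "u1", "action": "view"}',
    '{"user": "u2", "action": "add_to_cart"}',
    '{"user": "u3", "action": "add_to_cart"}',
]

def ingest(batch):
    """Collect newly arrived raw events; in production this would read from a lake or queue."""
    return [line for line in batch if line.strip()]

def transform(raw_events):
    """Distill the raw feed into one small, actionable number."""
    parsed = [json.loads(event) for event in raw_events]
    return sum(1 for event in parsed if event.get("action") == "add_to_cart")

def deliver(insight):
    """Hand the refreshed insight to a downstream app, dashboard, or model."""
    print(f"add-to-cart events in this batch: {insight}")

deliver(transform(ingest(RAW_BATCH)))
```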
Third, now that we have a handle on the nature of the work to be done, it turns out that the plethora of technologies data teams can use to operationalize the data pipeline pattern has inherent flaws in serving these new needs:
- Traditional ETL tools. These tools are designed for extracting, transforming, and loading (ETL) data from well-structured transactional systems. When applied to modern volumes of semi-structured data, they run up against their limits in scale and in the diversity of structure they can handle.
- Modern BI tools. Designed to drive user interactions, reports, and dashboards, business intelligence (BI) systems operate best with limited volumes of pre-processed data, and have few deep analytics capabilities.
- Big data tools. Kafka, Spark, clouds, various data prep and open-source tools, and a burgeoning world of AI code libraries all provide powerful building blocks for processing massive semi-structured data in concert with a data lake strategy. However, these building blocks only come to life alongside the many other tools dedicated to the dark arts of sophisticated software development. There is a big gap between the methods of constructing data pipelines and the analytical methods of data scientists and analysts.
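To give a feel for that gap, here is a sketch of just the ingestion edge of a pipeline built from those raw building blocks, assuming a hypothetical Kafka topic named clickstream, a made-up event schema, and a Spark environment that already has the Kafka connector package installed. Everything after this point, the business logic, orchestration, monitoring, and deployment, is still code someone on the data team has to write.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Hypothetical schema for the clickstream topic; real feeds drift over time.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
])

# Read the raw Kafka feed and parse the JSON payload out of each message.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker address
    .option("subscribe", "clickstream")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Write the parsed stream somewhere visible; a real pipeline would continue with
# transformations, joins against historical data, and delivery steps.
query = events.writeStream.format("console").start()
query.awaitTermination()
```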
This brings us to the crux of the issue: Every data team is forced to perform like a software development team. To construct each of their data pipelines from scratch, they write thousands of lines of code around hundreds of raw open source, cloud, and legacy building blocks. In most cases they retrofit the pipeline pattern to an existing data lake, unwittingly conflating strategies for data at rest with those for data in motion. As the number of pipelines grows into the hundreds, so do the sprawl of technologies the team is using, the millions of lines of code written and copied, the silos of data replicated across the business, and the time taken to build, orchestrate, deploy, and maintain these custom ad-hoc software systems.
And THAT is how business analysts, data analysts, data scientists, and data engineers end up spending all their time writing code.
How does this status quo threaten the enterprise, and what can be done about it? Let’s dig into that in future articles.
Contact me at strategy@ascend.io
Thank you for reading and please comment below!
#bigdata #data #transformation #cloud #digitaltransformation #dataengineering #innovation #technology