Big data is about the data first. DevOps is second.
A typical big data consulting session at a company: I'm invited to a two-hour review to see how I can help. Around the table are a manager, a data engineer, a data scientist, and a DevOps engineer. Sometimes all four, sometimes only some of them - but there is always DevOps.
Don't get me wrong - operations are a critical part of big data: running the cluster, controlling costs, shaping the design. For some companies I do some of the DevOps work myself. I do not disrespect DevOps in any way.
One problem is that the performance task is assigned to DevOps. The data engineers and data scientists are busy sprint to sprint adding features. They don't have the time to look at the pipeline they (or their predecessors) have built and understand why it is so slow.
Another problem is that the DevOps "levers" are obvious, and everybody suggests pulling them:
- We should increase the executor memory!
- We should use larger machine instances - I've heard they're better than small ones!
- We should use spot instances for the core group!
- We should try to write to SSD!
- I've read in a blog that changing the Java GC can help.
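In Spark terms, these suggestions all boil down to `spark-submit` flags. A sketch of the usual tuning pass (the flags are real Spark options, the values are illustrative, `my_pipeline.py` is a hypothetical job):

```shell
# The usual "pull the DevOps levers" pass.
# Note that none of these flags touch the data itself.
spark-submit \
  --executor-memory 16g \
  --executor-cores 4 \
  --num-executors 20 \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  my_pipeline.py
```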
However, in most cases, the thing to look at is (surprise!) the data and how you handle it:
- Do you use CSV or Parquet?
- Can you use "cluster by" (bucketing) for your joins?
- Can you use logical partitions (e.g. on dates)?
- Can you test how many (physical) partitions you need?
- Do you have data skew?
- Can you move some of the joins/enrichments to earlier stages to reduce data?
- How much data (records, bytes) do you have at each stage? Why do you have stages that increase the amount of data?
- Are you trying to do updates in a system that is inherently immutable - should you use an RDBMS or a different tool?
- Can you use approximations for some of your calculations?
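Most of these checks are cheap. A data-skew check, for example, is just counting the join keys before the join. A minimal sketch in plain Python (the records and the `user_id` key are hypothetical; in Spark you would do the same with a `groupBy` and `count` on the join key):

```python
from collections import Counter

def skewed_keys(records, key, threshold=0.2):
    """Return join keys that hold more than `threshold` of all records.

    A single hot key (e.g. a null or a default user id) can force one
    task to process most of the shuffle data while the rest sit idle.
    """
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items() if c / total > threshold}

# Hypothetical sample: the "unknown" user dominates the join key.
records = [{"user_id": "unknown"} for _ in range(80)] + \
          [{"user_id": f"u{i}"} for i in range(20)]
print(skewed_keys(records, "user_id"))  # → {'unknown': 0.8}
```

If this prints a hot key, no amount of executor memory will fix the join - the skew has to be handled in the data (salting the key, filtering the defaults, or pre-aggregating).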
In the RDBMS world we have application programmers who write SQL, and we have a DBA who handles indices, bucketing, layout, and so on. In the big data world the data engineer is responsible for both the business logic (the application side) and the data optimisation (the DBA side) - and at a much lower level than in an RDBMS.
People are surprised that when we actually talk about the data, the process, and what we are trying to achieve, we can reach performance improvements of an order of magnitude or more. Bottom line: start talking about the data before pulling the DevOps levers.