BigData is about data first. DevOps is second.

A typical big-data consulting session goes like this: I'm invited to a two-hour review to see how I can help. Around the table are a manager, a data engineer, a data scientist, and a DevOps engineer. Sometimes all four, sometimes only some of them, but there is always DevOps.

Don't get me wrong - operations are a critical part of BigData: operations, costs, and design. For some companies I even do some of the DevOps work myself. I do not disrespect DevOps in any way.

One problem is that the performance task gets assigned to DevOps. The data engineers and data scientists are busy sprint to sprint adding features. They don't have the time to look at the pipeline they (or their predecessors) have built and understand why it is so slow.

Another problem is that the DevOps "levers" are obvious, and everybody suggests pulling them:

  • We should increase the executor memory!
  • We should use larger machine instances - I heard they're better than small ones!
  • We should use spot instances for the core group!
  • We should try writing to SSD!
  • I've read in some blog that changing the Java GC can help.

However - in most cases - the thing to look at is (surprise!) the data and how you handle it.

  • Do you use CSV or Parquet?
  • Can you use "cluster-by" (bucketing) for your joins?
  • Can you use logical partitions (e.g. on dates)?
  • Can you test how many (physical) partitions you need?
  • Do you have data skew?
  • Can you move some of the joins/enrichments to earlier stages to reduce data?
  • How much data/how many records do you have at each stage? Why do you have stages that increase the amount of data?
  • Are you trying to update in a system which is inherently immutable - should you use an RDBMS or a different tool?
  • Can you use approximations for some of your calculations?
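To make the data-skew item above concrete, here is a minimal, framework-free sketch in plain Python (hypothetical data and field names; in Spark you would group by the join key and count). It flags the classic symptom: one "heavy" key that keeps a few tasks running for hours while the rest finish in seconds.

```python
from collections import Counter

def skew_report(records, key, top=3):
    """Count rows per join key and compare the heaviest key to the average.

    A skew ratio far above 1 means one key dominates the join - the
    usual reason a handful of partitions run much longer than the rest.
    """
    counts = Counter(r[key] for r in records)
    avg = sum(counts.values()) / len(counts)
    heaviest = counts.most_common(top)  # (key, row_count) pairs, largest first
    return {
        "avg_rows_per_key": avg,
        "heaviest_keys": heaviest,
        "skew_ratio": heaviest[0][1] / avg,
    }

# Hypothetical event rows: one "whale" customer produces most of the traffic.
rows = [{"customer": "whale"} for _ in range(9000)] + \
       [{"customer": f"c{i}"} for i in range(1000)]
report = skew_report(rows, "customer")
```

Running this on the sample data reports "whale" as the heaviest key with a skew ratio in the hundreds; that is the kind of number that tells you to fix the data (salting, pre-aggregation, broadcast join) before touching executor memory.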

In the RDBMS world we have the application programmers who write SQL, and we have the DBA who handles indices, bucketing, layout, etc. In the big-data world the data engineer is required to be responsible both for the business logic (application) and for the data optimisation (DBA) side - and at a much lower level than in an RDBMS.

People are surprised that when we actually talk about the data, the process, and what we are trying to achieve, we can reach performance improvements of an order of magnitude or more. Bottom line: start talking about the data before pulling the DevOps levers.
