Big data is about the data first. DevOps is second.
A typical big data consulting session at a company: I'm invited to a two-hour review to see how I can help. Around the table are a manager, a data engineer, a data scientist, and a DevOps engineer. Sometimes all four, sometimes only some of them - but there is always DevOps.
Don't get me wrong - operations are a critical part of big data: running the cluster, controlling costs, shaping the design. For some companies I do some of the DevOps work myself. I do not disrespect DevOps in any way.
One problem is that the performance task is assigned to DevOps. The data engineers and data scientists are busy sprint to sprint adding features. They don't have the time to look at the pipeline they (or their predecessors) have built and understand why it is so slow.
Another problem is that the DevOps "levers" are obvious, and everybody suggests pulling them:
- We should increase the executor memory!
- We should use larger machine instances - I've heard they're better than small ones!
- We should use spot instances for the core group!
- We should try to write to SSD!
- I've read in a blog that changing the Java GC can help.
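In Spark terms, these suggestions all boil down to `spark-submit` flags. A sketch of the usual tuning pass (the flags are real Spark options, the values are illustrative, `my_pipeline.py` is a hypothetical job):

```shell
# The usual "pull the DevOps levers" pass.
# Note that none of these flags touch the data itself.
spark-submit \
  --executor-memory 16g \
  --executor-cores 4 \
  --num-executors 20 \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  my_pipeline.py
```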
However, in most cases, the thing to look at is (surprise!) the data and how you handle it:
- Do you use CSV or Parquet?
- Can you use "cluster by" (bucketing) for your joins?
- Can you use logical partitions (e.g. on dates)?
- Can you test how many (physical) partitions you need?
- Do you have data skew?
- Can you move some of the joins/enrichments to earlier stages to reduce data?
- How much data (records, bytes) do you have at each stage? Why do you have stages that increase the amount of data?
- Are you trying to do updates in a system that is inherently immutable - should you use an RDBMS or a different tool?
- Can you use approximations for some of your calculations?
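Most of these checks are cheap. A data-skew check, for example, is just counting the join keys before the join. A minimal sketch in plain Python (the records and the `user_id` key are hypothetical; in Spark you would do the same with a `groupBy` and `count` on the join key):

```python
from collections import Counter

def skewed_keys(records, key, threshold=0.2):
    """Return join keys that hold more than `threshold` of all records.

    A single hot key (e.g. a null or a default user id) can force one
    task to process most of the shuffle data while the rest sit idle.
    """
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items() if c / total > threshold}

# Hypothetical sample: the "unknown" user dominates the join key.
records = [{"user_id": "unknown"} for _ in range(80)] + \
          [{"user_id": f"u{i}"} for i in range(20)]
print(skewed_keys(records, "user_id"))  # → {'unknown': 0.8}
```

If this prints a hot key, no amount of executor memory will fix the join - the skew has to be handled in the data (salting the key, filtering the defaults, or pre-aggregating).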
In the RDBMS world we have application programmers who write SQL, and we have a DBA who handles indices, bucketing, layout, and so on. In the big data world the data engineer is responsible for both the business logic (the application side) and the data optimisation (the DBA side) - and at a much lower level than in an RDBMS.
People are surprised that when we actually talk about the data, the process, and what we are trying to achieve, we can reach performance improvements of an order of magnitude or more. Bottom line: start talking about the data before pulling the DevOps levers.