Use Case Discovery :: Summarization Optimization with HDP Apache Spark & Avro.
I am on-site at a UK client teaching Apache Spark. As I usually do these days, on day one I let them know we can do my Use Case Discovery workshop on the last day of training, and on a whim, we got this process rolling after lunch today (day two). Here is what we came up with:
A. They already have a Spark process which is involved with the creation and querying of an Avro "table." But it takes too long (five hours) to return query results. Wait, where have I heard this use case before? Just recently, and we got a 25x speed increase. The only problem is Spark is already involved at five hours. Hmmmm, wait -- Spark RDDs are involved. Let's get a DataFrame (and the Catalyst Optimizer) involved here. A quick sketch of that idea follows.
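For anyone who wants to see what that switch looks like, here is a minimal sketch (the HDFS path and column names are my own placeholders, not the client's): read the Avro table as a DataFrame so the Catalyst optimizer plans the query, instead of hand-rolling RDD transformations.

```scala
import org.apache.spark.sql.SparkSession

object SummarizationViaDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Summarization via DataFrames")
      .getOrCreate()

    // Read the Avro "table" as a DataFrame rather than an RDD, so Catalyst can
    // push filters down and prune columns. The "avro" format is built in from
    // Spark 2.4; earlier releases need the spark-avro package on the classpath.
    val events = spark.read
      .format("avro")
      .load("/data/events/transactions")   // placeholder HDFS path

    events.createOrReplaceTempView("transactions")

    // The same kind of summarization query, now planned and optimized by Catalyst.
    val summary = spark.sql(
      """SELECT event_date, COUNT(*) AS txn_count, SUM(amount) AS total_amount
        |FROM transactions
        |GROUP BY event_date""".stripMargin)

    summary.show()
    spark.stop()
  }
}
```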
B. What other optimizations can we find here... Time for the Use Case Discovery process and my favorite questions:
{ Ingestion (120+ datasources) > Curation > Events (Transactions Avro Table) > Summarization (text files) with Sqoop to RDBMS }
{ Rob & Karen in foreground }
- Short Name (elevator pitch name): Summarization Optimization.
- Problem Statement (Description): The set of queries against the Transactions table are taking (way) too long. We need this job to finish in under one hour. (Currently it takes five hours). Can we create a 5x speed increase?
- Datasources: Events (Transactions) Table (just one); Reference Table (don't ask... xN); Parameter Table (don't ask about this either... xN).
- Proposed Solution(s): Many optimizations may make this faster. (Items ending in a question mark require discussion and a decision; items ending in a period are not questions and must happen.)
A. Make better use of our 256MB blocks in HDFS.
B. Partitioning of Avro, but how and when (as in we need this immediately)? See the partitioning sketch after this list.
C. Partitioning sympathetic to HDFS blocks?
D. Redesign the Avro schema altogether? [Red Flag: the designer may be here]
E. Come up with a better way to use the two extra Reference/Parameter tables.
E.1. Do away with the Reference table?
F. Use Apache HBase rather than Avro?
G. Write more efficient queries!
H. Can we leverage some sort of indexing of Avro?
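To make B and C above concrete, here is a minimal sketch of what partitioning the Transactions data by event date could look like. The paths and the event_date column are placeholders of mine, and the repartition-before-write is one common way to steer output file sizes toward full 256MB HDFS blocks (point A above):

```scala
import org.apache.spark.sql.SparkSession

object PartitionAndCompactTransactions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Partition Transactions by event date")
      .getOrCreate()

    // Placeholder path and column name.
    val events = spark.read.format("avro").load("/data/events/transactions")

    // Repartitioning by the partition column sends each date to a single task,
    // so each event_date directory ends up with one larger file instead of
    // many small ones -- which is how we aim at full 256MB HDFS blocks.
    events
      .repartition(events("event_date"))
      .write
      .format("avro")
      .partitionBy("event_date")
      .mode("overwrite")
      .save("/data/events/transactions_by_date")   // placeholder output path

    spark.stop()
  }
}
```

Whether this rewrite runs as a one-off compaction job or as part of ingestion is exactly the "how and when" question in B.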
- Stakeholders (e.g. who cares?): Customer Insight Team (S.V.); Sprint Dock; GoFE IM; D & D (A.B.); Transformation (Z); Mortgage Team.
- Reviewers (e.g. who can say this is right, or this won't work and here's why?): Man Co (GBP); GoFE IM; Big Data Design Authority; CI (for accuracy).
- Further thoughts: No more room on the whiteboard! :)
So we met again about this on day three of training, and turned our thoughts into stories. Here is what we have so far, as stories [Priority 1 = high ... 5 = low]:
A. Create a job to compact the data into full blocks.[1] => Tariq & Caroline.
{ Sukhvinder & Jason }
B. Partition events on the Event Date.[1] => Dave & Nimesh.
C. Spark partitioning to increase speed:
- Partitioning in Summarization filtering job.[2] => Richard and Lee.
- Partitioning data content jobs.[5]
D. Use Spark Broadcast Variables (instead of reference tables; see the sketch after this list):
- Broadcast Variable in Summarization filtering job.[1] => Rob & Karen
- Broadcast Variable for Data Content jobs.[4]
E. Refactor existing queries to be more efficient:
- Cache Event Table (subsets).[2] => Alex & Anna
- Use Spark DataFrames.[1] => Jason and Sukhvinder => "DONE" before lunch break.
- Use Spark via JDBC (instead of Sqoop).[3]
- Translate Spark SQL code into native Spark SQL API invocations.[4]
F. Integrate and Test (all of the above).[*] => Owen. This will be continuous integration.
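To make stories D and E a little more tangible, here is a minimal sketch of the shape the refactored Summarization job could take -- a broadcast join against the Reference table, caching of the Event subset that gets reused, and a direct JDBC write in place of text files plus Sqoop. The paths, column names, and connection details are placeholders of mine, not the client's.

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object SummarizationRefactorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Summarization with broadcast join and JDBC output")
      .getOrCreate()

    // Placeholder paths and column names.
    val events    = spark.read.format("avro").load("/data/events/transactions")
    val reference = spark.read.format("avro").load("/data/reference")

    // Story D: broadcast the small Reference table to every executor so the
    // join never shuffles the large Transactions table.
    val enriched = events.join(broadcast(reference), Seq("ref_key"))

    // Story E: cache the subset of events that the summarization queries reuse.
    val recent = enriched.filter(enriched("event_date") >= "2016-01-01").cache()

    val summary = recent.groupBy("event_date").count()

    // Story E: write straight to the RDBMS over JDBC instead of dumping text
    // files and Sqoop-ing them across.
    val props = new Properties()
    props.setProperty("user", "etl_user")      // placeholder credentials
    props.setProperty("password", "********")
    summary.write
      .mode("append")
      .jdbc("jdbc:postgresql://dbhost:5432/warehouse", "daily_summary", props)

    spark.stop()
  }
}
```

The classic sparkContext.broadcast(...) variable does the same job inside RDD code paths; the DataFrame broadcast hint shown here is the equivalent when the query itself is a DataFrame.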
The current duration of the job we are trying to improve is 5 hours, y minutes. (This job is being run tonight to give us a baseline for speed; we are still looking for y.)
Definition of Done: "Team members must submit runnable code. Code is to be continuously compiled."
We will have 30-minute Scrum Sprints until we have our PoC.
Laurent Weichberger, Big Data Bear, Hortonworks: lweichberger@hortonworks.com