Use Case Discovery :: Summarization Optimization with HDP Apache Spark & Avro.
I am on-site at a UK client teaching Apache Spark. As I usually do these days, on day one I let them know we can do my Use Case Discovery workshop on the last day of training, and on a whim, we got this process rolling after lunch today (day two). Here is what we came up with:
A. They already have a Spark process which is involved with the creation and querying of an Avro "table." But it takes too long (five hours) to return query results. Wait, where have I heard this use case before? Just recently, and we got a 25x speed increase. The only problem is Spark is already involved at five hours. Hmmmm, wait -- Spark RDDs are involved. Let's get a DataFrame (and the Catalyst Optimizer) involved here. A quick sketch of that idea follows.
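For anyone who wants to see what that switch looks like, here is a minimal sketch (the HDFS path and column names are my own placeholders, not the client's): read the Avro table as a DataFrame so the Catalyst optimizer plans the query, instead of hand-rolling RDD transformations.

```scala
import org.apache.spark.sql.SparkSession

object SummarizationViaDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Summarization via DataFrames")
      .getOrCreate()

    // Read the Avro "table" as a DataFrame rather than an RDD, so Catalyst can
    // push filters down and prune columns. The "avro" format is built in from
    // Spark 2.4; earlier releases need the spark-avro package on the classpath.
    val events = spark.read
      .format("avro")
      .load("/data/events/transactions")   // placeholder HDFS path

    events.createOrReplaceTempView("transactions")

    // The same kind of summarization query, now planned and optimized by Catalyst.
    val summary = spark.sql(
      """SELECT event_date, COUNT(*) AS txn_count, SUM(amount) AS total_amount
        |FROM transactions
        |GROUP BY event_date""".stripMargin)

    summary.show()
    spark.stop()
  }
}
```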
B. What other optimizations can we find here... Time for the Use Case Discovery process and my favorite questions:
{ Ingestion (120+ datasources) > Curation > Events (Transactions Avro Table) > Summarization (text files) with Sqoop to RDBMS }
{ Rob & Karen in foreground }
- Short Name (elevator pitch name): Summarization Optimization.
- Problem Statement (Description): The set of queries against the Transactions table are taking (way) too long. We need this job to finish in under one hour. (Currently it takes five hours). Can we create a 5x speed increase?
- Datasources: Events (Transactions) Table (just one); Reference Table (don't ask... xN); Parameter Table (don't ask about this either... xN).
- Proposed Solution(s): Many optimizations may make this faster. (Items ending in a question mark require discussion and a decision; items ending in a period are not questions and must happen.)
A. Make better use of our 256MB blocks in HDFS.
B. Partitioning of Avro, but how and when (as in we need this immediately)? See the partitioning sketch after this list.
C. Partitioning sympathetic to HDFS blocks?
D. Redesign the Avro schema altogether? [Red Flag: the designer may be here]
E. Come up with a better way to use the two extra Reference/Parameter tables.
E.1. Do away with the Reference table?
F. Use Apache HBase rather than Avro?
G. Write more efficient queries!
H. Can we leverage some sort of indexing of Avro?
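To make B and C above concrete, here is a minimal sketch of what partitioning the Transactions data by event date could look like. The paths and the event_date column are placeholders of mine, and the repartition-before-write is one common way to steer output file sizes toward full 256MB HDFS blocks (point A above):

```scala
import org.apache.spark.sql.SparkSession

object PartitionAndCompactTransactions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Partition Transactions by event date")
      .getOrCreate()

    // Placeholder path and column name.
    val events = spark.read.format("avro").load("/data/events/transactions")

    // Repartitioning by the partition column sends each date to a single task,
    // so each event_date directory ends up with one larger file instead of
    // many small ones -- which is how we aim at full 256MB HDFS blocks.
    events
      .repartition(events("event_date"))
      .write
      .format("avro")
      .partitionBy("event_date")
      .mode("overwrite")
      .save("/data/events/transactions_by_date")   // placeholder output path

    spark.stop()
  }
}
```

Whether this rewrite runs as a one-off compaction job or as part of ingestion is exactly the "how and when" question in B.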
- Stakeholders (e.g. who cares?): Customer Insight Team (S.V.); Sprint Dock; GoFE IM; D & D (A.B.); Transformation (Z); Mortgage Team.
- Reviewers (e.g. who can say this is right, or this won't work and here's why?): Man Co (GBP); GoFE IM; Big Data Design Authority; CI (for accuracy).
- Further thoughts: No more room on the whiteboard! :)
So we met again about this on day three of training, and turned our thoughts into stories. Here is what we have so far, as stories [Priority 1 = high ... 5 = low]:
A. Create a job to compact the data into full blocks.[1] => Tariq & Caroline.
{ Sukhvinder & Jason }
B. Partition events on the Event Date.[1] => Dave & Nimesh.
C. Spark partitioning to increase speed:
- Partitioning in Summarization filtering job.[2] => Richard and Lee.
- Partitioning data content jobs.[5]
D. Use Spark Broadcast Variables (instead of reference tables; see the sketch after this list):
- Broadcast Variable in Summarization filtering job.[1] => Rob & Karen
- Broadcast Variable for Data Content jobs.[4]
E. Refactor existing queries to be more efficient:
- Cache Event Table (subsets).[2] => Alex & Anna
- Use Spark DataFrames.[1] => Jason and Sukhvinder => "DONE" before lunch break.
- Use Spark via JDBC (instead of Sqoop).[3]
- Translate Spark SQL code into native Spark SQL API invocations.[4]
F. Integrate and Test (all of the above).[*] => Owen. This will be continuous integration.
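To make stories D and E a little more tangible, here is a minimal sketch of the shape the refactored Summarization job could take -- a broadcast join against the Reference table, caching of the Event subset that gets reused, and a direct JDBC write in place of text files plus Sqoop. The paths, column names, and connection details are placeholders of mine, not the client's.

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object SummarizationRefactorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Summarization with broadcast join and JDBC output")
      .getOrCreate()

    // Placeholder paths and column names.
    val events    = spark.read.format("avro").load("/data/events/transactions")
    val reference = spark.read.format("avro").load("/data/reference")

    // Story D: broadcast the small Reference table to every executor so the
    // join never shuffles the large Transactions table.
    val enriched = events.join(broadcast(reference), Seq("ref_key"))

    // Story E: cache the subset of events that the summarization queries reuse.
    val recent = enriched.filter(enriched("event_date") >= "2016-01-01").cache()

    val summary = recent.groupBy("event_date").count()

    // Story E: write straight to the RDBMS over JDBC instead of dumping text
    // files and Sqoop-ing them across.
    val props = new Properties()
    props.setProperty("user", "etl_user")      // placeholder credentials
    props.setProperty("password", "********")
    summary.write
      .mode("append")
      .jdbc("jdbc:postgresql://dbhost:5432/warehouse", "daily_summary", props)

    spark.stop()
  }
}
```

The classic sparkContext.broadcast(...) variable does the same job inside RDD code paths; the DataFrame broadcast hint shown here is the equivalent when the query itself is a DataFrame.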
The current duration of the job we are trying to improve is 5 hours, y minutes. (This job is being run tonight to give us a baseline for speed; we are still looking for y.)
Definition of Done: "Team members must submit runnable code. Code is to be continuously compiled."
We will have 30-minute Scrum Sprints until we have our PoC.
Laurent Weichberger, Big Data Bear, Hortonworks: lweichberger@hortonworks.com