Big Data Delta Processing with Attunity, Hive, Spark, and HDFS
I am teaching for Hortonworks on-site at a client in Illinois. I just conducted my Use Case Discovery Workshop to help them solve a real business case. We stepped through the questions I love to ask and came up with some User Stories. I gave them five-minute Sprints, using Agile-Scrum, so that we could get a real implementation in short order. (Shout out to my Agile mentor, Alistair Cockburn, for teaching me about ultra-short Sprints!)
Use Case Discovery Questions:
- Short Name?
- Problem Statement?
- Proposed Solution?
- Stakeholders?
- Solution Reviewers?
- Data Sources?
- Further thoughts? (Maybe something juicy here.)
I won't disclose all of the answers, because that would not serve; however, the problem is basically this: How do we capture changes to our data on a daily basis, while archiving the historical (legacy) version of the data for the actuaries and others who may need it for other queries?
After we all agreed on the problem, we shifted into writing Agile User Stories, with priorities. Stories (Name. Story. [Priority: 1 = high, 5 = low]; a rough sketch of the delta steps appears after the list):
A. Ingest data from the mainframe via Attunity, and then into Spark DataFrames and Hive tables. [1]
B. Ingest the "Master File" from the legacy EDW to HDFS. [1]
C. Compare the ingested Attunity data to the "Current Master" (delta process). [3]
D. Split the Master into a current table and a change log. [2]
E. Persist the changes from story "C" to Apache Hive. [3]
F. Destroy the Current Master and replace it with the ingested Attunity data. [4]
G. Integrate and Test. [5]
Open Question for management: How long must we retain data for each data domain?
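To make the delta steps concrete, here is a rough pyspark sketch of stories C through F. This is my own illustration, not the client's code: the table names, the "policy_id" key, and the "amount" column are all hypothetical, and it assumes sqlContext is the Hive-enabled context that Zeppelin on HDP provides.

```python
from pyspark.sql import functions as F

# Hypothetical Hive tables: both share a business key ("policy_id").
master = sqlContext.table("current_master")
ingest = sqlContext.table("ingested_attunity")

# Story C: rows that are brand new, or whose values changed since the
# Current Master was built, are today's delta.
changed = (ingest.alias("i")
    .join(master.alias("m"),
          F.col("i.policy_id") == F.col("m.policy_id"),
          "left_outer")
    .where(F.col("m.policy_id").isNull() |
           (F.col("i.amount") != F.col("m.amount")))
    .select([F.col("i." + c) for c in ingest.columns])
    .withColumn("change_date", F.current_date()))

changed.registerTempTable("daily_changes")

# Stories D and E: append the deltas to the historical change log in Hive
# (assuming a change_log table with a matching schema already exists).
sqlContext.sql("INSERT INTO TABLE change_log SELECT * FROM daily_changes")

# Story F: the ingested data becomes the new Current Master.
sqlContext.sql("DROP TABLE IF EXISTS current_master")
sqlContext.sql("CREATE TABLE current_master AS SELECT * FROM ingested_attunity")
```

The key design point is that the master is only ever replaced wholesale (story F) after the change log has been appended, so the history the actuaries need is never lost.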
Wes at the whiteboard, sharing his vision for integrating the legacy mainframe (MF), Attunity, Apache Spark DataFrames with temp tables, SQLContext, Apache Hive, and HDFS to create a "daily delta process."
They are writing the implementation code now. Here is a sample, which I worked on with Aaron (I left out the names of some things on purpose). First, we needed to load the Attunity JDBC NvDriver, and there were some bear-ish dependencies to get right in our environment, which included pyspark, Zeppelin, and HDP:
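The shape of what we ended up with looks roughly like this. It is a minimal sketch, not the exact client code: the jar path, host, port, and data source name are placeholders, and you should verify the NvDriver class name and URL syntax against your own Attunity Connect documentation.

```python
# The Attunity driver jar has to be on both the driver and executor
# classpath. From the shell that looks like:
#
#   pyspark --jars /path/to/attunity-jdbc.jar \
#           --driver-class-path /path/to/attunity-jdbc.jar
#
# (In Zeppelin, the same jar goes into the Spark interpreter's
# dependency list instead, and sc / sqlContext are provided for you.)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="attunity-ingest")
sqlContext = SQLContext(sc)

# Connection details are placeholders -- real names omitted on purpose.
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:attconnect://<host>:<port>")  # verify URL syntax
      .option("driver", "com.attunity.jdbc.NvDriver")    # verify class name
      .option("dbtable", "<mainframe_table>")
      .load())
```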
After all of that got settled, he was able to read data from the DataFrame and write it back out again to Hive:
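Again a sketch: "attunity_ingest" and "ingested_attunity" are illustrative names, and persisting to Hive assumes the Hive-enabled sqlContext that Zeppelin on HDP provides.

```python
# Expose the DataFrame to SQL as a temp table (Spark 1.x style).
df.registerTempTable("attunity_ingest")

# Sanity check: read a few rows back out of the DataFrame.
sqlContext.sql("SELECT * FROM attunity_ingest LIMIT 10").show()

# Persist to Hive so the downstream delta stories can use it.
sqlContext.sql(
    "CREATE TABLE IF NOT EXISTS ingested_attunity AS "
    "SELECT * FROM attunity_ingest")
```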
Enjoy!
More about Attunity: https://www.attunity.com/
For more information, Laurent Weichberger, Big Data Bear, Hortonworks: lweichberger@hortonworks.com