Big Data Delta Processing with Attunity, Hive, Spark, and HDFS
I am teaching for Hortonworks on-site at a client in Illinois. I just conducted my Use Case Discovery Workshop to help them solve a real business case. We stepped through the questions I love to ask and came up with some User Stories. I gave them five-minute Sprints, using Agile-Scrum, so that we could get a real implementation in short order. (Shout out to my Agile mentor, Alistair Cockburn, for teaching me about ultra-short Sprints!)
Use Case Discovery Questions:
- Short Name?
- Problem Statement?
- Proposed Solution?
- Stakeholders?
- Solution Reviewers?
- Data Sources?
- Further thoughts? (Maybe something juicy here.)
I won't disclose all of the answers, because that would not serve; however, the problem is basically this: How do we capture changes to our data on a daily basis, while archiving the historical (legacy) version of the data for the actuaries and others who may need it for other queries?
After we all agreed on the problem, we shifted into writing Agile User Stories, with priorities. Stories (Name. Story. [Priority: 1 = high, 5 = low]; a rough sketch of the delta steps appears after the list):
A. Ingest data from the mainframe via Attunity, and then into Spark DataFrames and Hive tables. [1]
B. Ingest the "Master File" from the legacy EDW to HDFS. [1]
C. Compare the ingested Attunity data to the "Current Master" (delta process). [3]
D. Split the Master into a current table and a change log. [2]
E. Persist the changes from story "C" to Apache Hive. [3]
F. Destroy the Current Master and replace it with the ingested Attunity data. [4]
G. Integrate and Test. [5]
Open Question for management: How long must we retain data for each data domain?
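To make the delta steps concrete, here is a rough pyspark sketch of stories C through F. This is my own illustration, not the client's code: the table names, the "policy_id" key, and the "amount" column are all hypothetical, and it assumes sqlContext is the Hive-enabled context that Zeppelin on HDP provides.

```python
from pyspark.sql import functions as F

# Hypothetical Hive tables: both share a business key ("policy_id").
master = sqlContext.table("current_master")
ingest = sqlContext.table("ingested_attunity")

# Story C: rows that are brand new, or whose values changed since the
# Current Master was built, are today's delta.
changed = (ingest.alias("i")
    .join(master.alias("m"),
          F.col("i.policy_id") == F.col("m.policy_id"),
          "left_outer")
    .where(F.col("m.policy_id").isNull() |
           (F.col("i.amount") != F.col("m.amount")))
    .select([F.col("i." + c) for c in ingest.columns])
    .withColumn("change_date", F.current_date()))

changed.registerTempTable("daily_changes")

# Stories D and E: append the deltas to the historical change log in Hive
# (assuming a change_log table with a matching schema already exists).
sqlContext.sql("INSERT INTO TABLE change_log SELECT * FROM daily_changes")

# Story F: the ingested data becomes the new Current Master.
sqlContext.sql("DROP TABLE IF EXISTS current_master")
sqlContext.sql("CREATE TABLE current_master AS SELECT * FROM ingested_attunity")
```

The key design point is that the master is only ever replaced wholesale (story F) after the change log has been appended, so the history the actuaries need is never lost.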
Wes at the whiteboard, sharing his vision for integrating the legacy mainframe (MF), Attunity, Apache Spark DataFrames with temp tables, SQLContext, Apache Hive, and HDFS to create a "daily delta process."
They are writing the implementation code now. Here is a sample, which I worked on with Aaron (I left out the names of some things on purpose). First, we needed to load the Attunity JDBC NvDriver, and there were some bear-ish dependencies to get right in our environment, which included pyspark, Zeppelin, and HDP:
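The shape of what we ended up with looks roughly like this. It is a minimal sketch, not the exact client code: the jar path, host, port, and data source name are placeholders, and you should verify the NvDriver class name and URL syntax against your own Attunity Connect documentation.

```python
# The Attunity driver jar has to be on both the driver and executor
# classpath. From the shell that looks like:
#
#   pyspark --jars /path/to/attunity-jdbc.jar \
#           --driver-class-path /path/to/attunity-jdbc.jar
#
# (In Zeppelin, the same jar goes into the Spark interpreter's
# dependency list instead, and sc / sqlContext are provided for you.)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="attunity-ingest")
sqlContext = SQLContext(sc)

# Connection details are placeholders -- real names omitted on purpose.
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:attconnect://<host>:<port>")  # verify URL syntax
      .option("driver", "com.attunity.jdbc.NvDriver")    # verify class name
      .option("dbtable", "<mainframe_table>")
      .load())
```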
After all of that got settled, he was able to read data from the DataFrame and write it back out again to Hive:
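Again a sketch: "attunity_ingest" and "ingested_attunity" are illustrative names, and persisting to Hive assumes the Hive-enabled sqlContext that Zeppelin on HDP provides.

```python
# Expose the DataFrame to SQL as a temp table (Spark 1.x style).
df.registerTempTable("attunity_ingest")

# Sanity check: read a few rows back out of the DataFrame.
sqlContext.sql("SELECT * FROM attunity_ingest LIMIT 10").show()

# Persist to Hive so the downstream delta stories can use it.
sqlContext.sql(
    "CREATE TABLE IF NOT EXISTS ingested_attunity AS "
    "SELECT * FROM attunity_ingest")
```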
Enjoy!
More about Attunity: https://www.attunity.com/
For more information, Laurent Weichberger, Big Data Bear, Hortonworks: lweichberger@hortonworks.com