HBase & Spark - Transformation+Aggregation of JSON.

I came across a use case where the processing gets a bit messy: data is stored in JSON format in HBase, and you need to do some transformation and aggregation on the JSON objects/arrays. Guess what: Spark is your answer.

Presuming you have Spark installed and know the basics, let's connect to HBase first: connect to the HBase table and create a JavaPairRDD.
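The original code screenshots don't survive here, so this is only a minimal sketch of the connection step, assuming the Spark 1.x-era Java API plus the HBase `TableInputFormat`; the app name, ZooKeeper quorum, and table name are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseConnect {
    public static JavaPairRDD<ImmutableBytesWritable, Result> readTable(String table) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("HBaseJsonJob"));

        // Picks up hbase-site.xml from the classpath; quorum set here for clarity
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set("hbase.zookeeper.quorum", "zk-host");   // placeholder host
        hbaseConf.set(TableInputFormat.INPUT_TABLE, table);

        // Each record is (row key, Result); the Result holds the JSON cell(s)
        return jsc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);
    }
}
```

This needs the HBase client/server jars on the classpath, which is exactly what the htrace jar wrangling in the spark-submit command below is about.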

Now the transformation:
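The article's transformation code isn't shown here either, so as a stand-in here is the per-record grouping-and-summing logic in plain Java, with a regex as a toy JSON extractor (field names `city` and `amount` are made up for illustration). In the real job you would parse with a proper library such as Jackson and express the same shape as a Spark `mapToPair(...).reduceByKey(...)` over the HBase rows:

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class JsonAggregate {

    // Toy JSON field extractors -- regex stand-ins for a real parser like Jackson
    static String strField(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*\"([^\"]*)\"").matcher(json);
        return m.find() ? m.group(1) : "";
    }

    static double numField(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*([0-9.]+)").matcher(json);
        return m.find() ? Double.parseDouble(m.group(1)) : 0.0;
    }

    // Group JSON records by "city" and sum their "amount" -- the same shape
    // a Spark mapToPair(...).reduceByKey(...) would have over the HBase rows
    public static Map<String, Double> totalsByCity(List<String> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> strField(r, "city"),
                Collectors.summingDouble(r -> numField(r, "amount"))));
    }

    public static void main(String[] args) {
        System.out.println(totalsByCity(List.of(
                "{\"city\":\"Pune\",\"amount\":10}",
                "{\"city\":\"Pune\",\"amount\":20.5}",
                "{\"city\":\"Delhi\",\"amount\":5}")));
    }
}
```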

And finally, save your data as a table and do whatever you want with it :)
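Again as a sketch only: assuming Spark 1.x-era APIs, and that `rows` (a `JavaRDD<Row>`) and `schema` (a `StructType`) are the outputs of the transformation step, registering the result as a table could look like this (table and column names are placeholders):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.StructType;

public class SaveAsTable {
    public static void save(JavaSparkContext jsc, JavaRDD<Row> rows, StructType schema) {
        SQLContext sqlContext = new SQLContext(jsc);
        DataFrame df = sqlContext.createDataFrame(rows, schema);

        // Register as a table and query it with plain SQL
        // (on Spark 2.x this is Dataset.createOrReplaceTempView)
        df.registerTempTable("json_agg");
        sqlContext.sql("SELECT city, SUM(amount) AS total FROM json_agg GROUP BY city")
                  .show();
    }
}
```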

Command to run the Spark job:

spark-submit --driver-memory 2g --executor-memory 2g \
  --files SparkApp.properties \
  --class "ul.spark.app.YourClass" \
  --master "spark://master:7077" \
  --driver-class-path "/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar" \
  --driver-java-options "-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar" \
  SparkAppV0.1.jar

Note that all spark-submit options must appear before the application jar; anything placed after the jar is passed to your main class as program arguments, not to spark-submit.

Would love to hear your feedback!

Happy Learning - DD

Thanks! It seems you are running a standalone CDH cluster. We are using an HDP cluster, and there aren't many samples on integrating Spark with HBase on HDP. I can get it working in local mode but am having difficulty getting it to work in yarn-client or yarn-cluster mode.

Hi Deepak, can you share your spark-submit settings and which jars you had to use to make it compile and execute? Thanks.

Good work Deepak. I am glad you are also sharing the knowledge widely.

Good work Deepak. Hope you are enjoying :)
