HBase & Spark - Transformation+Aggregation of JSON.

I came across a use case where the processing gets a bit messy: data is stored in JSON format in HBase, and you need to do some transformation and aggregation on the JSON objects/arrays. Guess what: Spark is your answer.

Presuming you have Spark installed and know the basics, let's connect to HBase first: connect to the HBase table and create a JavaPairRDD.
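The original code screenshots don't survive here, so this is only a minimal sketch of the connection step, assuming the Spark 1.x-era Java API plus the HBase `TableInputFormat`; the app name, ZooKeeper quorum, and table name are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseConnect {
    public static JavaPairRDD<ImmutableBytesWritable, Result> readTable(String table) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("HBaseJsonJob"));

        // Picks up hbase-site.xml from the classpath; quorum set here for clarity
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set("hbase.zookeeper.quorum", "zk-host");   // placeholder host
        hbaseConf.set(TableInputFormat.INPUT_TABLE, table);

        // Each record is (row key, Result); the Result holds the JSON cell(s)
        return jsc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);
    }
}
```

This needs the HBase client/server jars on the classpath, which is exactly what the htrace jar wrangling in the spark-submit command below is about.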

Now the transformation:
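The article's transformation code isn't shown here either, so as a stand-in here is the per-record grouping-and-summing logic in plain Java, with a regex as a toy JSON extractor (field names `city` and `amount` are made up for illustration). In the real job you would parse with a proper library such as Jackson and express the same shape as a Spark `mapToPair(...).reduceByKey(...)` over the HBase rows:

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class JsonAggregate {

    // Toy JSON field extractors -- regex stand-ins for a real parser like Jackson
    static String strField(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*\"([^\"]*)\"").matcher(json);
        return m.find() ? m.group(1) : "";
    }

    static double numField(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*([0-9.]+)").matcher(json);
        return m.find() ? Double.parseDouble(m.group(1)) : 0.0;
    }

    // Group JSON records by "city" and sum their "amount" -- the same shape
    // a Spark mapToPair(...).reduceByKey(...) would have over the HBase rows
    public static Map<String, Double> totalsByCity(List<String> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> strField(r, "city"),
                Collectors.summingDouble(r -> numField(r, "amount"))));
    }

    public static void main(String[] args) {
        System.out.println(totalsByCity(List.of(
                "{\"city\":\"Pune\",\"amount\":10}",
                "{\"city\":\"Pune\",\"amount\":20.5}",
                "{\"city\":\"Delhi\",\"amount\":5}")));
    }
}
```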

And finally, save your data as a table and do whatever you want with it :)
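Again as a sketch only: assuming Spark 1.x-era APIs, and that `rows` (a `JavaRDD<Row>`) and `schema` (a `StructType`) are the outputs of the transformation step, registering the result as a table could look like this (table and column names are placeholders):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.StructType;

public class SaveAsTable {
    public static void save(JavaSparkContext jsc, JavaRDD<Row> rows, StructType schema) {
        SQLContext sqlContext = new SQLContext(jsc);
        DataFrame df = sqlContext.createDataFrame(rows, schema);

        // Register as a table and query it with plain SQL
        // (on Spark 2.x this is Dataset.createOrReplaceTempView)
        df.registerTempTable("json_agg");
        sqlContext.sql("SELECT city, SUM(amount) AS total FROM json_agg GROUP BY city")
                  .show();
    }
}
```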

Command to run the Spark job:

spark-submit --driver-memory 2g --executor-memory 2g \
  --files SparkApp.properties \
  --class "ul.spark.app.YourClass" \
  --master "spark://master:7077" \
  --driver-class-path "/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar" \
  --driver-java-options "-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar" \
  SparkAppV0.1.jar

Note that all spark-submit options must appear before the application jar; anything placed after the jar is passed to your main class as program arguments, not to spark-submit.

Would love to hear your feedback!

Happy Learning - DD

Thanks! It seems you are running a standalone CDH cluster. We are using an HDP cluster, and there aren't many samples on integrating Spark with HBase on HDP. I can get it working in local mode but am having difficulty getting it to work in yarn-client or yarn-cluster mode.

Hi Deepak, can you share your spark-submit settings and which jars you had to use to make it compile and execute? Thanks.

Good work Deepak. I am glad you are also sharing the knowledge widely.

Good work Deepak. Hope you are enjoying :)
