Big Data Design-Level Optimization: Bucketing in Hive
We use Hive to query and analyze data sets stored on HDFS or AWS S3. When reading a specific subset from terabytes or petabytes of data, the read must be efficient; partitioning and bucketing are the best ways to make reads efficient and achieve significant performance gains.
Bucketing is one of the major optimization techniques we need to consider at table-design time, along with partitioning.
Bucketing works best on columns with high cardinality - that is, we choose bucketing when a column has a large number of distinct values.
Bucketing splits data into manageable parts by applying a hash function to the bucketing column.
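As a plain-Python sketch of the idea (not Hive's actual implementation): each row is routed to a bucket by hashing the bucketing column modulo the bucket count, so all rows with the same key land in the same file. For integer columns, Hive's hash is effectively the value itself, which the sketch assumes; the row data and file names are made up for illustration.

```python
# Sketch of hash bucketing: route rows to NUM_BUCKETS files by key.
# Simplification: for an int column, the hash is the value itself.
NUM_BUCKETS = 3

rows = [(101, "alice"), (102, "bob"), (103, "carol"), (104, "dave")]

buckets = {b: [] for b in range(NUM_BUCKETS)}
for cust_id, name in rows:
    buckets[cust_id % NUM_BUCKETS].append((cust_id, name))

for b in range(NUM_BUCKETS):
    print(f"file{b + 1}: {buckets[b]}")
```

Because the assignment depends only on the key, every row for a given cust_id is guaranteed to be in exactly one bucket file.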
Each bucket is stored as a file within the table's directory (or within each partition's directory) on HDFS or AWS S3.
Suppose we create a customer table with one entry per customer, bucket it on the cust_id column, and specify 3 buckets - the data will be divided into 3 files within the table directory:
table_name/file1
table_name/file2
table_name/file3
We need to set the following property for bucketing (in Hive 2.x and later this setting was removed and bucketing is always enforced):
SET hive.enforce.bucketing=true;
As with dynamic partitioning, we cannot load data directly into a bucketed table with the LOAD command; instead we follow a few steps, inserting data from one table into another:
=> Create a stage table (non-bucketed) and load data into it with the LOAD command
=> Create the bucketed table
=> Insert data from the non-bucketed table into the bucketed table with an INSERT INTO command
We reuse the customer table example from dynamic partitioning, using cust_id as the bucketing column and specifying 4 buckets.
1) Create the non-bucketed stage table:
Create table customer_stage (cust_id int, cust_name string, cust_contact bigint, country string)
row format delimited fields terminated by ',' ;
2) Load data into the non-bucketed table with a normal LOAD command:
Load data local inpath '/local_path/data/customer.csv' into table customer_stage;
3) Create Bucketed table:
Create table customer (cust_id int, cust_name string, cust_contact bigint, country string)
CLUSTERED BY (cust_id) INTO 4 BUCKETS
row format delimited fields terminated by ',';
4) Load data from the non-bucketed stage table into the bucketed table:
Insert into customer select * from customer_stage;
The hash function is applied to the cust_id column and the data is divided into the 4 buckets specified in the table DDL.
Bucketed table folder structure:
user/hive/warehouse/db_name/customer/file_0
user/hive/warehouse/db_name/customer/file_1
user/hive/warehouse/db_name/customer/file_2
user/hive/warehouse/db_name/customer/file_3
These buckets can also be used for sampling when analyzing a large data set, via the TABLESAMPLE clause in a SELECT query:
Select * from customer TABLESAMPLE(bucket 1 out of 4);
Select * from customer TABLESAMPLE(bucket 2 out of 4);
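A sketch of what BUCKET x OUT OF y selects, under the same simplified int-hash assumption as above: buckets are 1-indexed in the clause, so bucket 1 holds the rows whose hash modulo y is 0. When y matches the table's bucket count, Hive can read just one bucket file instead of scanning the whole table.

```python
# Sketch: simulate TABLESAMPLE(BUCKET x OUT OF y) on rows keyed by cust_id.
# Simplification: for an int column, the hash is the value itself.
def tablesample(rows, x, y):
    # BUCKET x OUT OF y is 1-indexed: bucket 1 holds rows with hash % y == 0.
    return [r for r in rows if r[0] % y == (x - 1)]

rows = [(i, f"cust_{i}") for i in range(1, 9)]
print(tablesample(rows, 1, 4))  # rows whose cust_id % 4 == 0
print(tablesample(rows, 2, 4))  # rows whose cust_id % 4 == 1
```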
BUCKETING WITH PARTITIONING:
We can use bucketing along with partitioning to further divide the data into multiple buckets inside each partition:
table_name/country=india/file1
table_name/country=india/file2
table_name/country=india/file3
table_name/country=usa/file1
table_name/country=usa/file2
table_name/country=usa/file3
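The two levels compose: the partition column picks the directory and the bucket hash picks the file inside it. A sketch of that routing (illustrative paths and the same simplified int hash as earlier; Hive's real bucket file names differ):

```python
# Sketch: route a row to partition-directory/bucket-file for a table
# partitioned by country and clustered by cust_id into 3 buckets.
NUM_BUCKETS = 3

def target_path(cust_id: int, country: str) -> str:
    bucket = cust_id % NUM_BUCKETS  # simplified: int hash == value
    return f"table_name/country={country}/file{bucket + 1}"

print(target_path(101, "india"))
print(target_path(102, "usa"))
```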
Set the following properties:
Partitioning Properties:
SET hive.exec.dynamic.partition=true ;
SET hive.exec.dynamic.partition.mode=nonstrict;
Bucketing Properties:
SET hive.enforce.bucketing=true;
1) Create the partitioned table with bucketing:
Create table customer (cust_id int, cust_name string, cust_contact bigint)
PARTITIONED BY (country string)
CLUSTERED BY (cust_id) INTO 4 BUCKETS
row format delimited fields terminated by ',';
2) Create the non-bucketed stage table:
Create table customer_stage (cust_id int, cust_name string, cust_contact bigint, country string)
row format delimited fields terminated by ',' ;
3) Load data into the non-partitioned, non-bucketed stage table with a normal LOAD command:
Load data local inpath '/local_path/data/customer.csv' into table customer_stage;
4) Insert data from the stage table into the partitioned, bucketed table:
Insert into customer partition(country) select * from customer_stage;
Follow me on LinkedIn and on my blog for more such articles on big data, data engineering, and the AWS cloud.
Blog: https://bigdatabygourav.blogspot.com/