Big Data Design-Level Optimization: Bucketing in Hive
We use Hive to query and analyze data sets stored on HDFS or AWS S3. When reading a specific subset from terabytes or petabytes of data, the read must be efficient; partitioning and bucketing are the best ways to make reads efficient and achieve significant performance gains.
Bucketing is one of the major optimization techniques we need to consider at table-design time, along with partitioning.
Bucketing works best on columns with high cardinality - that is, we choose bucketing when a column has a large number of distinct values.
Bucketing splits data into manageable parts by applying a hash function to the bucketing column.
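As a plain-Python sketch of the idea (not Hive's actual implementation): each row is routed to a bucket by hashing the bucketing column modulo the bucket count, so all rows with the same key land in the same file. For integer columns, Hive's hash is effectively the value itself, which the sketch assumes; the row data and file names are made up for illustration.

```python
# Sketch of hash bucketing: route rows to NUM_BUCKETS files by key.
# Simplification: for an int column, the hash is the value itself.
NUM_BUCKETS = 3

rows = [(101, "alice"), (102, "bob"), (103, "carol"), (104, "dave")]

buckets = {b: [] for b in range(NUM_BUCKETS)}
for cust_id, name in rows:
    buckets[cust_id % NUM_BUCKETS].append((cust_id, name))

for b in range(NUM_BUCKETS):
    print(f"file{b + 1}: {buckets[b]}")
```

Because the assignment depends only on the key, every row for a given cust_id is guaranteed to be in exactly one bucket file.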
Each bucket is stored as a file within the table's directory (or within each partition's directory) on HDFS or AWS S3.
Suppose we create a customer table with one entry per customer, bucket it on the cust_id column, and specify 3 buckets - the data will be divided into 3 files within the table directory:
table_name/file1
table_name/file2
table_name/file3
We need to set the following property for bucketing (in Hive 2.x and later this setting was removed and bucketing is always enforced):
SET hive.enforce.bucketing=true;
As with dynamic partitioning, we cannot load data directly into a bucketed table with the LOAD command; instead we follow a few steps, inserting data from one table into another:
=> Create a stage table (non-bucketed) and load data into it with the LOAD command
=> Create the bucketed table
=> Insert data from the non-bucketed table into the bucketed table with an INSERT INTO command
We reuse the customer table example from dynamic partitioning, using cust_id as the bucketing column and specifying 4 buckets.
1) Create the non-bucketed stage table:
Create table customer_stage (cust_id int, cust_name string, cust_contact bigint, country string)
row format delimited fields terminated by ',' ;
2) Load data into the non-bucketed table with a normal LOAD command:
Load data local inpath '/local_path/data/customer.csv' into table customer_stage;
3) Create Bucketed table:
Create table customer (cust_id int, cust_name string, cust_contact bigint, country string)
CLUSTERED BY (cust_id) INTO 4 BUCKETS
row format delimited fields terminated by ',';
4) Load data from the non-bucketed stage table into the bucketed table:
Insert into customer select * from customer_stage;
The hash function is applied to the cust_id column and the data is divided into the 4 buckets specified in the table DDL.
Bucketed table folder structure:
user/hive/warehouse/db_name/customer/file_0
user/hive/warehouse/db_name/customer/file_1
user/hive/warehouse/db_name/customer/file_2
user/hive/warehouse/db_name/customer/file_3
These buckets can also be used for sampling when analyzing a large data set, via the TABLESAMPLE clause in a SELECT query:
Select * from customer TABLESAMPLE(bucket 1 out of 4);
Select * from customer TABLESAMPLE(bucket 2 out of 4);
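A sketch of what BUCKET x OUT OF y selects, under the same simplified int-hash assumption as above: buckets are 1-indexed in the clause, so bucket 1 holds the rows whose hash modulo y is 0. When y matches the table's bucket count, Hive can read just one bucket file instead of scanning the whole table.

```python
# Sketch: simulate TABLESAMPLE(BUCKET x OUT OF y) on rows keyed by cust_id.
# Simplification: for an int column, the hash is the value itself.
def tablesample(rows, x, y):
    # BUCKET x OUT OF y is 1-indexed: bucket 1 holds rows with hash % y == 0.
    return [r for r in rows if r[0] % y == (x - 1)]

rows = [(i, f"cust_{i}") for i in range(1, 9)]
print(tablesample(rows, 1, 4))  # rows whose cust_id % 4 == 0
print(tablesample(rows, 2, 4))  # rows whose cust_id % 4 == 1
```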
BUCKETING WITH PARTITIONING:
We can use bucketing along with partitioning to further divide the data into multiple buckets inside each partition:
table_name/country=india/file1
table_name/country=india/file2
table_name/country=india/file3
table_name/country=usa/file1
table_name/country=usa/file2
table_name/country=usa/file3
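The two levels compose: the partition column picks the directory and the bucket hash picks the file inside it. A sketch of that routing (illustrative paths and the same simplified int hash as earlier; Hive's real bucket file names differ):

```python
# Sketch: route a row to partition-directory/bucket-file for a table
# partitioned by country and clustered by cust_id into 3 buckets.
NUM_BUCKETS = 3

def target_path(cust_id: int, country: str) -> str:
    bucket = cust_id % NUM_BUCKETS  # simplified: int hash == value
    return f"table_name/country={country}/file{bucket + 1}"

print(target_path(101, "india"))
print(target_path(102, "usa"))
```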
Set the following properties:
Partitioning Properties:
SET hive.exec.dynamic.partition=true ;
SET hive.exec.dynamic.partition.mode=nonstrict;
Bucketing Properties:
SET hive.enforce.bucketing=true;
1) Create the partitioned table with bucketing:
Create table customer (cust_id int, cust_name string, cust_contact bigint)
PARTITIONED BY (country string)
CLUSTERED BY (cust_id) INTO 4 BUCKETS
row format delimited fields terminated by ',';
2) Create the non-bucketed stage table:
Create table customer_stage (cust_id int, cust_name string, cust_contact bigint, country string)
row format delimited fields terminated by ',' ;
3) Load data into the non-partitioned, non-bucketed stage table with a normal LOAD command:
Load data local inpath '/local_path/data/customer.csv' into table customer_stage;
4) Insert data from the stage table into the partitioned, bucketed table:
Insert into customer partition(country) select * from customer_stage;
Follow me on LinkedIn and on my blog for more such articles on big data, data engineering, and the AWS cloud.
Blog: https://bigdatabygourav.blogspot.com/