Hive Optimization Parameters
Hello All,
In this post I am going to write about a few Hive optimizations in terms of partitioning, I/O, and compression. I hope it benefits you in your everyday work as a Hadoop developer on Hive.
I think everybody is eager to know these optimization tricks (well, not tricks exactly: configuration parameters). I recently worked on a migration project from Oracle to Hive. My task involved rewriting Oracle queries in Hive and then validating them for output correctness and performance gain. One thing I observed is that tuning a query does not always mean changing the query itself. Much of the time we were simply unaware of configuration parameters and wasted hours modifying queries for performance. So here I am going to highlight a few Hive parameters that can give you a performance gain without spending too much time analyzing the queries.
Let's start with PARTITIONING. As you all know, Hive supports static and dynamic partitioning. Static partitioning requires hardcoding the partition names, which in turn forces you to change the query for every new partition. To avoid this, I suggest you go with dynamic partitioning wherever possible.
You can enable dynamic partitioning with the following two parameters:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
This allows Hive to create partitions dynamically based on values in the source table. (Note: if not enabled, the partition value must be hardcoded in the INSERT INTO/OVERWRITE query.)
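For illustration, here is a minimal sketch of a dynamic-partition insert. The tables sales_raw and sales_part are hypothetical; assume sales_part is partitioned by sale_date.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hive derives the partition value of each row from the last column
-- in the SELECT list (sale_date here), so nothing is hardcoded.
INSERT OVERWRITE TABLE sales_part PARTITION (sale_date)
SELECT order_id, amount, sale_date
FROM sales_raw;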
One obvious question everybody may have: "How do I control the number of dynamic partitions, so the query does not end up creating a huge number of partitions and defeating the purpose of optimization?"
No worries; there are a few more partitioning parameters that let you control or limit the number of partitions created.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
The above parameters let you raise the default limits on dynamic partitions: 1000 in total and 100 per node, out of the box.
One important thing to keep in mind when using dynamic partitions: don't end up creating too many partitions or too many small files. That causes the small-files problem and increases the NameNode's overhead, since it has to maintain metadata for every one of those files.
One thing you can do to avoid small-file creation is to use the DISTRIBUTE BY clause in the query. It avoids the creation of too many map files and uses reducers to produce a smaller number of files for the query result set. Hive uses the columns in DISTRIBUTE BY to distribute rows among reducers: all rows with the same DISTRIBUTE BY column values go to the same reducer. If you create dynamic partitions without a DISTRIBUTE BY, you will observe too many map files under each partition directory, which leads straight to the small-files problem.
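Continuing the same hypothetical sales example, a sketch of how DISTRIBUTE BY fits in: distributing on the partition column sends all rows of one partition to one reducer, so each partition directory ends up with a handful of reducer files instead of one file per mapper.

INSERT OVERWRITE TABLE sales_part PARTITION (sale_date)
SELECT order_id, amount, sale_date
FROM sales_raw
DISTRIBUTE BY sale_date; -- all rows for a given sale_date meet at one reducer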
Now, coming to the I/O part. You may have observed that jobs spend a lot of their time reading from and writing results to disk, and that is where performance is lost. So we should concentrate on I/O too. It is straightforward: by reducing I/O we gain performance. The question is "How do we reduce I/O?".
Say you are running a query whose results are written to disk at some final location after some validations or checks. Before those checks pass, the query results are held in a temporary disk location. We can reduce I/O by avoiding the data movement from the temp location to the final one, and avoiding that movement means writing the temporary results onto the same volume as the final location. Hive provides a parameter for exactly this:
set hive.optimize.insert.dest.volume=true;
It puts the Hive temp directory on the same volume as the final output directory. (Note: when using CREATE TABLE AS, this saves time writing data to the final HDFS directory by staging the temporary data on the same disk.)
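As a sketch of how it is used with CREATE TABLE AS (the summary table here is hypothetical):

set hive.optimize.insert.dest.volume=true;

-- temporary CTAS data is staged on the same volume as the final
-- directory, so the final move of the result set is cheap
CREATE TABLE sales_summary AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales_part
GROUP BY sale_date;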
Another I/O lever is forcing larger input splits, which avoids the creation of many small files by the mappers. Set the parameter below to your expected split size (in bytes):
set mapred.max.split.size=128000000;
It caps the input split size (about 128 MB here), which controls the number of mappers spawned in the session and, for map-only jobs, the number of output files they produce.
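A sketch of companion settings I would pair with it; CombineHiveInputFormat and mapred.min.split.size are standard Hive/Hadoop knobs, and the 128 MB figure is only an example:

-- pack many small input files into splits of roughly one HDFS block
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=128000000;
set mapred.max.split.size=128000000;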
Now moving on to compression parameters. Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage, both in terms of disk usage and query performance. So let's study a few compression parameters in Hive.
set hive.exec.compress.output=true;
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
The above parameters enable compression for the map output and the final job output, and let us specify which codec to use. Commonly available codecs include SnappyCodec, GzipCodec, and BZip2Codec, all under org.apache.hadoop.io.compress.
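Putting it together, a sketch of one session that compresses intermediate map output with Snappy and the final output with Gzip; hive.exec.compress.intermediate is the Hive-level switch for compressing data between stages, and the tables are the hypothetical ones from earlier.

set hive.exec.compress.intermediate=true; -- compress map output between stages
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true; -- compress the final job output
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE sales_part PARTITION (sale_date)
SELECT order_id, amount, sale_date
FROM sales_raw
DISTRIBUTE BY sale_date;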
If you are using the Parquet file format, use the parameter below for compression:
set parquet.compression=SNAPPY; -- or GZIP, or NONE
This parameter applies specifically to Parquet tables.
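You can also pin the codec on the table itself through TBLPROPERTIES, so every writer compresses it consistently (table definition hypothetical):

CREATE TABLE sales_parquet (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');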
Throughout this post we saw different Hive parameters that we can use for optimization. You might have noticed that we never modified a query for the sake of optimization; the few queries shown are only sketches to illustrate the parameters, because this post is all about parameters, as I stated at the start.
Thanks for taking the time to go through this post.
Vishal R. K.