BigWorld with BigData: Apache Hive Optimizations — Part 1
What is Apache Hive?
Apache Hive is a data warehouse analytics tool for processing structured data on top of Hadoop. It helps in querying and analyzing data stored in Hadoop.

What is Apache Hive Not?
- It is not a relational database.
- It is not for real-time analysis.
- It is not for transactional, real-time updates.
In this article we go through some Hive optimization techniques. By the end of the article you will be able to process huge amounts of data with the necessary optimizations.
Before jumping into Hive optimizations, let's understand Hive's components.

Hive consists of several components: the User Interface, the HQL Process Engine, the Execution Engine, the Metastore, and HDFS.
- User Interface: Provides the interaction between the user and Hive. Hive supports a WebUI and the Hive command line.
- Metastore: Hive stores the schemas of tables and databases, along with their HDFS mappings, in a relational database.
- HQL Process Engine: Hive Query Language (HQL) is similar to SQL for querying. Queries are compiled into an abstraction over MapReduce jobs.
- Execution Engine: Processes the query and generates the results. By default it uses MapReduce jobs.
- HDFS: The Hadoop Distributed File System stores the Hive table data.
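To see the Metastore and HDFS pieces together, DESCRIBE FORMATTED prints the schema and the HDFS location Hive tracks for a table (the table name here is hypothetical):
-- Shows columns from the Metastore plus the table's HDFS location
DESCRIBE FORMATTED sales;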
Hive can process and query huge amounts of data, and the right optimizations can significantly reduce processing time and cost. There are a bunch of optimization techniques:
1. Table level optimizations
   i. Partitioning
   ii. Bucketing
2. File Formats and Compression techniques
3. Joins Optimizations
4. Execution Engine (Tez, Spark)
5. Vectorization
1. Table Level Optimizations:
Optimization at the table structure level helps in managing the data in an optimized way. We can optimize the table structure in two ways.
i. Partitioning: Partitioning is the logical splitting of a table into subsets based on the values of particular columns. A table can have multiple partitions, each with a distinct name.
i.e. No. of partitions = No. of distinct values of the partition columns
The logically split data (partitions) are created as directories named with the partition column values, and each directory contains the data files. This layout lets a query scan only the relevant directories and skip the irrelevant ones, which improves query execution performance.
- If the data within a partition is still quite large, we can partition further on additional columns if required; if the cardinality of a column is high, go for bucketing instead.
Creation of partition table:
CREATE TABLE <table_name>(
<column_name1> <data_type>,
<column_name2> <data_type>,
<column_name3> <data_type>,
<column_name4> <data_type>,
...
)
PARTITIONED BY (
<partition_col1_name> <data_type>,
<partition_col2_name> <data_type>,
...
);
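As a concrete sketch (the table and column names here are hypothetical), a sales table partitioned by year and month, and a query that benefits from partition pruning:
CREATE TABLE sales (
  order_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (
  year INT,
  month INT
);

-- Only the year=2020/month=1 directory under the table location is scanned
SELECT SUM(amount)
FROM sales
WHERE year = 2020 AND month = 1;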
There are two types of partitioning in Hive.
- Dynamic Partition
- Static Partition
1. Dynamic Partition: Hive takes care of the partitioning. Based on the distinct values, it creates partitions dynamically, but it is slow.
- Triggers a MapReduce job.
- Does not require knowledge of the data.
Note: By default, dynamic partitioning is disabled.
Dynamic partitioning has two modes:
i. strict: Strict mode requires at least one static partition column.
ii. nonstrict: Non-strict mode turns off that requirement, i.e. all partition columns can take dynamic values.
Set these two properties before using dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Optional Properties:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
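For example, with the two dynamic-partition properties above set, a dynamic insert on the hypothetical sales table lets Hive derive the year/month directories from the data itself (sales_staging is a hypothetical source table):
-- Hive takes the partition values from the last columns of the SELECT
INSERT INTO TABLE sales PARTITION (year, month)
SELECT order_id, customer_id, amount, year, month
FROM sales_staging;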
2. Static Partition: You create the partitions manually by specifying their values, which makes loading very fast.
- Partitions are loaded manually.
- Requires knowledge of the data.
Inserting the data
INSERT INTO TABLE <table_name>
PARTITION (partition_col1_name='<value>', partition_col2_name='<value>')
SELECT
column_name1,
column_name2,
column_name3,
column_name4
FROM <source_table_name>;
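For instance, loading January 2020 data into the hypothetical sales table with a static partition (the partition values are fixed up front, so they do not appear in the SELECT list):
INSERT INTO TABLE sales PARTITION (year=2020, month=1)
SELECT order_id, customer_id, amount
FROM sales_staging
WHERE year = 2020 AND month = 1;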
ii. Bucketing: Bucketing is similar to partitioning in that it splits the data into subsets, but it creates files instead of directories.
We should use it when the candidate partition columns have high cardinality (a large number of distinct values).
Tables or partitions are divided into sub-files based on a hash function of one or more columns. Rows with the same hashed value are grouped into the same file, so a query can scan only the relevant files and skip the irrelevant ones. This helps to increase query execution performance.
By default bucketing is disabled; to enable it:
set hive.enforce.bucketing = true;
Table Creation:
CREATE TABLE <table_name> (
<column_name1> <data_type>,
<column_name2> <data_type>,
<column_name3> <data_type>,
<column_name4> <data_type>,
...)
CLUSTERED BY (column_name1,
column_name2, ...) INTO <no_of_buckets> BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
Inserting the data:
INSERT INTO TABLE <table_name> VALUES (
column_name1_value,
column_name2_value,
column_name3_value,
column_name4_value,
...);
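As a sketch of what bucketing buys you, assume a hypothetical users table created with CLUSTERED BY (user_id) INTO 4 BUCKETS; TABLESAMPLE can then read a single bucket file instead of the whole table:
-- Reads only the first of the four bucket files
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 4 ON user_id) u;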
Note: A Hive table can have both partitioning and bucketing.
Syntax:
CREATE TABLE <table_name> (
<column_name1> <data_type>,
<column_name2> <data_type>,
<column_name3> <data_type>,
<column_name4> <data_type>,
...)
PARTITIONED BY (
<partition_col1_name> <data_type>,
<partition_col2_name> <data_type>,
...
)
CLUSTERED BY(
<column_name1>,
<column_name2>,
...
) INTO <no_of_buckets> BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
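A concrete sketch combining both (names hypothetical): sales data partitioned by year and month, and bucketed by customer_id within each partition:
CREATE TABLE sales_bucketed (
  order_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (year INT, month INT)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');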
2. File Formats:
Hive stores the actual data as files in Hadoop. Choosing the right file format can optimize a processing job to a great extent. By default Hive uses the text file format, but it supports many other formats; we will discuss the most commonly used ones.
- Text File
- Avro File
- ORC File
- Parquet File
- Sequence File
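As a minimal sketch (the events table here is hypothetical), picking a columnar format happens at table creation time; ORC is shown, and Parquet works the same way with STORED AS PARQUET:
-- 'orc.compress' accepts NONE, ZLIB, or SNAPPY
CREATE TABLE events (
  event_id BIGINT,
  event_type STRING,
  ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');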