BigWorld with BigData: Apache Hive Optimizations — Part 1

Koushik Muthakana
Mar 9, 2021 · 4 min read


What is Apache Hive?

Apache Hive is a data warehouse analytics tool for processing structured data on top of Hadoop. It helps in querying and analyzing data stored in Hadoop.


What Apache Hive is Not

  • It is not a relational database.
  • It is not for real-time analysis.
  • It is not for transactional, real-time updates.

In this article we go through some Hive optimization techniques. By the end of the article you will be able to process huge amounts of data with the necessary optimizations in place.

Before jumping into Hive optimizations, let's understand Hive's components.

Hive contains several components: the User Interface, the HQL Process Engine, the Execution Engine, the Metastore, and HDFS.

  • User Interface: Provides the interaction between the user and HDFS. Hive supports a Web UI and the Hive command line.
  • Metastore: Hive stores the schemas of tables and databases, along with their HDFS mappings, in a relational database.
  • HQL Process Engine: Hive Query Language is similar to SQL for querying. It is an abstraction over MapReduce jobs.
  • Execution Engine: Processes the query and generates the results. It uses MapReduce jobs.
  • HDFS: The Hadoop Distributed File System stores the Hive tabular data.

Hive can process and query huge amounts of data, and optimizations can save a lot of processing time and cost. There are a bunch of optimization techniques:

  1. Table level optimizations (i. Partitioning, ii. Bucketing)
  2. File formats and compression techniques
  3. Join optimizations
  4. Execution engine (Tez, Spark)
  5. Vectorization

1. Table Level Optimizations

Optimizing at the table structure level helps in managing the data in an optimized way. We can achieve this in two ways.

i. Partitioning: Partitioning is logically splitting the table into subsets based on the values of particular columns. A table can have multiple partitions, each with a distinct name.

i.e., number of partitions = number of distinct values of the partition columns

The logically split data (partitions) are created as directories named with the partition column values, and each directory contains the data files. This structure lets a query scan only the relevant directories and skip the irrelevant ones, which improves query execution performance.
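For illustration, a hypothetical sales table partitioned by country and year (table and column names are assumptions for this example) would be laid out under Hive's default warehouse path roughly like this:

/user/hive/warehouse/sales/country=US/year=2020/000000_0
/user/hive/warehouse/sales/country=US/year=2021/000000_0
/user/hive/warehouse/sales/country=IN/year=2021/000000_0

A query filtering on country = 'US' then scans only the country=US directories.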

  • If the data in a partition is still quite large, we can further partition on additional columns if required, or, if the cardinality is high, go for bucketing.

Creating a partitioned table:

CREATE TABLE <table_name>(
<column_name1> <data_type>,
<column_name2> <data_type>,
<column_name3> <data_type>,
<column_name4> <data_type>,
...
)
PARTITIONED BY (
<partition_col1_name> <data_type>,
<partition_col2_name> <data_type>,
...
);
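For instance, a minimal sketch using the hypothetical sales dataset from above (all names are illustrative):

-- Hypothetical example: sales data partitioned by country and year
CREATE TABLE sales (
order_id BIGINT,
product STRING,
amount DOUBLE
)
PARTITIONED BY (
country STRING,
year INT
);

Note that the partition columns are declared only in the PARTITIONED BY clause, not in the main column list.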

There are two types of partitioning in Hive:

  1. Dynamic partitioning
  2. Static partitioning

1. Dynamic Partitioning: Hive takes care of the partitioning. Based on the distinct values of the partition columns, it creates the partitions dynamically, but it is slow.

  • Triggers a MapReduce job.
  • Does not require knowledge of the data.

Note: By default, dynamic partitioning is disabled.

Dynamic partitioning has two modes:

i. strict: strict mode requires at least one static partition column.

ii. nonstrict: non-strict mode turns off the strict requirement, i.e., all partition columns can have dynamic values.

Set these two properties before using dynamic partitioning:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
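With these properties set, a dynamic-partition insert could look like the following sketch (sales is the hypothetical table from above; sales_staging is an assumed unpartitioned source table). In a dynamic insert, the partition columns come last in the SELECT list:

-- Hive derives the country/year partitions from the selected values
INSERT INTO TABLE sales
PARTITION (country, year)
SELECT order_id, product, amount, country, year
FROM sales_staging;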

Optional properties:

-- To change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>;
-- To limit the maximum number of reducers:
set hive.exec.reducers.max=<number>;
-- To set a constant number of reducers:
set mapreduce.job.reduces=<number>;

2. Static Partitioning: We create the partition directories manually and load the data into them, but it is very fast.

  • Partitions are loaded manually.
  • Requires knowledge of the data.

Inserting the data:

INSERT INTO TABLE <table_name>
PARTITION (partition_col1_name = <partition_col1_value>, partition_col2_name = <partition_col2_value>)
SELECT
column_name1,
column_name2,
column_name3,
column_name4
FROM <source_table_name>;
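Concretely, a static-partition insert for the hypothetical sales table fixes the partition values in the statement itself:

-- Load one known partition; the partition values are hard-coded
INSERT INTO TABLE sales
PARTITION (country = 'US', year = 2021)
SELECT order_id, product, amount
FROM sales_staging
WHERE country = 'US' AND year = 2021;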

ii. Bucketing: Bucketing is similar to partitioning in that it splits the data into subsets, but bucketing creates files instead of directories.

We should use bucketing when a column has high cardinality (a large number of distinct values), which makes it a poor candidate for partitioning.

Tables or partitions are divided into sub-files based on a hash function applied to a column. Rows with the same hash value are grouped into the same file, which lets a query scan only the relevant files and skip the irrelevant ones. This improves query execution performance.
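Roughly, the bucket a row lands in is chosen as:

bucket_number = hash(bucketing_column_value) mod number_of_buckets

so all rows with the same bucketing column value always end up in the same file.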

By default, bucketing enforcement is disabled; to enable it:

set hive.enforce.bucketing = true;

Table Creation:

CREATE TABLE <table_name> (
<column_name1> <data_type>,
<column_name2> <data_type>,
<column_name3> <data_type>,
<column_name4> <data_type>,
...
)
CLUSTERED BY (column_name1,
column_name2, ...) INTO <no_of_buckets> BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
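As a sketch, the hypothetical sales table bucketed by order_id (a high-cardinality column) into 8 buckets:

-- Hypothetical example: bucket by a high-cardinality column
CREATE TABLE sales_bucketed (
order_id BIGINT,
product STRING,
amount DOUBLE,
country STRING,
year INT
)
CLUSTERED BY (order_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');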

Inserting the data:

INSERT INTO TABLE <table_name> VALUES (
column_name1_value,
column_name2_value,
column_name3_value,
column_name4_value,
...);

Note: A Hive table can have both partitioning and bucketing.

Syntax:

CREATE TABLE <table_name> (
<column_name1> <data_type>,
<column_name2> <data_type>,
<column_name3> <data_type>,
<column_name4> <data_type>,
...
)
PARTITIONED BY (
<partition_col1_name> <data_type>,
<partition_col2_name> <data_type>,
...
)
CLUSTERED BY (
<column_name1>,
<column_name2>,
...
) INTO <no_of_buckets> BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
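Putting both together for the hypothetical sales table: partition by the low-cardinality columns and bucket by the high-cardinality one:

-- Hypothetical example: partitioned by country/year, bucketed by order_id
CREATE TABLE sales_opt (
order_id BIGINT,
product STRING,
amount DOUBLE
)
PARTITIONED BY (
country STRING,
year INT
)
CLUSTERED BY (order_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');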

2. File Formats:

Hive stores the actual data as files in Hadoop. By choosing the right file format, we can optimize a processing job to a great extent. By default, Hive uses the text file format, but it additionally supports many other formats. We will discuss the most commonly used ones (see the sketch after this list):

  • Text File
  • Avro File
  • ORC File
  • Parquet File
  • Sequence File
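The format is chosen at table creation time with the STORED AS clause. For example, a minimal sketch creating the hypothetical sales table as Parquet instead of text:

-- Hypothetical example: same columns, stored as Parquet
CREATE TABLE sales_parquet (
order_id BIGINT,
product STRING,
amount DOUBLE
)
STORED AS PARQUET;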
