A Comprehensive Guide On Apache Hive Basics. | by Rahul Patidar

Jul 12, 2022

Hive is an open-source data warehouse, which means anyone can customise Hive's code and use it as per their requirements. Hive was developed by Facebook for distributed processing and data analytics. It is built on top of the Hadoop Distributed File System (HDFS): we store data in HDFS, create tables over that HDFS data, and query it just like SQL tables using Hive. Whenever we write a query in Hive, it is internally converted into a MapReduce job and submitted to the Hadoop cluster.

Parts of Hive Tables:

A Hive table is divided into two parts: the metadata (the table schema) and the actual data.

Let’s say we have one file with the below contents.

Id,Name,Salary

1, Rahul, 10000

2, Umesh, 20000

3, Pawan, 3000

In the above file, the header "Id, Name, Salary" is the metadata, which Hive stores in MySQL (the metastore database); the remaining three rows are our actual data, which lives in HDFS. Whenever we fire a select query on this data, Hive combines the metadata with the actual data and displays it in tabular format, just like a normal SQL table.
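As a minimal sketch (the table name employee_demo and the file path are illustrative, not from the original example), the file above could be exposed as a Hive table like this:

create table if not exists employee_demo (id int, name string, salary int)
row format delimited fields terminated by ','
tblproperties ("skip.header.line.count"="1");

load data local inpath '/path/to/employee.csv' into table employee_demo;

select * from employee_demo;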

Hive is not:

  • A relational database
  • A design for OnLine Transaction Processing (OLTP)
  • A language for real-time queries and row-level updates

Features of Hive:

  • Hive is fast and scalable.
  • It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or Spark jobs.
  • It is capable of analysing large datasets stored in HDFS.
  • It allows different storage types, such as plain text and RCFile.
  • It can operate on compressed data stored in the Hadoop ecosystem.
  • It supports user-defined functions (UDFs), through which users can provide their own functionality.

Architecture Of Hive:

Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:

  • Thrift Server — It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
  • JDBC Driver — It is used to establish a connection between hive and Java applications.
  • ODBC Driver — It allows the applications that support the ODBC protocol to connect to Hive.

The following are the services provided by Hive:

  • Hive CLI — The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
  • Hive Web User Interface — The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
  • Hive MetaStore — It is a central repository that stores all the structure information of the various tables and partitions in the warehouse, including column and column-type metadata.
  • Hive Server — It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
  • Hive Driver — It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.

Whenever a user submits a query, Hive converts it into a MapReduce job, takes the data from HDFS, merges it with the metadata stored in MySQL, and shows the result in tabular format.

Hive Data Types:

There are two kinds of data types in Hive: primitive data types (such as int, string, and bigint) and complex data types.

Aside from the primitive data types, Hive also provides complex data types that we can use to model our HDFS data.

Array : An array is an ordered collection of elements. The elements in the array must be of the same type.

Map: It is an unordered collection of key-value pairs. The keys must be of primitive types. Values can be of any type.

Struct: A Struct is a collection of elements of different types.

Arrays Example:

cat arrayDatafile.csv
1,Rahul,10000,Accenture|Tangoe|IBM|JIo
2,Avinash,20000,Paytm|Airtel|Jio
3,Ankit,3000,SG|BT|Jio
create table if not exists rahul44.Employee(id int,name string,
sal bigint,company array<string>)
row format delimited
fields terminated by ','
collection items terminated by '|';

Load data local inpath
'/home/rahulpatidar071020156806/hiveDatatypes/arrayDatafile.csv'
into table rahul44.Employee;

select * from rahul44.Employee;

select name,company[0] as company1 from rahul44.Employee;

Map Example:

cat mapfile.csv
1,Rahul,10000,company1#Accenture$company2#Tangoe$company3#IBM$company4#JIo
2,Avinash,20000,company1#PAYTM$company2#AIRTEL$company3#TCS$company4#JIo
3,Ankit,3000,company1#SG$company2#BT$company3#IBM$company4#JIo
create table if not exists Employee2(id int,name string,sal bigint,
company map<string,string>) row format delimited fields terminated by ','
collection items terminated by '$' map keys terminated by '#';

load data local inpath
'/home/rahulpatidar071020156806/hiveDatatypes/mapfile.csv'
into table Employee2;

select * from Employee2;
select name,company["company1"] as company1 from Employee2;

STRUCT Example:

cat /home/rahulpatidar071020156806/hiveDatatypes/structfile.csv
1,Rahul,10000,India$Karnataka$Bangalore$560068$IBM
2,Avinash,20000,India$AP$HYD$500001$Paytm
3,Ankit,30000,India$MP$bhopal$462030$Accenture
set hive.cli.print.header=true;
create table if not exists rahul44.Employee4(id int,name string,sal bigint,
companyaddr struct<country:string,state:string,city:string,pin:bigint,company:string>)
row format delimited fields terminated by ','
collection items terminated by '$' map keys terminated by '#';

load data local inpath
'/home/rahulpatidar071020156806/hiveDatatypes/structfile.csv'
into table rahul44.Employee4;

select * from rahul44.Employee4;

select companyaddr.country as country, companyaddr.state as state,
companyaddr.city as city,companyaddr.pin as pin,companyaddr.company as company from rahul44.Employee4;

System-Defined Functions in Hive:

There are mainly three categories of functions in Hive.

User-Defined Function (UDF): In Hive, users can define their own functions to meet certain client requirements; these are known as UDFs. UDFs are written in Java, and some are specifically designed for reusability of code across application frameworks. A UDF works on a single row in a table and produces a single row as output, so there is a one-to-one relationship between the input and output of the function.

Hive's built-in TRIM(), CONCAT(), and LENGTH() functions are all examples of UDFs. (sum() and avg(), by contrast, aggregate many rows; see UDAF below.)
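For instance, each of these built-in row-level functions returns exactly one value per input row (a sketch, assuming the student table used later in this post):

select name, trim(name) as trimmed_name, length(name) as name_length
from student;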

How to Create Your Own UDF in Hive:

Hive allows us to define our own UDFs as well. You can refer to the LinkedIn article below, written by Gaurav Singh.

https://www.linkedin.com/pulse/hive-functions-udfudaf-udtf-examples-gaurav-singh/
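Once such a UDF class is compiled into a jar, registering and calling it from Hive looks like the sketch below (the jar path and class name com.example.MyUpper are hypothetical):

add jar /path/to/my-udfs.jar;
create temporary function my_upper as 'com.example.MyUpper';
select my_upper(name) from student;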

UDTF: A user-defined tabular function works on one row as input and returns multiple rows as output, so the relationship is one-to-many in this case. Hive's built-in explode() and posexplode() functions are examples.

Let's say we have the column values [1,2,3,4]. When we apply the explode() function here, it will return 4 rows as output:

e.g. select explode(mark) as mark from student;

Output:
1
2
3
4

In Hive, explode(mark) works like a virtual table (view) and stores the data in different rows.

Student Table

studentname    subjects
rahul          [Math,Physics,Chemistry]
umesh          [Commerce,Accounting,Tally]

This will store the subject data like below:

Subjects:
Math
Physics
Chemistry
Commerce
Accounting
Tally

Let’s say we want the following output from the above data.

studentname    subjects
rahul Math
rahul Physics
rahul Chemistry
umesh Commerce
umesh Accounting
umesh Tally

To get the above output, Hive internally needs to join the actual table (student) with a virtual table (view) of subjects. This virtual table is also called a lateral view. The explode function internally creates the lateral view (subjects), joins it with the actual table (student), and produces the result, as the sketch below shows.
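A sketch of that query, assuming the table above is named student with columns studentname and subjects:

select studentname, subject
from student
lateral view explode(subjects) subjects_view as subject;

This produces exactly the one-subject-per-row output shown above.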

UDAF: User-defined aggregate functions work on more than one row and give a single row as output; Hive's built-in MAX() and COUNT() functions are examples. The relationship is many-to-one in this case. Let's say you have a table with students' names, ids, and total marks. If the table has 10 rows and we have to find the student who got the maximum marks, the query needs to check all 10 rows, but ultimately we get only one output: the maximum.
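For example (a sketch, assuming a student table with a total_marks column):

select max(total_marks) as highest_marks from student;

Hive scans every row of the table but returns a single aggregated row.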

Hive Set Operations:

Union: It removes duplicates across the queries and returns unique records.

Union All: It does not remove duplicates and returns all the records.
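A minimal sketch (the table names student_2021 and student_2022 are illustrative):

select name from student_2021
union
select name from student_2022;

select name from student_2021
union all
select name from student_2022;

The first query returns each name once; the second keeps every occurrence, including duplicates.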

A View: In Hive, a view is a virtual table that exposes a subset of data from a larger table. We can create multiple views from a single table and share them with different teams, or use them for different use cases, without exposing unnecessary columns.

Let's say I have a department table in my database with columns for multiple departments (IT, Admin, HR, etc.). I can then create three different views (IT_view, Admin_view, and HR_view) from this table and share them with the teams without exposing other departments' data. This is how we can restrict users from accessing other departments' data.

create view IT_view as
select IT_manager,IT_Code,IT_Head,IT_Avg_salary from department;

SELECT * FROM IT_view;

Views are also used to hide the complexity of a query. Let's say we have a complex query of 1000+ lines with multiple joins, and another team (business/testing) wants to access the data it produces. Instead of handing the query to those teams, we can create a view from it and share the view name; whenever they run select * from view_name, it will internally execute the complex query and return the result.
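A sketch of this pattern, assuming hypothetical customer and orders tables standing in for the 1000+ line query:

create view customer_sales_view as
select c.customername, sum(o.amount) as total_amount
from orders o
join customer c on o.customer_id = c.customer_id
group by c.customername;

select * from customer_sales_view;

The consuming team only ever sees the view name; the join and aggregation run behind it.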

Advantages of Views:

Restrict users from accessing unnecessary data.

Reduce query complexity.

Create Hive Database :

create database  IF NOT EXISTS Rahuldb;

Create Hive Table :

create table IF NOT EXISTS student (id Int,Name string,Company String)
COMMENT 'Employee details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

Let’s try to understand the properties that we are using in creating the table.

-IF NOT EXISTS: Checks whether a table with the same name already exists. If it does, Hive ignores the statement; if not, it creates the table.

-COMMENT 'StudentId': Adds a comment to a column.

-COMMENT 'Employee details': Adds a comment to the table.

-ROW FORMAT DELIMITED: Tells Hive that the table's records contain delimiters and that Hive needs to split each record based on the delimiter.

-FIELDS TERMINATED BY ',': Tells Hive to treat ',' as the field delimiter and store the value after each ',' in the next column.

-LINES TERMINATED BY '\n': Tells Hive to treat a newline as the line terminator; any newline in the data starts the next record.

-STORED AS TEXTFILE: Tells Hive to store the data in text file format.

-"skip.header.line.count"="1": Tells Hive to treat the first line as a header and not load the first row of the file into the table.

describe student;
This command describes the table structure and gives only limited information (column details only).

describe formatted student;
Displays additional information, in a format familiar to users of Apache Hive.

show create table student;
Displays the create statement for the table.

Loading data into a hive table:

We can load data into Hive from an HDFS location as well as from a local location.

Loading from a local location:

load data local inpath '/home/rahulpatidar071020156806/student.txt' into table student;

Loading from an HDFS location (just remove local from the command and give the HDFS path; it will load the data from the HDFS location into the Hive table):

load data inpath '/user/rahulpatidar071020156806/student.txt' into table student;

To print the header or column name with data, we need to set the below property in the hive.

set hive.cli.print.header=true;

How to Validate Hive Table Data

By default, Hive stores data for managed (internal) tables in the warehouse directory; we will discuss managed tables below.

Hive Table’s default warehouse directory is:

/apps/hive/warehouse/databasename.db/tablename/files/

hadoop fs -ls /apps/hive/warehouse/rahuldb.db/student

hadoop fs -cat /apps/hive/warehouse/rahuldb.db/student/student.txt

hadoop fs -cat /apps/hive/warehouse/rahuldb.db/student/student_copy_1.txt

Since we loaded the same file from two different locations (HDFS and local), Hive creates the second file with a *_copy_1.txt suffix.

We can change the warehouse location using the property below:

set hive.metastore.warehouse.dir="New_Path";

There are three ways to connect to Hive:

Hive command line (connect using the hive command directly).

HUE: Hadoop User Experience (provides a UI to execute Hive queries).

Beeline: A secure way to connect to Hive. This is the most preferred choice.

Connect command:

beeline -u jdbc:hive2://

Run an HQL file from Beeline:

beeline -u jdbc:hive2:// -f /home/rahulpatidar071020156806/query.hql

Execute a Hive query from Beeline:

beeline -u jdbc:hive2:// -e "select * from Rahuldb.student"

Hive Managed Tables :

1. The student table that we created is a managed table, as its data was placed inside the /apps/hive/warehouse directory, Hive's internal warehouse directory. In a managed table, both the table data and the table schema are managed by Hive. The data is located in a folder named after the table within the Hive data warehouse, which is essentially just a file location in HDFS.

i.e. Default warehouse Directory for Hive Table:

/apps/hive/warehouse/databasename.db/tablename/files/

2. We noticed one more thing: whenever we create a managed table, we have to load data into it using the load data command.

Loading from a local location:

load data local inpath '/home/rahulpatidar071020156806/student.txt' into table student;

Loading from an HDFS location (just remove local and give the HDFS path):

load data inpath '/user/rahulpatidar071020156806/student.txt' into table student;

3. Now let's see what happens when we drop or truncate a managed table in Hive.

drop table student;

If we drop a managed table, Hive deletes the complete table information, i.e., the metadata as well as the actual data. If we truncate it, the table definition remains but all of the actual data is deleted.

/*
Task:
1. Create the same student table again.
2. Load data using the load command.
3. Select data from the Hive table with a select command.
4. Check the data in the warehouse directory.
5. Truncate the Hive table.
6. Again select data from the Hive table.
7. Again check the data in the warehouse directory.
You will see that the student table still exists in the warehouse
(hadoop fs -ls) and in Hive (select command), but truncate deletes the
actual data, so you will not see any files or rows.
*/
create table  IF NOT EXISTS Rahuldb.student 
(id Int COMMENT 'StudentId',Name string,College String)
COMMENT 'student details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

load data local inpath '/home/rahulpatidar071020156806/student.txt'
into table student;

hadoop fs -ls /apps/hive/warehouse/rahuldb.db/student/
select * from Rahuldb.student;
truncate table Rahuldb.student;
select * from Rahuldb.student;
hadoop fs -ls /apps/hive/warehouse/rahuldb.db/student/

Please execute the Hadoop commands on the Hadoop terminal and the Hive queries on the Hive terminal. We can also execute Hadoop commands from the Hive terminal; to do that, prefix the command with ! and end it with ;, e.g.:

!hadoop fs -ls /apps/hive/warehouse/rahuldb.db/student/;

Hive External Tables:

  1. A Hive external table is not managed by Hive. For external tables, Hive stores the data in the LOCATION specified during creation of the table (generally not in the warehouse directory). If the external table is dropped, the table metadata is deleted but not the data.
  2. We don't need to execute a load command to load data. Instead of the load command, we can just give a location while creating the table, and Hive will directly read the data from that location.
  3. If we drop an external table, only the metadata (table information) is dropped, not the actual data.
  4. We cannot truncate a Hive external table (truncate deletes actual data, and an external table never deletes its actual data).
  5. We need to give the external keyword while creating a Hive external table.
  6. A Hive external table cannot be created from a local location; the data should be available at an HDFS location.

create  external table  IF NOT EXISTS Rahuldb.student_extab 
(id Int COMMENT 'StudentId',Name string,College String)
COMMENT 'student details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE
location '/user/rahulpatidar071020156806/externaltable/'
TBLPROPERTIES ("skip.header.line.count"="1");

select * from Rahuldb.student_extab;
truncate table Rahuldb.student_extab;

FAILED: SemanticException [Error 10146]: Cannot truncate non-managed table Rahuldb.student_extab.
!hadoop fs -ls /user/rahulpatidar071020156806/externaltable/

Now, whenever we want to load new data, we just copy the file to this HDFS location; the external table will directly read the data without any load or insert command.

!hadoop fs -cp /user/rahulpatidar071020156806/externaltable/student.txt 
/user/rahulpatidar071020156806/externaltable/student2.txt

select * from Rahuldb.student_extab;

Hive Subqueries:

A query present within another query is known as a subquery. The main query depends on the values returned by the subqueries.

Subqueries can be classified into two types:

Subqueries in the FROM clause.

Subqueries in the WHERE clause.

When to use:

Example: The query below combines name and college and returns them as a single column.

select t.* from
(select name from student_extab
union
select college from student_extab) t;

For example, if we need to retrieve customerName from the customer and order tables based on a join condition, we can combine the results using union inside a subquery and select from it in the outer query.
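For completeness, here is a sketch of a WHERE-clause subquery using the two student tables created earlier (Hive supports IN subqueries in the WHERE clause from version 0.13 onwards):

select * from Rahuldb.student_extab
where id in (select id from Rahuldb.student);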

Indexes In Hive:

A Hive index is used to speed up queries on the columns of a table that are indexed. The concept is no longer important in Hive, and indexing has been removed since Hive version 3.0; instead of indexes, we can now use column-based (columnar) file formats in Hive to optimise queries.
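For example, an existing table can be copied into the columnar ORC format with a CTAS statement (a sketch; the table name student_orc is illustrative):

create table student_orc stored as orc
as select * from Rahuldb.student;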

How to Create Index in Hive:

create index stud_id on table rahuldb.student (id) as 'COMPACT'
with deferred rebuild;

Here we are creating an index (stud_id) on the id column of the table rahuldb.student. Compact is the type of index we are creating, and deferred rebuild means the index is not built immediately; we can build or rebuild it later.
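Because the index was created with deferred rebuild, it must be built explicitly before Hive can use it (on versions prior to 3.0 that still support indexing):

alter index stud_id on rahuldb.student rebuild;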

Show Index:

Show Index on rahuldb.student;

Drop Index:

Drop index stud_id on rahuldb.student;

Explode and Lateral View:

In Hive, the explode function is used to split the values into rows.

Assume we have a Hive column (subject) of a complex data type (Array, Map, or Struct) and a name column of string data type, and the subject data looks like [Spark, Scala, Hive, Hadoop].

When we apply explode to the subject, it will split the column and will give the result in 4 rows as below:

select explode(subject) from Student_details;

Spark

Scala

Hive

Hadoop

The above query gives the expected result only when explode is the single expression in our Hive select statement.

Let’s assume we want name and subject and are expecting data like below.

Name subject

Rahul Spark

Rahul Scala

Rahul Hive

Rahul Hadoop

select name,explode(subject) from Student_details;

This will give an invalid-syntax error. To handle this scenario, we need to use a lateral view, which creates a virtual table for explode(subject), joins it with the actual table (Student_details), and gives the expected result.

cat explode.csv
Rahul,hive|hadoop|spark|pyspark
Avinash,airflow|javascript|reactjs
create table if not exists rahuldb.Student_details(name string,subject array<string>)
row format delimited
fields terminated by ','
collection items terminated by '|';

Load data local inpath '/home/rahulpatidar071020156806/explode.csv'
into table rahuldb.Student_details;

select * from rahuldb.Student_details;

select explode(subject) from rahuldb.Student_details;

select name,explode(subject) from rahuldb.Student_details;

select name,subjects from rahuldb.Student_details
LATERAL VIEW explode(subject) subject_list as subjects;

In this article, we learnt the basics of Hive. In the next article, we will discuss advanced Hive concepts.

Note:

Link for Next Blog: Hive Advance

LinkedIn

Important Links For Data Engineers:

1. Everything You Need To Know About Scala.

2. Top 20 Scala Programming Question For Interview.

3. Spark Dataframes For Beginner.

4. A Complete Guide On Spark Structure Streaming.

5. A Complete Guide On Apache Hive Basics.

6. A Complete Guide On Apache Hive Advance.

7. Hadoop Components Installation Guide In MacOs

8. Slowly Changing Dimension In Hive — DataWareHouse.

9. A Comprehensive Guide On Apache Sqoop

10. Linux Commands You’ll Actually Need for Your Data-Engineering Journey.

11. Apache Spark Checkpoints: Let’s Recover Lost Data From Offset Values.

Happy Learning!!

Note: To discover the best big-data blogs, visit Feedspot. This is the best Big Data blog list curated from thousands of blogs on the web and ranked by traffic, social media followers, domain authority, and freshness.
