Big Data: Part-3 Intro to Hive
🥽What is Hive?
Hive is a query interface on top of Hadoop’s native MapReduce. Remember, Hive is a data warehouse; it is not an RDBMS!
Hive allows us to write SQL-style queries in its own language, known as Hive Query Language (HQL).
🔦️Why do we need Hive?
In Hadoop, our data is stored in HDFS, and as I told you in the article Big Data: Part 1, MapReduce is used for processing that data. But there is a problem: the MapReduce framework needs a person who knows Java. So if you want to work with this HDFS data, you should be a Java expert!
Facebook noticed this problem and built Hive (later donated to Apache), which lets you process this HDFS data (the structured part) even if you don’t know Java.
The Hive execution engine converts a script written in HQL into MapReduce jobs. These jobs are packaged into JAR files and, when executed, run exactly as hand-written MapReduce programs would.
Hive reads its data from HDFS and works with structured data in Hadoop.
An operation like a join would need around 200–300 lines of Java MapReduce code, whereas in SQL it is just a couple of lines. You could say that Hive was developed as an abstraction interface on top of native MapReduce to let SQL developers exploit the power of Hadoop.
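For a sense of scale, that same join in HQL really is just a couple of lines (the emp and dept tables here are purely hypothetical examples):
SELECT e.name, d.dept_name
FROM emp e JOIN dept d ON (e.dept_id = d.id);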
📢Do Tables in Hive store Data?
In Hive, tables and data are stored separately. The data is stored only in HDFS; tables are just projected over that HDFS data. Remember, data is not stored inside Hive tables the way it is in an RDBMS! Hive only stores the table’s schema information (table metadata) in its metastore, which is borrowed from an RDBMS (Derby is the default). In production, Derby is replaced by a multi-user database such as Oracle or MS SQL Server, since Derby is a single-user database.
The table metadata is stored separately from the data.
But there are a few things Hive cannot do effectively:
🎤Ad-hoc real-time queries (OLTP transactions) → Hadoop is built for batch processing over large datasets, so it should not be used for transactional queries.
🎤No ACID support (recently, limited ACID support has been incorporated into Hive).
🎤Not suitable for frequent updates and inserts; these were allowed only in recent releases of Hive.
🎤Not meant for unstructured data analysis.
🎶How to install Hive on an Ubuntu machine
→First start all the Hadoop daemons using start-all.sh in the terminal and then execute this:
sudo wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
→Secondly extract this tar file:
sudo tar xvzf apache-hive-3.1.3-bin.tar.gz
→Thirdly, add the path in the .bashrc file:
sudo nano .bashrc
export HIVE_HOME=/home/hadoop/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin
→Applying changes to .bashrc file
source ~/.bashrc
→Edit the hive-config.sh file to tell Hive where Hadoop is installed:
sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/hadoop/hadoop-3.3.6
→Now create a tmp directory in HDFS to store temporary results of queries we will be running on Hive:
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
The directory warehouse will be used to store the table or data related to the hive in HDFS. The directory tmp is the location to store the intermediate result of processing.
→To configure Hive with Hadoop, we need to edit the hive-env.sh file. For this, go to the conf folder:
cd apache-hive-3.1.3-bin/conf
sudo vim hive-env.sh
export HADOOP_INSTALL=/home/hadoop/hadoop
→Now download Apache Derby:
wget https://dlcdn.apache.org/db/derby/db-derby-10.16.1.1/db-derby-10.16.1.1-bin.tar.gz
sudo tar xvzf db-derby-10.16.1.1-bin.tar.gz -C /home/hadoop
cd ..
→Set up the Derby environment by appending the following lines to the .bashrc file:
export DERBY_HOME=/home/hadoop/db-derby-10.16.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
source ~/.bashrc
→Create a directory named data in DERBY_HOME directory to store metadata:
sudo mkdir $DERBY_HOME/data
→Configure the metastore → specify to Hive where the database is stored:
cd $HIVE_HOME/conf
sudo cp hive-default.xml.template hive-site.xml
sudo vim hive-site.xml
Check that these configurations are present in it:
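For reference, these are the kinds of properties to look for (a minimal sketch assuming the default embedded Derby metastore and the warehouse directory we created above; your generated hive-site.xml will contain many more entries):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>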
→Set permissions on the Hive folder:
sudo chown -R hadoop apache-hive-3.1.3-bin
→Metastore schema initialization
cd apache-hive-3.1.3-bin/bin
schematool -dbType derby -initSchema
Now type hive in the terminal and the Hive shell will start.
Note: JDK 11 throws errors when installing Hive, so make sure you have JDK 8.
So, this completes the installation of Hive on the Ubuntu machine. Follow each step and you will be able to install it easily. This is how your terminal will look once you are in the Hive shell.
🧠Working with Hive
To work with Hive, we first have to create a database and then navigate into it.
create database DatabaseName;
show databases;
use DatabaseName;
quit; -->will get you out of hive shell
show tables;
#Creation of table in hive
create table emp(id int, name string, salary int)
> row format delimited
> fields terminated by ','
> stored as textfile;
#Loading the data into the table --> If the file is present in local file system
load data local inpath '/home/hadoop/hiveData/employees.txt' into table emp;
#If the file is present in hdfs
load data inpath '/home/hadoop/hiveData/employees.txt' into table emp;
#Fetching data from the table
SELECT * FROM emp;
#Deleting an empty database, i.e. one that has no tables
drop database databaseName;
#Deleting a database which is not empty
drop database DatabaseName cascade;
#Deleting a table
Drop table TableName;
#Suppose you have tableA and you want to create tableB with the same schema/structure as tableA (ONLY the schema, not the data)
Create table tableB LIKE tableA stored as textfile;
#Suppose you want to store the table data in parquet file format
Create table ABC LIKE DEF stored as parquet;
#Loading data into this
Insert into table ABC Select * from DEF;
#DEF has its data stored in text file format
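If you want to confirm how a table is actually stored, describe formatted prints the storage information (SerDe, InputFormat, OutputFormat) of the table:
#Check the Storage Information section (SerDe, InputFormat, OutputFormat) in the output
describe formatted ABC;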
🙄How does the Hive store the data?
I’m repeating it: the actual data is not stored in Hive, it is stored in HDFS; Hive only stores the metadata (in the default Derby database).
📌First, understand the storage of the data loaded into a Hive table
The warehouse is the directory on HDFS that is used by the Hive execution engine to store the databases and tables we create in Hive. You can see it clearly in this image:
By default, Hive stores the data at /user/hive/warehouse in HDFS. As you can see in the above image, the data (employees.txt) which I loaded into the Hive table is stored in HDFS under the warehouse directory. This is configured in hive-site.xml (the hive.metastore.warehouse.dir property) as shown in the image below:
📌Now, understand the storage of metadata, i.e. data about data
In the metastore (the Derby database by default), we store information about the tables we have projected over the data stored in HDFS using Hive. So the table’s schema-related info is stored here, in an RDBMS.
In hive-site.xml, we set the property declaring that we will be using the Derby database as the metastore. By default, the metastore database name is metastore_db.
Derby, by default, creates a metastore_db folder in the current directory (on the local file system, not HDFS) for storing the Hive metadata. The database can also be created in a specific folder, using the javax.jdo.option.ConnectionURL property in hive-site.xml, as you can see in the image above.
If you want to create a database at some other location in HDFS rather than in the warehouse directory:
create database demoDatabase location '/home/hadoop/hiveData';
You can check this creation with hadoop fs -ls /home/hadoop/hiveData.
As you can see in the image above, when I tried creating a database in the hiveData directory instead of the warehouse directory, a folder with the name of the table was created in that directory (demotable1 inside demodatabase). The second thing I want you to notice is that employees.txt was the file I used to load data from the local file system into demotable1, and you can see that it too is now present in the demotable1 folder on HDFS.
A note regarding an error which I faced and you might face too:
❌If you are facing an error like the one below while creating tables and databases in some directory other than the warehouse:
FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
🧩I cracked the solution for it after many hits and trials. You just need to make a change in the hive-site.xml file, in the lines below:
cd $HIVE_HOME/conf
sudo vim hive-site.xml
Make sure that you have set the databaseName attribute as I am showing below:
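Roughly, the change is in the javax.jdo.option.ConnectionURL property: give databaseName an absolute path (the path below is from my setup and is only an illustration; point it at wherever your metastore_db lives):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/hadoop/apache-hive-3.1.3-bin/metastore_db;create=true</value>
</property>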
External Tables in Hive
We can have two different types of Hive Tables:
🪀Managed_Tables or Internal Tables
🪀External_Tables
To figure out whether a table is managed or external, type this in the Hive shell and look at the tableType field in the output (MANAGED_TABLE or EXTERNAL_TABLE):
describe extended tableName;
Managed tables, or internal tables, are the default table type in Hive. Whenever you don’t explicitly tell Hive which type of table to create, it creates an internal table.
With managed tables, if you drop a table, then all the data of the table (the file in HDFS) as well as the metastore data related to that table is deleted.
Now let’s focus on External Tables:
hadoop fs -put employees.txt /testDemo #Copying a file from local file system to hdfs
#Creating the database at the specified location
create database externalDemo location '/home/hadoop/hiveData';
#Creating the external table
#Notice the word external is used
#if not exists --> optional; it makes sure the table is not created again if it already exists
create external table if not exists demoExternalTable1 (id int, name string, salary int)
> row format delimited
> fields terminated by ','
> stored as textfile
> location '/testDemo';
#The thing to notice here is the location.
#We have put the file employees.txt at this location, and we are asking the Hive engine to look at the data present there on HDFS.
👀The thing to notice in the case of an external table is that we are not loading the data with something like load data in...; instead, at table-creation time itself, we point the table at the location where the data is already present in HDFS. This is the beauty of external tables: they do not move the data into their own folder, they just keep pointing to the location where the data already sits in HDFS.
🔈️In the case of external tables, if you drop the table, then only the metastore data is deleted while the actual data remains as it is in HDFS. This is the important one! In production, you generally pull data from the source, clean it, and then store it in HDFS. Now imagine that, with an internal table, you drop the table by mistake: all this data of yours would be gone. That is why people prefer external tables in production.
🔈️The second advantage of an external table is that you can have just read permission over data that belongs to some other department and still run your Hive queries over it. The data can lie anywhere on HDFS; you can project your table structure over it without moving or copying it from its original location.
Note: For internal tables, stored as textfile/parquet influences how the data is physically stored within the Hive warehouse directory, while for external tables it tells Hive how to interpret the data files present at the location it points to.
🔮Loading data into Hive Tables
Sometimes a file can have complex data structures inside a column, and we have to store that in a Hive table. So let’s learn how to deal with this.
Suppose I have a text file (friends.txt) like this; as you can see, the last column has many values inside it.
1,alice,28,ryan#faiza
2,ainee,23,deep#harsh
3,tom,24,shreya#ram#shyam
hadoop fs -put friends.txt /test1/ #Putting this file to HDFS
#Creating the table
create table if not exists friendsData(id int, name string, age int, friends array<string>)
> row format delimited
> fields terminated by ','
> collection items terminated by '#'
> lines terminated by '\n'
> stored as textfile;
#Notice we have used the array data structure here
load data local inpath '/home/hadoop/hiveData/friends.txt' into table friendsData;
This is how our array elements will be stored:-
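Once the data is loaded, the array column can be queried directly: elements are accessed by a 0-based index and size() returns the number of elements. A quick check on this table:
#First friend and total number of friends for each person
select name, friends[0], size(friends) from friendsData;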
Now, let’s try to deal with some more complex data. So, now we will try to push the data given below into the hive table:
1,amu,28|private|45.0,aligarh|1998
2,wisdom,89|semi-private|36.0,pune|1289
3,olf,78|government|25.0,rajasthan|1445
hadoop fs -put school.txt /test1 #Putting this above file in HDFS
#Creating this table
create table SchoolDetails(id int,name string,information struct<years:int,type:string,rating:double>, basics struct<city:string, estb:int>)
> row format delimited
> fields terminated by ','
> collection items terminated by '|';
#Loading the data into the table
load data local inpath 'school.txt' into table SchoolDetails;
See, just like an array, a struct is a data structure where we can store a collection of elements that have mixed data types. This is how the above data looks after loading it into the Hive table.
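Struct fields are read with dot notation, so a query over this table looks something like this:
#Accessing struct fields with dot notation
select name, information.type, information.rating, basics.city from SchoolDetails;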
🎒Simple Operations on Hive Table
#RENAMING THE TABLE
alter table schooldetails rename to UniversityDetails;
#RENAMING A PARTICULAR COLUMN OF THE TABLE
alter table universitydetails change id UniversityId int;
#ADDING A NEW COLUMN TO AN EXISTING TABLE
alter table universitydetails add columns (address string);
#DROPPING AN EXISTING COLUMN
alter table tableName replace columns(list only the columns to be retained, with their data types); #The rest of the columns are dropped
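As a concrete (hypothetical) illustration on the table above, keeping only the id and name columns would look like this:
#Only universityid and name are retained; information, basics and address are dropped from the schema
alter table universitydetails replace columns(universityid int, name string);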
🏝️Query Operations on Hive Table
Let’s create a table first, but before that I want to share something. You can have data files with headers and footers, and while loading the data you will surely want to skip them. Hive has table properties for exactly this purpose (skip.header.line.count and skip.footer.line.count), set through the tblproperties clause.
#Creating a table
create table if not exists friends.friendsList(firstName string, lastName string ,email string, phone int)
> row format delimited
> fields terminated by ','
> lines terminated by '\n'
> tblproperties('skip.header.line.count'='1');
#By qualifying the table name as friends.friendsList we can skip running the statement 'use friends;' first
#Loading the data
load data local inpath 'demoFile.txt' into table friends.friendsList;
#Now you can run queries like
SELECT * FROM friendsList ORDER BY firstName; #a MapReduce job will run for this
#Get the number of rows in the table
SELECT count(*) FROM friendslist; #a MapReduce job will run for this
#Using Group By clause
SELECT count(*) FROM friendslist GROUP BY lastname;
SET hive.auto.convert.join = false; #Disable map join
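Coming back to the GROUP BY query above: since only count(*) is selected, the output will not tell you which last name each count belongs to. Including the grouping column in the SELECT makes the result readable:
SELECT lastname, count(*) FROM friendslist GROUP BY lastname;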
#Views are logical tables; they do not occupy space like physical tables
#If the parent table is dropped, the view becomes invalid and queries over it will fail
#You can run your regular queries over views
Create view if not exists viewName AS SELECT * FROM TableName WHERE Age >= 18;
#Deleting a view
drop view viewName;
#Check if it is a view
describe formatted viewName; #Table Type will show VIRTUAL_VIEW for a view
Note: One error that you can face while running these queries could be like this:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
The solution is to make some changes in the mapred-site.xml file located at <HADOOP_HOME>/etc/hadoop/mapred-site.xml. Your mapred-site.xml should have content something like this:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop-3.3.6</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop-3.3.6</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop-3.3.6</value>
</property>
<property>
<!-- You can get the value for this property by typing 'hadoop classpath' in the terminal -->
<name>mapreduce.application.classpath</name>
<value>/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/share/hadoop/common/lib/*:/home/hadoop/hadoop/share/hadoop/common/*:/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop/share/hadoop/hdfs/*:/home/hadoop/hadoop/share/hadoop/mapreduce/*:/home/hadoop/hadoop/share/hadoop/yarn:/home/hadoop/hadoop/share/hadoop/yarn/lib/*:/home/hadoop/hadoop/share/hadoop/yarn/*</value>
</property>
</configuration>
📂Partition in Hive
Let’s understand the concept of partitioning. Imagine your organization spans various countries, with branches in India, the UK, and Singapore, and each branch has a sizable workforce. Now picture yourself needing information about a specific employee based in the Indian branch. If you fire a query over the entire dataset of employees from all the branches, isn’t that wasted work? If you need to search for an employee based in the Indian branch, just fire the query over the dataset of employees of the Indian branch: this is where partitioning comes to help!
Through partitioning, we split the data into folders, where each folder holds all the records belonging to a particular country. In the above scenario, partitioning on the country column is a wise decision: we get results faster and also save processing resources.
Now we have two types of Partitioning:-
🥁Static/Manual Partitioning
🥁Dynamic Partitioning
In static partitioning, we partition on columns that are not present in the input file; we supply the partition values ourselves when each input file is ingested. Hold on! You will get it better with the example below:
Suppose we have an Input file like this that has to be loaded in Hive:
#(Employees1.txt) This is the data of employees working in the branch based in India(Uttar Pradesh).
198 Donald 2000
199 Douglas 3000
200 Jennifer 4000
201 Michael 5000
#(Employees2.txt) This is the data of employees working in the branch based in India(Rajasthan)
19 Faiza 2000
13 Ritwiz 3000
20 Anushka 4000
25 Tom 5000
We will partition on the basis of country and state:
create table employees1(id int, name string, salary int)
> partitioned by (country string, state string)
> row format delimited
> fields terminated by ' ';
#Loading the data
load data local inpath '/home/hadoop/employees1.txt' into table employees1
> partition(country="India", state="UttarPradesh");
load data local inpath '/home/hadoop/employees2.txt' into table employees1
> partition(country="India", state="Rajasthan");
Notice that the columns we gave in the partition clause also became columns of the table. Now it is time to see how this partitioning has actually stored the data we loaded.
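You can also confirm the partitions from the Hive shell itself before looking at HDFS:
#Lists country=India/state=UttarPradesh and country=India/state=Rajasthan
show partitions employees1;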
You can see how two different directories have been created based on the partitions we defined. Woohoo!! We have mastered static partitioning.
Let’s move on to dynamic partitioning now. To enable dynamic partitioning in Hive, execute these commands first in the Hive shell:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
In dynamic partitioning, data is routed to its respective partition automatically, whereas in static partitioning we manually load the data into the desired partitions. (The nonstrict mode set above allows all partition columns to be determined dynamically; in strict mode, at least one partition column must be given a static value.)
Suppose we have an input file like this [Employees3.txt]:
0 Jose Lopez 8500 India UttarPradesh
1 Diane Carter 7000 UK England
2 Shawn Foster 17000 India UttarPradesh
3 Brenda Fisher 10000 India Rajasthan
4 Sean Hunter 14500 UK England
#Now first create the table without any partition
create table employees2(id int, firstName string, lastName string, salary int, country string, state string)
> row format delimited
> fields terminated by ' ';
#Load the data
load data local inpath '/home/hadoop/employees3.txt' into table employees2;
#Create the partitioned table
create table employees3(id int, firstName string, lastName string, salary int)
> partitioned by(country string, state string)
> row format delimited
> fields terminated by ' ';
#Load the data into the partitioned table
insert into employees3
> partition(country, state)
> select id, firstName, lastName, salary, country, state
> from employees2;
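To check what got created, and to see how a query touches only one partition, something like this can be run:
#List the partitions that were created dynamically
show partitions employees3;
#Query a single partition; Hive reads only that partition's directory
select * from employees3 where country='India' and state='Rajasthan';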
Now is the time to see the directory structure after doing this dynamic partition:
🪣Bucketing in Hive
Suppose we have to load the weather reports of the last five years into Hive. Now imagine we decide to partition based on date: we would end up with 365*5 folders. So partitioning is not a good idea here!
This is where we use bucketing. In bucketing, we use the concept of hashing. Suppose we have millions of rows and we declare that the data has to be placed in 5 buckets; then the hash function works like hash(columnUsedForBucketing) % noOfBuckets, and each record is placed in the corresponding bucket. For example, with 3 buckets on the integer column id, a row with id = 7 goes to bucket 7 % 3 = 1, i.e. the second bucket file.
Now let’s see it practically :
#We have sample data like this
1,aligarh,uttar pradesh,2-10-23
2,rajasthan,rajasthan,3-5-23
3,mathura,uttar pradesh,7-8-22
4,allahabad,uttar pradesh,1-8-21
5,jaipur,rajasthan,9-6-20
6,surat,gujarat,4-7-19
7,lucknow,uttar pradesh,2-4-18
8,imphal,manipur,4-7-18
#Creating the tables
create table weatherDemo(id int, city string, state string, reportDate date)
> row format delimited
> fields terminated by ','
> stored as textfile;
load data local inpath '/home/hadoop/weather.txt' into table weatherdemo;
#Creating the bucket table
create table weatherDemoBucket(id int, city string, state string, reportDate date)
> clustered by (id) into 3 buckets
> row format delimited
> fields terminated by ','
> stored as textfile;
#Loading data in bucketed table
insert overwrite table weatherDemoBucket select * from weatherDemo;
#SELECT the data bucket-wise
select * from weatherdemobucket
> TABLESAMPLE(BUCKET 3 OUT OF 3 ON ID);
You can see that three buckets have been created. Notice that folders are not created here; instead, bucketing creates files containing the data divided by the hash function. Remember, bucketing can also be used along with partitioning. The concept of bucketing is used in data sampling and map-side joins.
So here comes the end of this blog. As we move forward, our next stop is PySpark, a powerful tool for big data processing using Python. Soon I will be publishing a blog on PySpark. Stay tuned! Till then, keep learning and sharing. Let’s shift our focus from I, me, and myself to we, our, and ourselves, and witness the magic of learning through sharing. Happy Learning!