Untitled

-- Hive is a data warehouse infrastructure tool
-- Hive process structured data in Hadoop
-- It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
-- It is a platform used to develop SQL type scripts to do MapReduce operations.
-- Designed for OLAP not OnLine Transaction Processing (OLTP)

-- Create a database
CREATE DATABASE IF NOT EXISTS userdb;
CREATE SCHEMA userdb;


CREATE TABLE IF NOT EXISTS students ( id int, name String,
mark Float, city String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;


check where it is stored
describe formatted students;

-----------------------------Inserting Data---------------------------------------------------

-- While inserting data into Hive, it is better to use LOAD DATA to store bulk records.
-- There are two ways to load data: one is from local file system and second is from Hadoop file system.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

-- LOCAL is identifier to specify the local path. It is optional.
-- OVERWRITE is optional to overwrite the data in the table.
-- PARTITION is optional.


LOAD DATA LOCAL INPATH '/home/hadoop/students.csv'
OVERWRITE INTO TABLE students3;


-- Loading Data way 1
CREATE external TABLE IF NOT EXISTS students2 ( id int, name String,
mark Float, city String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;


LOAD DATA LOCAL INPATH '/home/hadoop/students.csv'
OVERWRITE INTO TABLE students2;


-- Loading Data way 2
CREATE external TABLE IF NOT EXISTS students4 ( id int, name String,
mark Float, city String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
LOCATION '/user/studentdata';


-----------------------------Internal and External Tables --------------------------------------------------

-- Hive Internal / Managed Tables-
	When we create a table in Hive, it by default manages the data.
	We want Hive to completely manage the lifecycle of the data and table.
	Data is temporary

-- Hive External Tables
	Data is used outside of Hive.
    For example, the data files are read and processed by an existing program that does not lock the files.
	We are not creating a table based on the existing table.
	We need data to remain in the underlying location even after a DROP TABLE.
	The hive shouldn’t own data and control settings, directories etc.,
    we may have another program or process that will do these things.


-- Internal Table mentioning location to store

CREATE TABLE IF NOT EXISTS students5 ( id int, name String,
mark Float, city String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION '/students5';

LOAD DATA LOCAL INPATH '/home/hadoop/students.csv'
OVERWRITE INTO TABLE students5;

-- Internal Table without mentioning location to store
-- will store in warehouse

CREATE TABLE IF NOT EXISTS students6 ( id int, name String,
mark Float, city String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/hadoop/students.csv'
OVERWRITE INTO TABLE students6;


-- Extrnal Table without mentioning location to store
-- will store in warehouse

CREATE EXTERNAL TABLE IF NOT EXISTS students7 ( id int, name String,
mark Float, city String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/hadoop/students.csv'
OVERWRITE INTO TABLE students7;


-----------------------------HQL Queries---------------------------------------------------

ALTER TABLE students RENAME TO student;

ALTER TABLE students CHANGE name studentname String;
ALTER TABLE student RENAME TO students;

ALTER TABLE students CHANGE mark mark Double;
ALTER TABLE students ADD COLUMNS (class STRING);

ALTER TABLE students CHANGE studentname name String;


--replace all the columns
ALTER TABLE students REPLACE COLUMNS (marks Double mark Float,studentname STRING name String);
DROP TABLE IF EXISTS students;


SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]]
[LIMIT number];


SELECT * FROM students WHERE name="Ram";
SELECT * FROM students WHERE marks>90;

SELECT * FROM students s JOIN class c ON (s.class = c.id);


CREATE VIEW topstudents AS
SELECT * FROM students4 WHERE mark>90;


-----------------------------PARTITIONING---------------------------------------------------


CREATE TABLE IF NOT EXISTS student_partitioned (id int, name String,marks Float)
PARTITIONED BY (city string);

set hive.exec.dynamic.partition.mode=nonstrict;


INSERT OVERWRITE TABLE student_partitioned PARTITION(city)
SELECT id,name,mark,city from  students4;


--------static partitioning-----

Partitions are created when data is inserted into table.
Usually when loading files (big files) into Hive tables static partitions are preferred.
That saves your time in loading data compared to dynamic partition.
You "statically" add a partition in table and move the file into the partition of the table.
You can get the partition column value form the filename, day of date etc without reading the whole big file.

in static partitioning we need to specify the partition column value in each and every LOAD statement.

suppose we are having partition on column country for table t1(userid, name,occupation, country), so each time we need to provide country value

hive>LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="US")
hive>LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="UK")

--------daynamic partitioning-----------

Incase of dynamic partition whole big file i.e. every row of the data is read and data is partitioned through a MR job into the destination tables depending on certain field in file. So usually dynamic partition are useful when you are doing sort of a ETL flow in your data pipeline.

dynamic partition allow us not to specify partition column value each time. the approach we follows is as below:

create a non-partitioned table t2 and insert data into it.
now create a table t1 partitioned on intended column(say country).
load data in t1 from t2 as below:

hive> INSERT INTO TABLE t2 PARTITION(country) SELECT * from T1;
make sure that partitioned column is always the last one in non partitioned table(as we are having country column in t2)


You can perform dynamic partition on hive external table and managed table If you want to use Dynamic partition in hive then mode is in nonstrict mode Here is hive dynamic partition properties you should allow

SET hive.exec.dynamic.partition = true;

SET hive.exec.dynamic.partition.mode = nonstrict;


-----------------------------Bucketing---------------------------------------------------


Buckets in hive is used in segregating of hive table-data into multiple files or directories. it is used for efficient querying.

The data i.e. present in that partitions can be divided further into Buckets
The division is performed based on Hash of particular columns that we selected in the table.
Buckets use some form of Hashing algorithm at back end to read each record and place it into buckets
In Hive, we have to enable buckets by using the set.hive.enforce.bucketing=true;

CREATE TABLE IF NOT EXISTS student_bucketed (id int, name String,marks Float,city String)
CLUSTERED BY (city) INTO 3 BUCKETS;


INSERT OVERWRITE TABLE student_bucketed
SELECT id,name,mark,city from  students4;