Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on the Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS.
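
As a quick illustration of attaching structure to data already in HDFS (the path and columns here are hypothetical), an external table simply overlays a schema on existing files:

create external table weblogs(ip string, url string, ts string) row format delimited fields terminated by '\t' location '/data/weblogs';
-- this query is compiled into one or more MapReduce jobs on the cluster
select url, count(*) from weblogs group by url;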

Metadata is stored in a database called the metastore.

Hive Installation: Hive requires Hadoop and Sqoop to be installed on the machine before installing Hive.
1. Install the three services hive, hive-metastore, hive-server2
yum install hive
yum install hive-metastore
yum install hive-server2
2. Install MySQL server and start MySQL
sudo yum install mysql-server
sudo service mysqld start
3. Install and link the MySQL connector
sudo yum install mysql-connector-java
ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
4. Export HIVE_HOME in .bashrc
export HIVE_HOME=/usr/lib/hive/
export PATH=$PATH:$HIVE_HOME/bin
5. Configure the hive-site.xml and hive-env.sh files in the /usr/lib/hive/conf/ directory
vi hive-site.xml
6. Change the following properties in hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://172.16.0.82:3306/hivedb?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<!-- give the database user name in the value field -->
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<!-- give the database user password in the value field -->
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Kittuu@209</value>
  <description>password to use against metastore database</description>
</property>

7. Add HADOOP_HOME to the hive-env.sh file
export HADOOP_HOME=/usr/lib/hadoop
8. Configure the MySQL JDBC jar file in the /usr/lib/hive/lib directory by downloading the latest jar into it, e.g.
mysql-connector-java-5.1.26-bin.jar
9. Before running Hive, grant database access to the cluster nodes (at least one DataNode):
GRANT ALL PRIVILEGES ON *.* TO 'root'@'<Host-IP>' IDENTIFIED BY '<MySQL Password>' WITH GRANT OPTION;
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'<Host-IP>' IDENTIFIED BY '<MySQL Password>' WITH GRANT OPTION;
10. Create permissions for root:hadoop in HDFS
> su hdfs
> hadoop fs -chown -R root:hadoop /user
11. Open MySQL and run the metastore schema script, which is in the Hive directory
> mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.13.0.mysql.sql;

If the schema script is on a different host, just copy it to the host that runs MySQL and run it there.

12. Run Hive by just typing the hive command; it will open the Hive shell.
13. For every table you create in the Hive shell, metadata will be created in the metastore; you can check that in the metastore DB.
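
For example, after creating a table in the Hive shell, you can check from MySQL that its metadata was recorded. This is a sketch assuming the standard metastore schema, where registered tables live in the TBLS table:

mysql> USE metastore;
mysql> SELECT TBL_NAME, TBL_TYPE FROM TBLS;  -- tables Hive has registered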

Hive Shell: Hive Query Language (HQL) is the SQL-like language used in the Hive shell.
1. Tables created in Hive will create a metastore database on your machine in the location where you ran the hive command (when using the default embedded metastore).

The tables you create in the Hive shell are reflected in the /user/hive/warehouse/ directory in HDFS:
hadoop fs -ls /user/hive/warehouse/

* hive -f sample.q : sample.q is a file containing queries; -f runs Hive on a script file (see the sketch below)
* hive -e 'select * from sampletable' : -e can be used for short one-off scripts
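
A minimal sketch of what a script like sample.q might contain (the table here is hypothetical):

-- sample.q, run with: hive -f sample.q
create table if not exists sampletable(id int, name string);
select count(*) from sampletable;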

Partitioning vs Bucketing:
Partitioning data is often used for distributing load horizontally; this has a performance benefit and helps organize data in a logical fashion. For example, if we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department, then for a faster query response the Hive table can be PARTITIONED BY (country STRING, dept STRING). Partitioning changes how Hive structures the data storage: Hive will create subdirectories reflecting the partitioning structure, like .../employees/country=ABC/dept=XYZ. If a query limits results to employees from country ABC, it will only scan the contents of the one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning is a very useful feature in Hive; however, a design that creates too many partitions may optimize some queries but be detrimental to other important ones. Another drawback of too many partitions is the large number of Hadoop files and directories that are created unnecessarily, and the overhead on the NameNode, since it must keep all metadata for the file system in memory.
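
A minimal sketch of such a partitioned table (the employees table and its columns are illustrative):

-- each (country, dept) combination gets its own subdirectory under .../employees/
create table employees(name string, salary float) partitioned by(country string, dept string);
-- only the country=ABC/dept=XYZ directory is scanned
select name from employees where country='ABC' and dept='XYZ';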

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using the date as the top-level partition and the employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of employee_id values is much greater than the number of buckets, each bucket will contain many employee_id values. While creating the table you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS, where XX is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create logically correct sampling. Bucketing also aids in doing efficient map-side joins.
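
A sketch of such a bucketed table (32 buckets is an arbitrary illustrative choice):

-- records are hashed on employee_id into a fixed number of buckets
create table employees_bucketed(employee_id int, name string) clustered by(employee_id) into 32 buckets;
-- older Hive versions need: set hive.enforce.bucketing=true; before inserting
-- bucketing enables logically correct sampling, reading about 1/32 of the data:
select * from employees_bucketed tablesample(bucket 1 out of 32 on employee_id);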

Hive Service vs Server vs Metastore:

HiveServer:
Hive has a Driver which parses the SQL query, creates the plan, and executes it. It also talks to the Metastore to get/store table metadata. When you run the Hive CLI, it loads the Driver object directly in its process. With HiveServer the communication between the Driver object and the CLI changes: HiveServer runs a service which accepts queries sent by the CLI (or JDBC/ODBC clients), executes them, and sends the results back. All communication is over the network.

Advantages:

1. Clients are thin and varied (such as JDBC/ODBC clients).

2. Multiple users can use the same Hive service.

There are two types of HiveServers:

1. HiveServer: the first implementation. It has concurrency and security issues.

2. HiveServer2: the second implementation, which solved the drawbacks of HiveServer.

Metastore:
It talks to a backend database such as Derby/MySQL to store and retrieve table metadata. If any Hive component wants to get/set metadata, it calls the Metastore APIs, such as getTable(tableName), createDatabase(dbName, ...), etc. Basically the metastore abstracts the backend (Derby/MySQL/Postgres) behind an independent API layer. Similar to HiveServer, this can also run as a server. If there is no metastore server running, then the Driver will load the metastore in its process. If the metastore is running as a separate server, then the Driver object communicates with it over the network. You can continue to run the same queries (including data loading) that you currently run from the Hive CLI in HiveServer 1/2 through its clients, such as the beeline CLI, JDBC, or ODBC.
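
For example, a thin client can connect to HiveServer2 with beeline (the host and credentials here are illustrative; 10000 is the usual HiveServer2 default port):

beeline -u jdbc:hive2://172.16.0.82:10000 -n hive -p <password>
0: jdbc:hive2://172.16.0.82:10000> SHOW TABLES;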

Schema on Read: unlike traditional databases, Hive doesn't enforce schema on write; it uses schema on read. Because of that, loads happen faster, which is useful in a few scenarios, such as when there are no indexes to apply or when the schema is not known at load time.
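
A minimal sketch of schema on read (table name, path, and columns are hypothetical): the load is just a file move with no validation, and any mismatch only surfaces when the data is read:

create table readings(sensor_id int, temperature float) row format delimited fields terminated by ',';
-- no validation happens here; the file is simply moved into the warehouse
load data inpath '/data/readings.csv' into table readings;
-- malformed or mismatched fields come back as NULL at query time
select * from readings;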

Updates, Transactions and Indexes:

* Hive doesn't support updates, but it does support INSERT INTO, which can insert new rows into an existing table.
* Hive supports indexes, which speed up queries (see the sketch after this list).
* There are two types of indexes: compact and bitmap. Compact indexes store the HDFS block number for each value; bitmap indexes use a compressed bitset to store the rows in which a particular value appears. Bitmap indexes are appropriate for low-cardinality columns (such as gender and country).
* Locks are managed by ZooKeeper and prevent certain operations, such as dropping a table while another process is using it. The SHOW LOCKS command is used to see which locks are being held.
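
A sketch of a compact index, assuming a Hive version that still supports indexes (the table and column names are illustrative):

-- compact index; 'BITMAP' could be used instead for low-cardinality columns
create index emp_country_idx on table employees(country) as 'COMPACT' with deferred rebuild;
-- build the index data
alter index emp_country_idx on employees rebuild;
-- see which locks are currently held
show locks;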

Partitions:

Step 1: create a table with a partition
create table userspart(userid int, age int, occupation string, zip int) partitioned by(gender string);

Step 2: there are two ways to load data into this table. One is via the LOAD command, which works only if the data file has its columns in the same order as the table above; otherwise it will return NULLs in the unmatched columns.
load data inpath '/user/hive/warehouse/users/part-m-00000' into table userspart partition(gender='M');

Step 3: the other is INSERT OVERWRITE; use a WHERE clause in the SELECT statement, otherwise it will load all the records, even those with gender M, into the new partition.
insert overwrite table userspart partition(gender='F') select userid, age, occupation, zip from users where gender='F';
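
To verify, you can list the table's partitions and let Hive prune to a single directory (a sketch):

-- each partition appears as its own subdirectory under the table
show partitions userspart;
-- partition pruning: only the gender=F subdirectory is scanned
select userid, occupation from userspart where gender='F';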