Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on the Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS.
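
As a quick illustration of attaching structure to data already in HDFS (the path and columns here are hypothetical), an external table simply overlays a schema on existing files:

create external table weblogs(ip string, url string, ts string) row format delimited fields terminated by '\t' location '/data/weblogs';
-- this query is compiled into one or more MapReduce jobs on the cluster
select url, count(*) from weblogs group by url;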

Metadata is stored in a database called the metastore.

Hive Installation: Hive requires Hadoop and Sqoop to be installed on the machine before installing Hive.
1. Install the three services hive, hive-metastore, hive-server2
yum install hive
yum install hive-metastore
yum install hive-server2
2. Install MySQL server and start MySQL
sudo yum install mysql-server
sudo service mysqld start
3. Install and link the MySQL connector
sudo yum install mysql-connector-java
ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
4. Export HIVE_HOME in .bashrc
export HIVE_HOME=/usr/lib/hive/
export PATH=$PATH:$HIVE_HOME/bin
5. Configure the hive-site.xml and hive-env.sh files in the /usr/lib/hive/conf/ directory
vi hive-site.xml
6. Change the following properties in hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://172.16.0.82:3306/hivedb?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<!-- give the database user name in the value field -->
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<!-- give the database user password in the value field -->
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Kittuu@209</value>
  <description>password to use against metastore database</description>
</property>

7. Add HADOOP_HOME to the hive-env.sh file
export HADOOP_HOME=/usr/lib/hadoop
8. Configure the MySQL JDBC jar file in the /usr/lib/hive/lib directory by downloading the latest jar into it, e.g.
mysql-connector-java-5.1.26-bin.jar
9. Before running Hive, grant database access to the cluster nodes (at least one DataNode):
GRANT ALL PRIVILEGES ON *.* TO 'root'@'<Host-IP>' IDENTIFIED BY '<MySQL Password>' WITH GRANT OPTION;
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'<Host-IP>' IDENTIFIED BY '<MySQL Password>' WITH GRANT OPTION;
10. Create permissions for root:hadoop in HDFS
> su hdfs
> hadoop fs -chown -R root:hadoop /user
11. Open MySQL and run the metastore schema script, which is in the Hive directory
> mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.13.0.mysql.sql;

If the schema script is on a different host, just copy it to the host that runs MySQL and run it there.

12. Run Hive by just typing the hive command; it will open the Hive shell.
13. For every table you create in the Hive shell, metadata will be created in the metastore; you can check that in the metastore DB.
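
For example, after creating a table in the Hive shell, you can check from MySQL that its metadata was recorded. This is a sketch assuming the standard metastore schema, where registered tables live in the TBLS table:

mysql> USE metastore;
mysql> SELECT TBL_NAME, TBL_TYPE FROM TBLS;  -- tables Hive has registered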

Hive Shell: Hive Query Language (HQL) is the SQL-like language used in the Hive shell.
1. Tables created in Hive will create a metastore database on your machine in the location where you ran the hive command (when using the default embedded metastore).

The tables you create in the Hive shell are reflected in the /user/hive/warehouse/ directory in HDFS:
hadoop fs -ls /user/hive/warehouse/

* hive -f sample.q : sample.q is a file containing queries; -f runs Hive on a script file (see the sketch below)
* hive -e 'select * from sampletable' : -e can be used for short one-off scripts
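
A minimal sketch of what a script like sample.q might contain (the table here is hypothetical):

-- sample.q, run with: hive -f sample.q
create table if not exists sampletable(id int, name string);
select count(*) from sampletable;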

Partitioning vs Bucketing:
Partitioning data is often used for distributing load horizontally; this has a performance benefit and helps organize data in a logical fashion. For example, if we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department, then for a faster query response the Hive table can be PARTITIONED BY (country STRING, dept STRING). Partitioning changes how Hive structures the data storage: Hive will create subdirectories reflecting the partitioning structure, like .../employees/country=ABC/dept=XYZ. If a query limits results to employees from country ABC, it will only scan the contents of the one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning is a very useful feature in Hive; however, a design that creates too many partitions may optimize some queries but be detrimental to other important ones. Another drawback of too many partitions is the large number of Hadoop files and directories that are created unnecessarily, and the overhead on the NameNode, since it must keep all metadata for the file system in memory.
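
A minimal sketch of such a partitioned table (the employees table and its columns are illustrative):

-- each (country, dept) combination gets its own subdirectory under .../employees/
create table employees(name string, salary float) partitioned by(country string, dept string);
-- only the country=ABC/dept=XYZ directory is scanned
select name from employees where country='ABC' and dept='XYZ';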

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using the date as the top-level partition and the employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of employee_id values is much greater than the number of buckets, each bucket will contain many employee_id values. While creating the table you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS, where XX is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create logically correct sampling. Bucketing also aids in doing efficient map-side joins.
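
A sketch of such a bucketed table (32 buckets is an arbitrary illustrative choice):

-- records are hashed on employee_id into a fixed number of buckets
create table employees_bucketed(employee_id int, name string) clustered by(employee_id) into 32 buckets;
-- older Hive versions need: set hive.enforce.bucketing=true; before inserting
-- bucketing enables logically correct sampling, reading about 1/32 of the data:
select * from employees_bucketed tablesample(bucket 1 out of 32 on employee_id);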

Hive Service vs Server vs Metastore:

HiveServer:
Hive has a Driver which parses the SQL query, creates the plan, and executes it. It also talks to the Metastore to get/store table metadata. When you run the Hive CLI, it loads the Driver object directly in its process. With HiveServer the communication between the Driver object and the CLI changes: HiveServer runs a service which accepts queries sent by the CLI (or JDBC/ODBC clients), executes them, and sends the results back. All communication is over the network.

Advantages:

1. Clients are thin and varied (such as JDBC/ODBC clients).

2. Multiple users can use the same Hive service.

There are two types of HiveServers:

1. HiveServer: the first implementation. It has concurrency and security issues.

2. HiveServer2: the second implementation, which solved the drawbacks of HiveServer.

Metastore:
It talks to a backend database such as Derby/MySQL to store and retrieve table metadata. If any Hive component wants to get/set metadata, it calls the Metastore APIs, such as getTable(tableName), createDatabase(dbName, ...), etc. Basically the metastore abstracts the backend (Derby/MySQL/Postgres) behind an independent API layer. Similar to HiveServer, this can also run as a server. If there is no metastore server running, then the Driver will load the metastore in its process. If the metastore is running as a separate server, then the Driver object communicates with it over the network. You can continue to run the same queries (including data loading) that you currently run from the Hive CLI in HiveServer 1/2 through its clients, such as the beeline CLI, JDBC, or ODBC.
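
For example, a thin client can connect to HiveServer2 with beeline (the host and credentials here are illustrative; 10000 is the usual HiveServer2 default port):

beeline -u jdbc:hive2://172.16.0.82:10000 -n hive -p <password>
0: jdbc:hive2://172.16.0.82:10000> SHOW TABLES;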

Schema on Read: unlike traditional databases, Hive doesn't enforce schema on write; it uses schema on read. Because of that, loads happen faster, which is useful in a few scenarios, such as when there are no indexes to apply or when the schema is not known at load time.
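
A minimal sketch of schema on read (table name, path, and columns are hypothetical): the load is just a file move with no validation, and any mismatch only surfaces when the data is read:

create table readings(sensor_id int, temperature float) row format delimited fields terminated by ',';
-- no validation happens here; the file is simply moved into the warehouse
load data inpath '/data/readings.csv' into table readings;
-- malformed or mismatched fields come back as NULL at query time
select * from readings;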

Updates, Transactions and Indexes:

* Hive doesn't support updates, but it does support INSERT INTO, which can insert new rows into an existing table.
* Hive supports indexes, which speed up queries (see the sketch after this list).
* There are two types of indexes: compact and bitmap. Compact indexes store the HDFS block number for each value; bitmap indexes use a compressed bitset to store the rows in which a particular value appears. Bitmap indexes are appropriate for low-cardinality columns (such as gender and country).
* Locks are managed by ZooKeeper and prevent certain operations, such as dropping a table while another process is using it. The SHOW LOCKS command is used to see which locks are being held.
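
A sketch of a compact index, assuming a Hive version that still supports indexes (the table and column names are illustrative):

-- compact index; 'BITMAP' could be used instead for low-cardinality columns
create index emp_country_idx on table employees(country) as 'COMPACT' with deferred rebuild;
-- build the index data
alter index emp_country_idx on employees rebuild;
-- see which locks are currently held
show locks;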

Partitions:

Step 1: create a table with a partition
create table userspart(userid int, age int, occupation string, zip int) partitioned by(gender string);

Step 2: there are two ways to load data into this table. One is via the LOAD command, which works only if the data file has its columns in the same order as the table above; otherwise it will return NULLs in the unmatched columns.
load data inpath '/user/hive/warehouse/users/part-m-00000' into table userspart partition(gender='M');

Step 3: the other is INSERT OVERWRITE; use a WHERE clause in the SELECT statement, otherwise it will load all the records, even those with gender M, into the new partition.
insert overwrite table userspart partition(gender='F') select userid, age, occupation, zip from users where gender='F';
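
To verify, you can list the table's partitions and let Hive prune to a single directory (a sketch):

-- each partition appears as its own subdirectory under the table
show partitions userspart;
-- partition pruning: only the gender=F subdirectory is scanned
select userid, occupation from userspart where gender='F';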