josephxsxn

Hive Char Encoding

Jul 11th, 2017
-- Multibyte is not an issue at all in Hadoop; Hadoop is UTF-8 based by default (I can give reasons too).
-- UTF-8 is multi-byte and will have no character loss coming from UTF-16 (but we all know that).
-- You can convert the data into UTF-8 with iconv on Linux, or set a Hive SERDE property for Lazy tables.
-- But I'd recommend you get out of your char encoding and into UTF-8 ASAP if you expect anything to
-- work correctly across multiple toolsets. Specific sets like TIS-620 need a special Lazy SERDE property,
-- but if your encoding produces breaking chars like \n they will break your stuff, and you will have to
-- either write an MR job with a custom record reader OR convert the encoding before loading to HDFS
-- with iconv or something like it. Also, some users may simply not have the glyphs to even show on
-- their screen.

CREATE TABLE `thaiiso_codec`(
  `line` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.encoding'='TIS-620')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/joe/charissue/thaiiso'
TBLPROPERTIES (
  'transient_lastDdlTime'='1475767247')
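
-- A minimal sketch of the "get into UTF-8 ASAP" route, assuming the files behind the table above
-- really are TIS-620: Hive decodes rows through the TIS-620 SerDe and writes them into a plain
-- text table, which is UTF-8 by default. The thaiiso_utf8 table name and its location are made up
-- for this example.
CREATE TABLE `thaiiso_utf8`(
  `line` string)
STORED AS TEXTFILE
LOCATION
  '/user/joe/charissue/thaiiso_utf8';

INSERT OVERWRITE TABLE thaiiso_utf8
SELECT line FROM thaiiso_codec;

-- The encoding property can also be set on an existing LazySimpleSerDe table instead of
-- recreating it:
ALTER TABLE thaiiso_codec SET SERDEPROPERTIES ('serialization.encoding'='TIS-620');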