josephxsxn

Hive Char Encoding

Jul 11th, 2017
-- Multibyte is not an issue at all in Hadoop; Hadoop is UTF-8 based by default (I can give reasons too).
-- UTF-8 is multi-byte and will have no character loss coming from UTF-16 (but we all know that).
-- You can convert the data into UTF-8 with iconv on Linux, or set a Hive SERDE property for Lazy tables.
-- But I'd recommend you get out of your char encoding and into UTF-8 ASAP if you expect anything to
-- work correctly across multiple toolsets. Specific sets like TIS-620 need a special Lazy SERDE property,
-- but if your encoding produces breaking chars like \n they will break your stuff, and you will have to
-- either write an MR job with a custom record reader OR convert the encoding before loading to HDFS
-- with iconv or something like it. Also, some users may simply not have the glyphs to even show on
-- their screen.

CREATE TABLE `thaiiso_codec`(
  `line` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.encoding'='TIS-620')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/joe/charissue/thaiiso'
TBLPROPERTIES (
  'transient_lastDdlTime'='1475767247')
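
-- A minimal sketch of the "get into UTF-8 ASAP" route, assuming the files behind the table above
-- really are TIS-620: Hive decodes rows through the TIS-620 SerDe and writes them into a plain
-- text table, which is UTF-8 by default. The thaiiso_utf8 table name and its location are made up
-- for this example.
CREATE TABLE `thaiiso_utf8`(
  `line` string)
STORED AS TEXTFILE
LOCATION
  '/user/joe/charissue/thaiiso_utf8';

INSERT OVERWRITE TABLE thaiiso_utf8
SELECT line FROM thaiiso_codec;

-- The encoding property can also be set on an existing LazySimpleSerDe table instead of
-- recreating it:
ALTER TABLE thaiiso_codec SET SERDEPROPERTIES ('serialization.encoding'='TIS-620');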