josephxsxn

Tez Partition Hive DeDupe

Jul 18th, 2017
109
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!

test data (initaly loaded to prepopulat emaster)

12,THEDUPE,1,1,1
11,test2,1,1,2
10,test3,2,1,1
9,test4,2,1,1

data with dupe (put into the stage_new table after cleaning it out from the inital test data and loading it)

12,THEDUPE,1,1,1
8,test1,1,1,1
7,test2,1,1,2
6,test3,3,1,1

Populate

set hive.exec.dynamic.partition.mode=nonstrict;
create temporary table new_parts
AS select p1, p2, p3 from stage_new group by p1, p2, p3;

DEDUPE and INSERT

INSERT OVERWRITE TABLE master_base PARTITION(p1, p2, p3)
select hash, line, p1, p2, p3 from stage_new
UNION DISTINCT
select hash, line, mb.p1, mb.p2, mb.p3 from master_base mb JOIN new_parts np ON (mb.p1 = np.p1 AND mb.p2 = np.p2 AND mb.p3 = np.p3);

master_base BEFORE Insert of dedupe required data

hive> select * from master_base;
OK
12 THEDUPE 1 1 1
11 test2 1 1 2
10 test3 2 1 1
9 test4 2 1 1

master_base AFTER Insert

hive> select * from master_base;
OK
12 THEDUPE 1 1 1
8 test1 1 1 1
11 test2 1 1 2
7 test2 1 1 2
10 test3 2 1 1
9 test4 2 1 1
6 test3 3 1 1

Add Comment
Please, Sign In to add comment