Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- *======================================================================*
- *
- * Defining local macros for use in de-duplication process
- *
- *======================================================================*;
- /* This macro defines variables where we want to replace potential duplicate observations
- with the most recent, non-missing observation */;
- local last BIRTHYR CONTINT GENDER
- ICIG30 ICGR30 IPIP30 ISLS30
- ICIGUPD ICGRWK IPIPWK ISLSWK
- ICIGSOON ZIP
- INSURANCE PREGNANT MHSA_INTAKE INTERPRETER LANGUAGE;
- /* This macro defines variables where we want to replace potential duplicate observations
- with the lowest value non-missing observation for the unique year & quarter combination in question.
- We would do this for cases where 1 = a positive occurence (such as race or referral source)
- and 2 or higher = some other outcome of interest. */;
- local lowest HISPANIC REASCL SELFPROX ICONSNT
- ICIGTYP ICGRTYP IPIPTYP ISLSTYP IOTYP ITYP_NONE
- ICIGUSE ICGRUSE IPIPUSE ISLSUSE
- ITOQUITCIG ITOQUITCGR ITOQUITPIP ITOQUITSLS ITOQUITO
- RACEWHT RACEBLK RACEAIAN RACEASN RACEHPI RACEOTH
- REFSAFL REFSAO REFSAPB REFSRCOM REFSRFAM REFSRHP
- REFSRINS REFSRMNP REFSRMNT REFSRMO REFSRMRA REFSRMTV
- REFSRO REFSRWK;
- /* This macro defines variables where we want to replace potential duplicate observations
- with the highest value non-missing observation for the unique year & quarter combination in question.
- We would do this for cases where values ascend in relative importance (such as 5 representing greater
- education than 4) or for cases where a numerically greater outcome (2 > 1) carries meaningful significance
- (for FRSTCALL) */;
- local highest EDUCATE FRSTCALL;
- /* This macro defines variables where we want to replace potential duplicate observations
- with the most recent (chronologically) non-missing observation for the unique year & quarter combination in question. */;
- local datefix ICIGLST ICGRLST IPIPLST ISLSLST
- INTDATE;
- *======================================================================*
- *
- * De-duplication process
- *
- *======================================================================*;
- *======================================================================*
- * Capturing potential duplicates and readying data for de-duplication by CALLERID, YEAR & QTR
- *======================================================================*;
- forvalues j = 1/26{;
- preserve;
- keep if YEARQTR == `j';
- bysort CALLERID: gen pop = _N;
- egen maxpop = max(pop);
- local maxpop = maxpop;
- * This section replaces each variable in list 'last' with the most recent non-missing entry for that variable;
- foreach var of varlist `last'{;
- gen temp_`var' = "" ;
- gsort CALLERID -INTDATE_dt, gen(CALLERIDBYDATE_`var');
- sort CALLERID CALLERIDBYDATE_`var';
- forvalues i = 1/`maxpop'{;
- gsort CALLERID;
- by CALLERID: replace temp_`var' = `var'[`i'] if !missing(`var'[`i']) & missing(temp_`var');
- };
- bysort CALLERID: replace `var' = temp_`var';
- drop CALLERIDBYDATE_`var';
- drop temp_`var';
- };
- * This section replaces each variable in list 'lowest' with the lowest non-missing value observed for that variable;
- foreach var of varlist `lowest'{;
- destring `var', replace;
- gen byte temp_`var' = .m;
- gsort CALLERID +`var', gen(CALLERLOW_`var');
- sort CALLERID CALLERLOW_`var';
- forvalues i = 1/`maxpop'{;
- gsort CALLERID;
- by CALLERID: replace temp_`var' = `var'[`i'] if !missing(`var'[`i']) & missing(temp_`var');
- };
- bysort CALLERID: replace `var' = temp_`var';
- drop CALLERLOW_`var';
- drop temp_`var';
- };
- * This section replaces each variable in list 'highest' with the highest non-missing value observed for that variable;
- foreach var of varlist `highest'{;
- destring `var', replace;
- gen byte temp_`var' = .m;
- gsort CALLERID -`var', gen(CALLERHIGH_`var');
- sort CALLERID CALLERHIGH_`var';
- forvalues i = 1/`maxpop'{;
- gsort CALLERID;
- by CALLERID: replace temp_`var' = `var'[`i'] if !missing(`var'[`i']) & missing(temp_`var');
- };
- bysort CALLERID: replace `var' = temp_`var';
- drop CALLERHIGH_`var';
- drop temp_`var';
- };
- * This section replaces each variable in list 'datefix' with the most recent (chronologically) value observed for that variable;
- foreach var of varlist `datefix'{;
- gen temp_`var' = "" ;
- gsort CALLERID -INTDATE_dt, gen(CALLDATEFIX_`var');
- sort CALLERID CALLDATEFIX_`var';
- forvalues i = 1/`maxpop'{;
- gsort CALLERID;
- by CALLERID: replace temp_`var' = `var'[`i'] if !missing(`var'[`i']) & missing(temp_`var');
- };
- bysort CALLERID: replace `var' = temp_`var';
- drop CALLDATEFIX_`var';
- drop temp_`var';
- };
- * This section limits cleaned data to one observation per CALLERID per QTR;
- bysort CALLERID: gen NEWPOP = _N;
- keep if NEWPOP == 1;
- quietly compress;
- quietly tempfile data_`j';
- quietly save `data_`j'', replace;
- * This section is used to generate Excel workbooks for manual cross-reference against source data;
- * It is currently commented out as it is only needed for testing purposes.
- * export excel using clean_`j'_test.xlsx, firstrow(variables) replace;
- restore;
- };
- clear;
- *======================================================================*
- * Re-assembling de-duplicated data into one large data set
- *======================================================================*;
- use `data_1';
- forvalues j = 2/26{;
- display "now appending `j'";
- append using `data_`j'';
- };
- desc;
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement