a guest | Jun 1st, 2024
take these errors:

```
healthcare-data/data_processing on HEAD (aa330b9) [?] via 🐍 v3.10.14 (.venv) 🕙 7s
❯ dbt run --select source:social_determinants_of_health+
12:02:17 Running with dbt=1.7.11
12:02:18 Registered adapter: duckdb=1.7.3
12:02:18 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.healthcare_data.example
12:02:18 Found 22 models, 41 sources, 0 exposures, 0 metrics, 527 macros, 0 groups, 0 semantic models
12:02:18
12:02:38 Concurrency: 1 threads (target='dev')
12:02:38
12:02:38 1 of 3 START sql external model main.sdoh_county ............................... [RUN]
12:02:38 Unhandled error while executing
[Errno 2] No such file or directory: '../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx'
12:02:38 1 of 3 ERROR creating sql external model main.sdoh_county ...................... [ERROR in 0.02s]
12:02:38 2 of 3 START sql external model main.sdoh_tract ................................ [RUN]
12:02:38 Unhandled error while executing
[Errno 2] No such file or directory: '../data/SDOH/sdoh_2009_tract_1_0.xlsx'
12:02:38 2 of 3 ERROR creating sql external model main.sdoh_tract ....................... [ERROR in 0.01s]
12:02:38 3 of 3 START sql external model main.sdoh_zipcode .............................. [RUN]
12:02:38 Unhandled error while executing
[Errno 2] No such file or directory: '../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx'
12:02:38 3 of 3 ERROR creating sql external model main.sdoh_zipcode ..................... [ERROR in 0.01s]
12:02:38
12:02:38 Finished running 3 external models in 0 hours 0 minutes and 20.56 seconds (20.56s).
12:02:38
12:02:38 Completed with 3 errors and 0 warnings:
12:02:38
12:02:38 [Errno 2] No such file or directory: '../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx'
12:02:38
12:02:38 [Errno 2] No such file or directory: '../data/SDOH/sdoh_2009_tract_1_0.xlsx'
12:02:38
12:02:38 [Errno 2] No such file or directory: '../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx'
12:02:38
12:02:38 Done. PASS=0 WARN=0 ERROR=3 SKIP=0 TOTAL=3
```
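All three failures are the same class of error: the `.xlsx` files that the external models read are missing from disk, so nothing dbt-related needs fixing before the data is downloaded. A minimal pre-flight check (a sketch, not part of the project; the paths are copied from the log and assume the same working directory dbt ran from) could be:

```python
import os

# Paths copied from the three [Errno 2] errors above.
expected_files = [
    "../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx",
    "../data/SDOH/sdoh_2009_tract_1_0.xlsx",
    "../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx",
]

# Report which source files are absent before invoking dbt.
missing = [path for path in expected_files if not os.path.exists(path)]
for path in missing:
    print(f"missing: {path}")
```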
and take this code:

```
.
├── download_industry_codes.ipynb
├── generate_sql_schemas_for_extracted_csv_files.py
├── generate_sql_with_enum_types.py
├── generate_sql_with_enum_types_and_mapped_values.py
├── generate_sql_with_enum_types_and_mapped_values_renamed.py
├── generate_sql_with_mapped_values.py
├── generate_sql_with_renamed_columns.py
├── generate_sql_with_types.py
├── parse_data_dictionary.py
└── process_one_year_of_american_community_survey_data.sh

1 directory, 10 files
```
Relative file path: process_one_year_of_american_community_survey_data.sh
```
#!/bin/bash

# Function to print time taken for each step in minutes and seconds
print_time_taken() {
    local start=$1
    local end=$2
    local step=$3
    local duration=$((end-start))
    local minutes=$((duration / 60))
    local seconds=$((duration % 60))
    echo "Time taken for $step: ${minutes} minutes and ${seconds} seconds."
}

# Check if the user has provided a year
if [ -z "$1" ]
then
    echo "You must provide a four-character year as an argument."
    exit 1
fi

YEAR=$1

# Capture start time
start=$(date +%s)

echo "Starting process for the year: $YEAR"

start_step=$(date +%s)

echo "Step 1: Listing URLs of the $YEAR 1-Year ACS PUMS data"
dbt run --select "public_use_microdata_sample.list_urls" \
    --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
    --threads 8

echo "Step 1: Saving the database of the $YEAR 1-Year ACS PUMS data"
dbt run --select "public_use_microdata_sample.urls" \
    --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
    --threads 8

# echo "Step 1: Listing URLs of the $YEAR 1-Year ACS PUMS data"
# dbt run --select "public_use_microdata_sample.list_urls" \
#     --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
#     --threads 8

# echo "Step 1: Saving the database of the $YEAR 1-Year ACS PUMS data"
# dbt run --select "public_use_microdata_sample.urls" \
#     --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
#     --threads 8

echo "Checking URLs..."
duckdb -c "SELECT * FROM '~/data/american_community_survey/urls.parquet'"

end_step=$(date +%s)
print_time_taken $start_step $end_step "Step 1"

start_step=$(date +%s)

echo "Step 2: Downloading and extracting the archives for all of the 50 states' PUMS files"
dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
    --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
    --threads 8

echo "Saving paths to the CSV files..."
dbt run --select "public_use_microdata_sample.csv_paths" \
    --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
    --threads 8

echo "Checking presence of CSV files..."
duckdb -c "SELECT * FROM '~/data/american_community_survey/csv_paths.parquet'"

end_step=$(date +%s)
print_time_taken $start_step $end_step "Step 2"

echo "Step 3: Parsing the data dictionary"

# dbt run --select "public_use_microdata_sample.parse_data_dictionary" \
#     --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
#     --threads 8

# dbt run --select "public_use_microdata_sample.data_dictionary_path" \
#     --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
#     --threads 8

# echo "Checking data dictionary path..."
# duckdb -c "SELECT * FROM '~/data/american_community_survey/data_dictionary_path.parquet'"

# echo "Step 4: Generating SQL commands for mapping variables"
# python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py \
#     ~/data/american_community_survey/csv_paths.parquet \
#     ~/data/american_community_survey/PUMS_Data_Dictionary_$YEAR.json

# echo "Step 5: Executing generated SQL queries"
# dbt run --select "public_use_microdata_sample.generated.$YEAR.enum_types_mapped_renamed+" \
#     --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
#     --threads 8

# echo "Step 6: Testing presence and size of compressed parquet files"
# du -sh ~/data/american_community_survey/$YEAR
# du -hc ~/data/american_community_survey/*$YEAR.parquet

# echo "Checking SQL query execution..."
# duckdb -c "SELECT COUNT(*) FROM '~/data/american_community_survey/*individual_people_united_states*$YEAR.parquet'"

# Capture end time
end_step=$(date +%s)

print_time_taken $start_step $end_step "Step 3"

# Calculate and report the total time taken
print_time_taken $start $end_step "Total"
```
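The reporting helper in the script is plain integer arithmetic; a self-contained sketch with made-up timestamps shows the minutes/seconds split it produces:

```shell
#!/bin/bash
# Same arithmetic as print_time_taken above, with fixed example
# timestamps (1155 - 1000 = 155s, i.e. 2 minutes and 35 seconds).
start=1000
end=1155
duration=$((end - start))
minutes=$((duration / 60))
seconds=$((duration % 60))
echo "Time taken: ${minutes} minutes and ${seconds} seconds."
```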
Relative file path: generate_sql_with_enum_types_and_mapped_values.py
```
import pandas as pd
import sys
import os
import json

def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    type_char = folder_name.split("_")[1][0].lower()
    folder_code = folder_name.split("_")[1][1:].upper()
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")

    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()


def clean_enum_value(value):
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value


if len(sys.argv) < 3:
    print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
    sys.exit(1)

parquet_database_path, data_dictionary_path = sys.argv[1:3]

with open(data_dictionary_path, "r") as json_file:
    data_dict = json.load(json_file)

state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}

df_csv_paths = pd.read_parquet(parquet_database_path)
models_dir = "models/public_use_microdata_sample/generated/enum_types_mapped"
os.makedirs(models_dir, exist_ok=True)


def should_include_key(description):
    exclude_criteria = ["weight", "identifier", "number", "age", "income", "time", "hours", "weeks", "puma", "total", "fee", "cost", "amount", "rent", "value", "taxes"]
    # Exclude the column if any criterion appears in its description;
    # descriptions mentioning "flag" are always kept.
    if any(criterion in description.lower() and "flag" not in description.lower() for criterion in exclude_criteria):
        return False
    return True

for csv_path in df_csv_paths["csv_path"]:
    folder_name = os.path.basename(os.path.dirname(csv_path))
    csv_name = os.path.basename(csv_path).split(".")[0]
    materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)

    df_headers = pd.read_csv(csv_path, nrows=0)
    column_types = {column: 'VARCHAR' for column in df_headers.columns}
    columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
    sql_select_parts = ["SELECT"]
    enum_creation_statements = []
    table_creation_statement = f"CREATE TABLE {materialized_name} ("
    column_definitions = []
    newline = "\n"

    for header, details in data_dict.items():
        if "Values" in details:
            if header in df_headers.columns:
                enum_values = [f"'{key.strip()}'" for key, value in details["Values"].items()]
                if should_include_key(details["Description"]) and len(enum_values) > 0:
                    enum_name = f"{header}_enum"
                    value_mapping = "\n\t\t".join([
                        f"WHEN '{clean_enum_value(code)}' THEN '{clean_enum_value(label)}'"
                        for code, label in data_dict[header]["Values"].items()
                    ])
                    enum_labels = [f"'{clean_enum_value(label)}'" for code, label in data_dict[header]["Values"].items()]
                    mapped_column = f"""CASE {header}\n\t\t{value_mapping}\n\tEND::ENUM ({','.join(enum_labels)}) AS {header},"""
                    column_definitions.append(mapped_column)
                else:
                    column_definitions.append(f" {header}::VARCHAR,")
            else:
                print(f"Column {header} not found in {csv_name}.csv")

    sql_select_parts[-1] = sql_select_parts[-1].rstrip(',')
    sql_select_statement = "\n".join(sql_select_parts)
    newline = "\n"
    newline_with_comma = ",\n"

    # Combine ENUM creation, table creation, and COPY command in SQL content
    sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
{{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}_enum_types_mapped.parquet') }}}}

SELECT
{newline.join(column_definitions)}
FROM read_csv('{csv_path}',
parallel=False,
all_varchar=True,
auto_detect=True)"""

    sql_file_path = os.path.join(models_dir, f"{materialized_name}_enum_mapped.sql")
    with open(sql_file_path, "w") as sql_file:
        sql_file.write(sql_content)
```
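The label-cleaning rules are easiest to see on a concrete example. The function below is copied verbatim from the script above; the input string is made up for illustration (order matters: "N/A" is expanded before the remaining "/" becomes " or "):

```python
def clean_enum_value(value):
    # Strip quotes, expand N/A, and replace characters that would be
    # awkward inside a DuckDB ENUM label (slashes and parentheses).
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value

print(clean_enum_value("N/A (GQ/vacant)"))  # → Not applicable - GQ or vacant
```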
Relative file path: generate_sql_with_enum_types.py
```
import pandas as pd
import sys
import os
import json

def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    type_char = folder_name.split("_")[1][0].lower()
    folder_code = folder_name.split("_")[1][1:].upper()
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")

    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()


def clean_enum_value(value):
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value


if len(sys.argv) < 3:
    print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
    sys.exit(1)

parquet_database_path, data_dictionary_path = sys.argv[1:3]

with open(data_dictionary_path, "r") as json_file:
    data_dict = json.load(json_file)

state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}

df_csv_paths = pd.read_parquet(parquet_database_path)
models_dir = "models/public_use_microdata_sample/generated/enum_types"
os.makedirs(models_dir, exist_ok=True)


def should_include_key(description):
    exclude_criteria = ["weight", "identifier", "number", "age", "income", "time", "hours", "weeks", "puma", "total", "fee", "cost", "amount", "rent", "value", "taxes"]
    # Exclude the column if any criterion appears in its description;
    # descriptions mentioning "flag" are always kept.
    if any(criterion in description.lower() and "flag" not in description.lower() for criterion in exclude_criteria):
        return False
    return True

for csv_path in df_csv_paths["csv_path"]:
    folder_name = os.path.basename(os.path.dirname(csv_path))
    csv_name = os.path.basename(csv_path).split(".")[0]
    materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)

    df_headers = pd.read_csv(csv_path, nrows=0)
    column_types = {column: 'VARCHAR' for column in df_headers.columns}
    columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
    sql_select_parts = ["SELECT"]
    enum_creation_statements = []
    table_creation_statement = f"CREATE TABLE {materialized_name} ("
    column_definitions = []

    for header, details in data_dict.items():
        if "Values" in details:
            if header in df_headers.columns:
                # enum_values = [f"'{key}'" for key, value in details["Values"].items() if should_include_value(key, details["Description"], value)]
                enum_values = [f"'{key.strip()}'" for key, value in details["Values"].items()]
                if should_include_key(details["Description"]) and len(enum_values) > 0:
                    enum_name = f"{header}_enum"
                    enum_creation_statements.append(f"CREATE TYPE {enum_name} AS ENUM ({','.join(enum_values)});")
                    # column_definitions.append(f" {header} {enum_name}")
                    column_definitions.append(f" {header}::ENUM ({','.join(enum_values)}),")
                    # sql_select_parts.append(f" CAST({header} AS {enum_name}) AS {header},")
                else:
                    column_definitions.append(f" {header}::VARCHAR,")
                    # sql_select_parts.append(f" {header},")
            else:
                print(f"Column {header} not found in {csv_name}.csv")

    sql_select_parts[-1] = sql_select_parts[-1].rstrip(',')
    sql_select_statement = "\n".join(sql_select_parts)
    newline = "\n"
    newline_with_comma = ",\n"
    # sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
    # {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
    # {newline.join(enum_creation_statements)}

    # {sql_select_statement}
    # FROM read_csv_auto('{csv_path}')
    # """

    # Combine ENUM creation, table creation, and COPY command in SQL content
    sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
{{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}_enum_types.parquet') }}}}

SELECT
{newline.join(column_definitions)}
FROM read_csv('{csv_path}',
parallel=False,
all_varchar=True,
auto_detect=True)"""

    sql_file_path = os.path.join(models_dir, f"{materialized_name}_enum.sql")
    with open(sql_file_path, "w") as sql_file:
        sql_file.write(sql_content)
```
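A quick sanity check of `generate_materialized_name` (function copied from the script above; the folder/CSV names and the two lookup tables are toy examples, not the real PUMS data dictionary):

```python
def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    # Folder names look like "csv_pca" (person, California) or "csv_hus"
    # (housing, national); the US case falls through to the CSV name.
    type_char = folder_name.split("_")[1][0].lower()
    folder_code = folder_name.split("_")[1][1:].upper()
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")

    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()

# Toy lookups; the real ones are built from the data dictionary JSON.
state_lookup = {"CA": "California"}
national_lookup = {"USA": "United States first tranche"}

print(generate_materialized_name("csv_pca", "psam_pca", state_lookup, national_lookup))
# → individual_people_california
print(generate_materialized_name("csv_hus", "psam_husa", state_lookup, national_lookup))
# → housing_units_united_states_first_tranche
```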
Relative file path: generate_sql_with_enum_types_and_mapped_values_renamed.py
```
import pandas as pd
import sys
import os
import json


def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    type_char = folder_name.split("_")[1][0].lower()
    folder_code = folder_name.split("_")[1][1:].upper()
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")

    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()


def clean_enum_value(value):
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value


if len(sys.argv) < 3:
    print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
    sys.exit(1)

parquet_database_path, data_dictionary_path = sys.argv[1:3]

year = data_dictionary_path.split("/")[-1].split(".")[0].split("_")[-1]
print(f"Year: {year}")
with open(data_dictionary_path, "r") as json_file:
    data_dict = json.load(json_file)

state_lookup = {
    code: name
    for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]
}
national_lookup = {
    "USA": "United States first tranche",
    "USB": "United States second tranche",
}

df_csv_paths = pd.read_parquet(parquet_database_path)
models_dir = (
    f"models/public_use_microdata_sample/generated/{year}/enum_types_mapped_renamed"
)
os.makedirs(models_dir, exist_ok=True)


def should_include_key(description):
    exclude_criteria = [
        "weight",
        "identifier",
        "number",
        "age",
        "income",
        "time",
        "hours",
        "weeks",
        "puma",
        "total",
        "fee",
        "cost",
        "amount",
        "rent",
        "value",
        "taxes",
    ]
    # Exclude the column if any criterion appears in its description;
    # descriptions mentioning "flag" are always kept.
    if any(
        criterion in description.lower() and "flag" not in description.lower()
        for criterion in exclude_criteria
    ):
        return False
    return True


for csv_path in df_csv_paths["csv_path"]:
    folder_name = os.path.basename(os.path.dirname(csv_path))
    csv_name = os.path.basename(csv_path).split(".")[0]
    materialized_name = generate_materialized_name(
        folder_name, csv_name, state_lookup, national_lookup
    )

    df_headers = pd.read_csv(csv_path, nrows=0)
    column_types = {column: "VARCHAR" for column in df_headers.columns}
    columns = ", ".join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
    sql_select_parts = ["SELECT"]
    enum_creation_statements = []
    table_creation_statement = f"CREATE TABLE {materialized_name} ("
    column_definitions = []
    newline = "\n"

    for header, details in data_dict.items():

        if "Values" in details:
            if header in df_headers.columns:
                enum_values = [
                    f"'{key.strip()}'" for key, value in details["Values"].items()
                ]
                col_info = data_dict.get(header, {"Description": header})
                description = col_info["Description"]

                if should_include_key(details["Description"]) and len(enum_values) > 0:
                    enum_name = f"{header}_enum"
                    value_mapping = "\n\t\t".join(
                        [
                            f"WHEN '{clean_enum_value(code)}' THEN '{clean_enum_value(label)}'"
                            for code, label in data_dict[header]["Values"].items()
                        ]
                    )
                    enum_labels = [
                        f"'{clean_enum_value(label)}'"
                        for code, label in data_dict[header]["Values"].items()
                    ]
                    mapped_column = f"""CASE {header}\n\t\t{value_mapping}\n\tEND::ENUM ({','.join(enum_labels)}) AS "{description}","""
                    column_definitions.append(mapped_column)
                else:
                    column_definitions.append(
                        f' {header}::VARCHAR AS "{description}",'
                    )
            else:
                # print(f"Column {header} not found in {csv_name}.csv")
                pass

    sql_select_parts[-1] = sql_select_parts[-1].rstrip(",")
    sql_select_statement = "\n".join(sql_select_parts)
    newline = "\n"
    newline_with_comma = ",\n"
    username = os.environ.get("USER")
    path_without_user = "~/" + csv_path.split(username + '/')[1]
    # Combine ENUM creation, table creation, and COPY command in SQL content
    sql_content = f"""-- SQL transformation for {csv_name} generated by models/public_use_microdata_sample/scripts/{os.path.basename(__file__)}
{{{{ config(materialized='external', location=var('output_path') + '/acs_pums_{materialized_name}_{year}.parquet') }}}}

SELECT
{newline.join(column_definitions)}
FROM read_csv('{path_without_user}',
parallel=False,
all_varchar=True,
auto_detect=True)
"""

    sql_file_path = os.path.join(
        models_dir, f"{materialized_name}_enum_mapped_renamed_{year}.sql"
    )
    with open(sql_file_path, "w") as sql_file:
        sql_file.write(sql_content)
```
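The "flag" escape hatch in `should_include_key` is easy to miss: a description containing both an excluded word and "flag" is still kept, so allocation-flag columns get enum mappings. A self-contained check (function copied from the scripts above; the description strings are illustrative):

```python
def should_include_key(description):
    # Skip enum mapping for continuous-valued columns (weights, incomes,
    # ages, ...), but never skip columns whose description mentions "flag".
    exclude_criteria = [
        "weight", "identifier", "number", "age", "income", "time", "hours",
        "weeks", "puma", "total", "fee", "cost", "amount", "rent", "value", "taxes",
    ]
    if any(
        criterion in description.lower() and "flag" not in description.lower()
        for criterion in exclude_criteria
    ):
        return False
    return True

print(should_include_key("Wages or salary income past 12 months"))  # False: continuous
print(should_include_key("Allocation flag for income"))             # True: "flag" wins
print(should_include_key("Sex"))                                    # True: no criterion matches
```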
Relative file path: generate_sql_with_types.py
```
import pandas as pd
import duckdb
import sys
import os
import json


def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    # Extract the state code and type (P or H) from the folder name,
    # assuming the folder format is "csv_PXX" or "csv_HXX"
    type_char = folder_name.split("_")[1][0].lower()
    folder_code = folder_name.split("_")[1][1:].upper()

    # Determine the human-readable name based on the type character
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        # Get the national-level name from the lookup table
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
        print(csv_code, name)
    elif len(folder_code) == 2:
        # Get the state name from the lookup table
        name = state_lookup.get(folder_code, "Unknown state code")
        print(folder_code, name)
    else:
        raise ValueError(f"Invalid code: {folder_code}")
    # Return the formatted name
    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()


if len(sys.argv) < 3:
    print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
    sys.exit(1)

parquet_database_path, data_dictionary_path = sys.argv[1:3]

# Load the data dictionary from the JSON file
with open(data_dictionary_path, "r") as json_file:
    data_dict = json.load(json_file)

# Generate lookup table for state codes
state_lookup = {
    code: name
    for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]
}

# Define short codes for first and second tranches of national-level data
national_lookup = {
    "USA": "United States first tranche",
    "USB": "United States second tranche",
}

# Connect to DuckDB
conn = duckdb.connect(database=":memory:", read_only=False)

# Assuming the Parquet file contains paths to CSV files
df_csv_paths = pd.read_parquet(parquet_database_path)

models_dir = "models/public_use_microdata_sample/generated/with_types"
os.makedirs(models_dir, exist_ok=True)

for csv_path in df_csv_paths["csv_path"]:
    folder_name = os.path.basename(os.path.dirname(csv_path))
    csv_name = os.path.basename(csv_path)
    csv_name = csv_name.split(".")[0]
    materialized_name = generate_materialized_name(
        folder_name, csv_name, state_lookup, national_lookup
    )

    df_headers = pd.read_csv(csv_path, nrows=0)
    column_types = {column: 'VARCHAR' for column in df_headers.columns}
    columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])

    sql_select_parts = ["SELECT"]
    for header in df_headers.columns:
        col_info = data_dict.get(header, {"Description": header})
        description = col_info["Description"]
        sql_select_parts.append(f' {header} AS "{description}",')

    sql_select_parts[-1] = sql_select_parts[-1].rstrip(",")
    sql_select_statement = "\n".join(sql_select_parts)

    sql_content = f"""-- SQL transformation for {os.path.basename(csv_path)} generated by {os.path.basename(__file__)}
{{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
{sql_select_statement}
FROM read_csv('{csv_path}', columns={{{columns}}})"""

    sql_file_path = os.path.join(models_dir, f"{materialized_name}.sql")
    with open(sql_file_path, "w") as sql_file:
        sql_file.write(sql_content)
```
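The renaming loop above can be exercised with a toy data dictionary to show the shape of the SELECT it emits. AGEP, ST, and PWGTP are real PUMS column mnemonics, but the dictionary entries here are simplified stand-ins, and PWGTP is deliberately left out to show the fallback to the raw column name:

```python
# Toy data dictionary: header -> {"Description": ...}, as in the real
# PUMS_Data_Dictionary JSON consumed by generate_sql_with_types.py.
data_dict = {
    "AGEP": {"Description": "Age"},
    "ST": {"Description": "State Code"},
}
headers = ["AGEP", "ST", "PWGTP"]  # PWGTP is missing from the toy dictionary

# Same logic as the loop in the script: rename each column to its
# description, falling back to the header itself when unknown.
sql_select_parts = ["SELECT"]
for header in headers:
    col_info = data_dict.get(header, {"Description": header})
    description = col_info["Description"]
    sql_select_parts.append(f'    {header} AS "{description}",')

sql_select_parts[-1] = sql_select_parts[-1].rstrip(",")
sql_select_statement = "\n".join(sql_select_parts)
print(sql_select_statement)
```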
  661. Relative file path: download_industry_codes.ipynb
  662. ```
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_excel(\"https://www.census.gov/naics/2017NAICS/2-6%20digit_2017_Codes.xlsx\", dtype=str, skiprows=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>Seq. No.</th>\n",
       " <th>2017 NAICS US Code</th>\n",
       " <th>2017 NAICS US Title</th>\n",
       " <th>Unnamed: 3</th>\n",
       " <th>Unnamed: 4</th>\n",
       " <th>Unnamed: 5</th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>0</th>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1</th>\n",
       " <td>1</td>\n",
       " <td>11</td>\n",
       " <td>Agriculture, Forestry, Fishing and Hunting</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2</th>\n",
       " <td>2</td>\n",
       " <td>111</td>\n",
       " <td>Crop Production</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>3</th>\n",
       " <td>3</td>\n",
       " <td>1111</td>\n",
       " <td>Oilseed and Grain Farming</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>4</th>\n",
       " <td>4</td>\n",
       " <td>11111</td>\n",
       " <td>Soybean Farming</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>...</th>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2192</th>\n",
       " <td>2192</td>\n",
       " <td>9281</td>\n",
       " <td>National Security and International Affairs</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2193</th>\n",
       " <td>2193</td>\n",
       " <td>92811</td>\n",
       " <td>National Security</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2194</th>\n",
       " <td>2194</td>\n",
       " <td>928110</td>\n",
       " <td>National Security</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2195</th>\n",
       " <td>2195</td>\n",
       " <td>92812</td>\n",
       " <td>International Affairs</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2196</th>\n",
       " <td>2196</td>\n",
       " <td>928120</td>\n",
       " <td>International Affairs</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " <td>NaN</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "<p>2197 rows Γ— 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       " Seq. No. 2017 NAICS US Code \\\n",
       "0 NaN NaN \n",
       "1 1 11 \n",
       "2 2 111 \n",
       "3 3 1111 \n",
       "4 4 11111 \n",
       "... ... ... \n",
       "2192 2192 9281 \n",
       "2193 2193 92811 \n",
       "2194 2194 928110 \n",
       "2195 2195 92812 \n",
       "2196 2196 928120 \n",
       "\n",
       " 2017 NAICS US Title Unnamed: 3 Unnamed: 4 \\\n",
       "0 NaN NaN NaN \n",
       "1 Agriculture, Forestry, Fishing and Hunting NaN NaN \n",
       "2 Crop Production NaN NaN \n",
       "3 Oilseed and Grain Farming NaN NaN \n",
       "4 Soybean Farming NaN NaN \n",
       "... ... ... ... \n",
       "2192 National Security and International Affairs NaN NaN \n",
       "2193 National Security NaN NaN \n",
       "2194 National Security NaN NaN \n",
       "2195 International Affairs NaN NaN \n",
       "2196 International Affairs NaN NaN \n",
       "\n",
       " Unnamed: 5 \n",
       "0 NaN \n",
       "1 NaN \n",
       "2 NaN \n",
       "3 NaN \n",
       "4 NaN \n",
       "... ... \n",
       "2192 NaN \n",
       "2193 NaN \n",
       "2194 NaN \n",
       "2195 NaN \n",
       "2196 NaN \n",
       "\n",
       "[2197 rows x 6 columns]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "names = df[df['2017 NAICS US Code'].str.len() == 2][['2017 NAICS US Title']].values.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Agriculture, Forestry, Fishing and Hunting'],\n",
       " ['Mining, Quarrying, and Oil and Gas Extraction'],\n",
       " ['Utilities'],\n",
       " ['Construction'],\n",
       " ['Wholesale Trade'],\n",
       " ['Information'],\n",
       " ['Finance and Insurance'],\n",
       " ['Real Estate and Rental and Leasing'],\n",
       " ['Professional, Scientific, and Technical Services'],\n",
       " ['Management of Companies and Enterprises'],\n",
       " ['Administrative and Support and Waste Management and Remediation Services'],\n",
       " ['Educational Services'],\n",
       " ['Health Care and Social Assistance'],\n",
       " ['Arts, Entertainment, and Recreation'],\n",
       " ['Accommodation and Food Services'],\n",
       " ['Other Services (except Public Administration)'],\n",
       " ['Public Administration']]"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df['2017 NAICS US Code'].str.len() == 1][['2017 NAICS US Title']].values.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "names = df[df['2017 NAICS US Code'].str.len() == 3][['2017 NAICS US Title']].values.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Crop Production'],\n",
       " ['Animal Production and Aquaculture'],\n",
       " ['Forestry and Logging'],\n",
       " ['Fishing, Hunting and Trapping'],\n",
       " ['Support Activities for Agriculture and Forestry'],\n",
       " ['Oil and Gas Extraction'],\n",
       " ['Mining (except Oil and Gas)'],\n",
       " ['Support Activities for Mining'],\n",
       " ['Utilities '],\n",
       " ['Construction of Buildings'],\n",
       " ['Heavy and Civil Engineering Construction'],\n",
       " ['Specialty Trade Contractors'],\n",
       " ['Food Manufacturing'],\n",
       " ['Beverage and Tobacco Product Manufacturing'],\n",
       " ['Textile Mills'],\n",
       " ['Textile Product Mills'],\n",
       " ['Apparel Manufacturing'],\n",
       " ['Leather and Allied Product Manufacturing'],\n",
       " ['Wood Product Manufacturing'],\n",
       " ['Paper Manufacturing'],\n",
       " ['Printing and Related Support Activities'],\n",
       " ['Petroleum and Coal Products Manufacturing'],\n",
       " ['Chemical Manufacturing'],\n",
       " ['Plastics and Rubber Products Manufacturing'],\n",
       " ['Nonmetallic Mineral Product Manufacturing'],\n",
       " ['Primary Metal Manufacturing'],\n",
       " ['Fabricated Metal Product Manufacturing'],\n",
       " ['Machinery Manufacturing'],\n",
       " ['Computer and Electronic Product Manufacturing'],\n",
       " ['Electrical Equipment, Appliance, and Component Manufacturing'],\n",
       " ['Transportation Equipment Manufacturing'],\n",
       " ['Furniture and Related Product Manufacturing'],\n",
       " ['Miscellaneous Manufacturing'],\n",
       " ['Merchant Wholesalers, Durable Goods '],\n",
       " ['Merchant Wholesalers, Nondurable Goods '],\n",
       " ['Wholesale Electronic Markets and Agents and Brokers '],\n",
       " ['Motor Vehicle and Parts Dealers '],\n",
       " ['Furniture and Home Furnishings Stores '],\n",
       " ['Electronics and Appliance Stores '],\n",
       " ['Building Material and Garden Equipment and Supplies Dealers '],\n",
       " ['Food and Beverage Stores '],\n",
       " ['Health and Personal Care Stores '],\n",
       " ['Gasoline Stations '],\n",
       " ['Clothing and Clothing Accessories Stores '],\n",
       " ['Sporting Goods, Hobby, Musical Instrument, and Book Stores '],\n",
       " ['General Merchandise Stores '],\n",
       " ['Miscellaneous Store Retailers '],\n",
       " ['Nonstore Retailers '],\n",
       " ['Air Transportation'],\n",
       " ['Rail Transportation'],\n",
       " ['Water Transportation'],\n",
       " ['Truck Transportation'],\n",
       " ['Transit and Ground Passenger Transportation'],\n",
       " ['Pipeline Transportation'],\n",
       " ['Scenic and Sightseeing Transportation'],\n",
       " ['Support Activities for Transportation'],\n",
       " ['Postal Service'],\n",
       " ['Couriers and Messengers'],\n",
       " ['Warehousing and Storage'],\n",
       " ['Publishing Industries (except Internet)'],\n",
       " ['Motion Picture and Sound Recording Industries'],\n",
       " ['Broadcasting (except Internet)'],\n",
       " ['Telecommunications'],\n",
       " ['Data Processing, Hosting, and Related Services'],\n",
       " ['Other Information Services'],\n",
       " ['Monetary Authorities-Central Bank'],\n",
       " ['Credit Intermediation and Related Activities'],\n",
       " ['Securities, Commodity Contracts, and Other Financial Investments and Related Activities'],\n",
       " ['Insurance Carriers and Related Activities'],\n",
       " ['Funds, Trusts, and Other Financial Vehicles '],\n",
       " ['Real Estate'],\n",
       " ['Rental and Leasing Services'],\n",
       " ['Lessors of Nonfinancial Intangible Assets (except Copyrighted Works)'],\n",
       " ['Professional, Scientific, and Technical Services'],\n",
       " ['Management of Companies and Enterprises'],\n",
       " ['Administrative and Support Services'],\n",
       " ['Waste Management and Remediation Services'],\n",
       " ['Educational Services'],\n",
       " ['Ambulatory Health Care Services'],\n",
       " ['Hospitals'],\n",
       " ['Nursing and Residential Care Facilities'],\n",
       " ['Social Assistance'],\n",
       " ['Performing Arts, Spectator Sports, and Related Industries'],\n",
       " ['Museums, Historical Sites, and Similar Institutions'],\n",
       " ['Amusement, Gambling, and Recreation Industries'],\n",
       " ['Accommodation'],\n",
       " ['Food Services and Drinking Places'],\n",
       " ['Repair and Maintenance'],\n",
       " ['Personal and Laundry Services'],\n",
       " ['Religious, Grantmaking, Civic, Professional, and Similar Organizations'],\n",
       " ['Private Households'],\n",
       " ['Executive, Legislative, and Other General Government Support '],\n",
       " ['Justice, Public Order, and Safety Activities '],\n",
       " ['Administration of Human Resource Programs '],\n",
       " ['Administration of Environmental Quality Programs '],\n",
       " ['Administration of Housing Programs, Urban Planning, and Community Development '],\n",
       " ['Administration of Economic Programs '],\n",
       " ['Space Research and Technology '],\n",
       " ['National Security and International Affairs ']]"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ind = pd.read_parquet(path='/Users/me/data/american_community_survey/upload/2022_acs_pums_individual_people_united_states_first_tranche.parquet')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts = df_ind['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].apply(lambda x: x[:3] if x else None).value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes\n",
       "PRF 126549\n",
       "RET 112195\n",
       "MED 109888\n",
       "EDU 100248\n",
       "MFG 98385\n",
       "ENT 94067\n",
       "CON 64671\n",
       "FIN 63735\n",
       "SRV 50004\n",
       "ADM 49945\n",
       "TRN 48243\n",
       "SCA 23852\n",
       "WHL 20993\n",
       "INF 19848\n",
       "AGR 16734\n",
       "UTL 8484\n",
       "MIL 7768\n",
       "Une 5492\n",
       "EXT 2678\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['PRF', 'RET', 'MED', 'EDU', 'MFG', 'ENT', 'CON', 'FIN', 'SRV', 'ADM',\n",
       " 'TRN', 'SCA', 'WHL', 'INF', 'AGR', 'UTL', 'MIL', 'Une', 'EXT'],\n",
       " dtype='object', name='North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes')"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counts.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes\n",
       "EXT-Support Activities For Mining 1253\n",
       "EXT-Nonmetallic Mineral Mining And Quarrying 549\n",
       "EXT-Oil And Gas Extraction 330\n",
       "EXT-Metal Ore Mining 298\n",
       "EXT-Coal Mining 248\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ind[df_ind['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].apply(lambda x: x.startswith('EXT') if x else False)]['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes\n",
       "Unemployed, With No Work Experience In The Last 5 Years Or Earlier Or Never Worked 5492\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ind[df_ind['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].apply(lambda x: x.startswith('Une') if x else False)]['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(19, 17)"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(counts), len(names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# # https://usa.ipums.org/usa-action/variables/US2001A_1071#description_section\n",
    "industry_codes = {\n",
    " 'AGR': 'Agriculture, Forestry, Fishing and Hunting',\n",
    " 'EXT': 'Mining, Quarrying, and Oil and Gas Extraction',\n",
    " 'UTL': 'Utilities',\n",
    " 'CON': 'Construction',\n",
    " 'WHL': 'Wholesale Trade',\n",
    " 'INF': 'Information',\n",
    " 'FIN': 'Finance and Insurance',\n",
    " 'RET': 'Retail Trade',\n",
    " 'PRF': 'Professional, Scientific, and Technical Services',\n",
    " 'SRV': 'Management of Companies and Enterprises',\n",
    " 'ADM': 'Administrative and Support and Waste Management and Remediation Services',\n",
    " 'EDU': 'Educational Services',\n",
    " 'MED': 'Health Care and Social Assistance',\n",
    " 'ENT': 'Arts, Entertainment, and Recreation',\n",
    " 'MFG': 'Manufacturing', \n",
    " 'TRN': 'Transportation and Warehousing',\n",
    " 'SCA': 'Services',\n",
    " 'MIL': 'Public Administration',\n",
    " 'UNE': 'Unemployed, With No Work Experience In The Last 5 Years Or Earlier Or Never Worked'\n",
    "}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
```
Relative file path: generate_sql_with_mapped_values.py
```
import pandas as pd
import sys
import os
import json

def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    type_char = folder_name.split("_")[1][0].lower()
    folder_code = folder_name.split("_")[1][1:].upper()
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")

    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()


def clean_enum_value(value):
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value


if len(sys.argv) < 3:
    print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
    sys.exit(1)

parquet_database_path, data_dictionary_path = sys.argv[1:3]

with open(data_dictionary_path, "r") as json_file:
    data_dict = json.load(json_file)

state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}

df_csv_paths = pd.read_parquet(parquet_database_path)
models_dir = "models/public_use_microdata_sample/generated/mapped_values"
os.makedirs(models_dir, exist_ok=True)

for csv_path in df_csv_paths["csv_path"]:
    folder_name = os.path.basename(os.path.dirname(csv_path))
    csv_name = os.path.basename(csv_path).split(".")[0]
    materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)

    df_headers = pd.read_csv(csv_path, nrows=0)
    column_types = {column: 'VARCHAR' for column in df_headers.columns}
    columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])

    sql_select_parts = ["SELECT"]
    enum_creation_statements = []

    for header in df_headers.columns:
        if header in data_dict and "Values" in data_dict[header]:
            # Mapping codes to labels using CASE statement
            if any(['Integer weight' in value for value in data_dict[header]["Values"].values()]):
                print(f"Selecting unmapped {header} because it contains an 'Integer weight' value")
                sql_select_parts.append(f" {header},")
            else:
                # value_mapping = ' '.join([f"WHEN '{clean_enum_value(code)}' THEN '{clean_enum_value(label)}'" for code, label in data_dict[header]["Values"].items()])
                # mapped_column = f"CASE {header} {value_mapping} END AS {header}"
                # sql_select_parts.append(f" {mapped_column},")
                # Improved mapping with pretty printing

                col_info = data_dict.get(header, {"Description": header})
                description = col_info["Description"]
                value_mapping = "\n\t\t".join([
                    f"WHEN '{clean_enum_value(code)}' THEN '{clean_enum_value(label)}'"
                    for code, label in data_dict[header]["Values"].items()
                ])
                # mapped_column = f"""CASE {header}\n\t\t{value_mapping}\n\tEND AS \"{description}\""""
                mapped_column = f"""CASE {header}\n\t\t{value_mapping}\n\tEND AS {header}"""
                sql_select_parts.append(f"\t{mapped_column},")
        elif header in data_dict:
            # Direct mapping for columns without "Values"
            description = data_dict[header]["Description"].replace("'", "''")
            # sql_select_parts.append(f' {header} AS "{description}",')
            sql_select_parts.append(f' {header},')
        else:
            sql_select_parts.append(f" {header},")

    sql_select_parts[-1] = sql_select_parts[-1].rstrip(',')
    sql_select_statement = "\n".join(sql_select_parts)
    newline = "\n"
    sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
{newline.join(enum_creation_statements)}

{{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
{sql_select_statement}
FROM read_csv('{csv_path}',
    parallel=False,
    all_varchar=True,
    auto_detect=True)"""

    sql_file_path = os.path.join(models_dir, f"{materialized_name}_mapped.sql")
    with open(sql_file_path, "w") as sql_file:
        sql_file.write(sql_content)
```
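For reference, a minimal sketch of what `clean_enum_value` does to a value label before it is embedded in a single-quoted SQL literal. The sample label `"N/A (GQ/vacant)"` is hypothetical, not taken from the PUMS dictionary:

```python
# Sketch of clean_enum_value from generate_sql_with_mapped_values.py,
# applied to a hypothetical PUMS-style value label.
def clean_enum_value(value):
    # Strip quote characters and normalize punctuation so the label
    # can sit safely inside a single-quoted SQL string literal.
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value

cleaned = clean_enum_value("N/A (GQ/vacant)")
# cleaned == "Not applicable - GQ or vacant"
```

Note that "N/A" is substituted before "/" so the slash in "N/A" itself never becomes " or ".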
Relative file path: generate_sql_with_renamed_columns.py
```
import pandas as pd
import duckdb
import sys
import os
import json


def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    # Extract the state code and type (P or H) from the folder name
    type_char = folder_name.split("_")[1][
        0
    ].lower()  # Assuming folder format is "csv_PXX" or "csv_HXX"
    folder_code = folder_name.split("_")[1][1:].upper()

    # Determine the human-readable name based on the type character
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        # Get the national-level name from the lookup table
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
        print(csv_code, name)
    elif len(folder_code) == 2:
        # Get the state name from the lookup table
        name = state_lookup.get(folder_code, "Unknown state code")
        print(folder_code, name)
    else:
        raise ValueError(f"Invalid code: {folder_code}")
    # Return the formatted name
    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()


if len(sys.argv) < 3:
    print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
    sys.exit(1)

parquet_database_path, data_dictionary_path = sys.argv[1:3]

# Load the data dictionary from the JSON file
with open(data_dictionary_path, "r") as json_file:
    data_dict = json.load(json_file)

# Generate lookup table for state codes
state_lookup = {
    code: name
    for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]
}

# define short codes for first and second tranches of national-level data
national_lookup = {
    "USA": "United States first tranche",
    "USB": "United States second tranche",
}

# Connect to DuckDB
conn = duckdb.connect(database=":memory:", read_only=False)

# Assuming the Parquet file contains paths to CSV files
df_csv_paths = pd.read_parquet(parquet_database_path)

models_dir = "models/public_use_microdata_sample/generated/renamed_columns"
os.makedirs(models_dir, exist_ok=True)

for csv_path in df_csv_paths["csv_path"]:
    folder_name = os.path.basename(os.path.dirname(csv_path))
    csv_name = os.path.basename(csv_path)
    csv_name = csv_name.split(".")[0]
    materialized_name = generate_materialized_name(
        folder_name, csv_name, state_lookup, national_lookup
    )

    df_headers = pd.read_csv(csv_path, nrows=0)
    sql_select_parts = ["SELECT"]
    for header in df_headers.columns:
        col_info = data_dict.get(header, {"Description": header})
        description = col_info["Description"]
        sql_select_parts.append(f' {header} AS "{description}",')

    sql_select_parts[-1] = sql_select_parts[-1].rstrip(",")
    sql_select_statement = "\n".join(sql_select_parts)

    sql_content = f"""-- SQL transformation for {os.path.basename(csv_path)} generated by {os.path.basename(__file__)}
{{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
{sql_select_statement}
FROM read_csv_auto('{csv_path}')"""

    sql_file_path = os.path.join(models_dir, f"{materialized_name}_renamed_columns.sql")
    with open(sql_file_path, "w") as sql_file:
        sql_file.write(sql_content)
```
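To make the folder/file-name convention concrete, here is a sketch of how `generate_materialized_name` resolves a national-level and a state-level PUMS download; the lookup tables are cut down to only the entries the example needs:

```python
# Sketch of generate_materialized_name (shared by the generator scripts),
# shown with minimal lookup tables instead of the full PUMS data dictionary.
def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    type_char = folder_name.split("_")[1][0].lower()      # "p" = person, "h" = housing
    folder_code = folder_name.split("_")[1][1:].upper()   # e.g. "US" or a state code
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        # National files encode the tranche in the CSV name, e.g. psam_pusa
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")
    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()

state_lookup = {"CA": "California"}
national_lookup = {"USA": "United States first tranche"}

national = generate_materialized_name("csv_pus", "psam_pusa", state_lookup, national_lookup)
# national == "individual_people_united_states_first_tranche"
state = generate_materialized_name("csv_hca", "psam_h06", state_lookup, national_lookup)
# state == "housing_units_california"
```

The national result matches the Parquet file name loaded in the notebook above (`2022_acs_pums_individual_people_united_states_first_tranche.parquet`).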
Relative file path: generate_sql_schemas_for_extracted_csv_files.py
```
import pandas as pd
import sys
import os
import json

def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    type_char = folder_name.split("_")[1][0].lower()
    folder_code = folder_name.split("_")[1][1:].upper()
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"

    if folder_code == "US":
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")

    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()

if len(sys.argv) < 3:
    print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
    sys.exit(1)

parquet_database_path, data_dictionary_path = sys.argv[1:3]

with open(data_dictionary_path, "r") as json_file:
    data_dict = json.load(json_file)

state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}

df_csv_paths = pd.read_parquet(parquet_database_path)
models_dir = "models/public_use_microdata_sample/generated"
os.makedirs(models_dir, exist_ok=True)

for csv_path in df_csv_paths["csv_path"]:
    folder_name = os.path.basename(os.path.dirname(csv_path))
    csv_name = os.path.basename(csv_path).replace(".csv", "")
    materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)

    df_headers = pd.read_csv(csv_path, nrows=0)
    enum_creation_statements = []
    sql_select_parts = []

    for header in df_headers.columns:
        if header in data_dict and "Values" in data_dict[header]:
            enum_values = [f"'{x}'" for x in data_dict[header]["Values"].values()]
            enum_name = f"{header}_enum"
            enum_creation_statements.append(f"CREATE TYPE {enum_name} AS ENUM ({','.join(enum_values)});")
            key = 'Description'
            mapped_column = f'CAST({header} AS {enum_name}) AS "{data_dict[header][key]}"'
            sql_select_parts.append(f" {mapped_column}")
        elif header in data_dict:
            description = data_dict[header]["Description"].replace("'", "''")
            sql_select_parts.append(f' {header} AS "{description}"')
        else:
            sql_select_parts.append(f" {header}")

    sql_select_statement = ",\n".join(sql_select_parts)
    newline = "\n"
    sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
{newline.join(enum_creation_statements)}

{{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
SELECT
{sql_select_statement}
FROM read_csv_auto('{csv_path}');
"""

    sql_file_path = os.path.join(models_dir, f"{materialized_name}.sql")
    with open(sql_file_path, "w") as sql_file:
        sql_file.write(sql_content)
```
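A sketch of the `CREATE TYPE` / `CAST` pair this script emits for a column that has a `"Values"` mapping; the `SEX` entry below is a hypothetical data-dictionary fragment in the shape the script expects:

```python
# Sketch of the enum-generation step in
# generate_sql_schemas_for_extracted_csv_files.py, on a toy dictionary entry.
data_dict = {"SEX": {"Description": "Sex", "Values": {"1": "Male", "2": "Female"}}}
header = "SEX"

# Build the ENUM type from the value labels, then cast the raw column to it.
enum_values = [f"'{x}'" for x in data_dict[header]["Values"].values()]
enum_name = f"{header}_enum"
enum_statement = f"CREATE TYPE {enum_name} AS ENUM ({','.join(enum_values)});"
mapped_column = f'CAST({header} AS {enum_name}) AS "{data_dict[header]["Description"]}"'
# enum_statement == "CREATE TYPE SEX_enum AS ENUM ('Male','Female');"
# mapped_column  == 'CAST(SEX AS SEX_enum) AS "Sex"'
```

Note that labels are interpolated raw here; a label containing a single quote would break the generated DDL, since this script does not run values through `clean_enum_value`.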
Relative file path: parse_data_dictionary.py
```
  1559. import requests
  1560. import csv
  1561. from io import StringIO
  1562. import json
  1563. import sys
  1564.  
  1565.  
  1566. def csv_to_json_dictionary(url):
  1567. response = requests.get(url)
  1568. response.raise_for_status() # Ensure the request was successful
  1569.  
  1570. # Read the CSV content into a dictionary structure
  1571. data_dictionary = {}
  1572. reader = csv.reader(StringIO(response.text))
  1573.  
  1574. for row in reader:
  1575. if row[0] == "NAME":
  1576. # Initialize the variable entry with its details
  1577. data_dictionary[row[1]] = {
  1578. "Type": row[2],
  1579. "Length": row[3],
  1580. "Description": row[4],
  1581. "Values": {},
  1582. }
  1583. elif row[0] == "VAL" and row[1] in data_dictionary:
  1584. # Append value mappings to the variable
  1585. data_dictionary[row[1]]["Values"][row[4]] = row[6] if len(row) > 6 else ""
  1586.  
  1587. # Return the constructed dictionary
  1588. return data_dictionary
  1589.  
  1590.  
  1591. if __name__ == "__main__":
  1592. if len(sys.argv) < 2:
  1593. print("Usage: python script.py <data_dictionary_url>")
  1594. sys.exit(1)
  1595.  
  1596. url = sys.argv[1]
  1597. data_dict = csv_to_json_dictionary(url)
  1598.  
  1599. # Specify the JSON file name
  1600. json_file_name = "PUMS_Data_Dictionary.json"
  1601. with open(json_file_name, "w") as json_file:
  1602. json.dump(data_dict, json_file, indent=4)
  1603.  
  1604. print(f"Data dictionary processed and saved to {json_file_name}.")
  1605. ```
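The row-parsing logic in `csv_to_json_dictionary` can be checked without a network call. A minimal sketch with fabricated rows follows; the column positions mirror the indexing the script uses, but the real PUMS dictionary layout may differ:

```python
import csv
from io import StringIO

# Sketch of the NAME/VAL parsing logic above, run against fabricated
# rows instead of a downloaded file. Column layout is illustrative only.
sample = (
    "NAME,SEX,C,1,Sex of person,,\n"
    "VAL,SEX,C,1,1,,Male\n"
    "VAL,SEX,C,1,2,,Female\n"
)

data_dictionary = {}
for row in csv.reader(StringIO(sample)):
    if row[0] == "NAME":
        # New variable entry keyed by its name (row[1])
        data_dictionary[row[1]] = {
            "Type": row[2],
            "Length": row[3],
            "Description": row[4],
            "Values": {},
        }
    elif row[0] == "VAL" and row[1] in data_dictionary:
        # Map the code (row[4]) to its label (row[6]) when present
        data_dictionary[row[1]]["Values"][row[4]] = row[6] if len(row) > 6 else ""

print(data_dictionary["SEX"]["Values"])  # {'1': 'Male', '2': 'Female'}
```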

and take this current file structure:

```
.
├── ahrq.gov
│   ├── generated
│   │   └── with_types
│   │       ├── syhdr_commercial_inpatient_2016.sql
│   │       ├── syhdr_commercial_outpatient_2016.sql
│   │       ├── syhdr_commercial_person_2016.sql
│   │       ├── syhdr_commercial_pharmacy_2016.sql
│   │       ├── syhdr_medicaid_inpatient_2016.sql
│   │       ├── syhdr_medicaid_outpatient_2016.sql
│   │       ├── syhdr_medicaid_person_2016.sql
│   │       ├── syhdr_medicaid_pharmacy_2016.sql
│   │       ├── syhdr_medicaid_provider_2016.sql
│   │       ├── syhdr_medicare_inpatient_2016.sql
│   │       ├── syhdr_medicare_outpatient_2016.sql
│   │       ├── syhdr_medicare_person_2016.sql
│   │       ├── syhdr_medicare_pharmacy_2016.sql
│   │       └── syhdr_medicare_provider_2016.sql
│   └── sdoh
│       ├── sdoh_county.sql
│       ├── sdoh_tract.sql
│       └── sdoh_zipcode.sql
├── bls.gov
│   ├── consumer_price_index.sql
│   └── download_consumer_price_index.py
├── config.yml
├── figures
│   ├── insurance_plan_payment_histogram.sql
│   └── insurance_plan_payment_histogram_inflation_adjusted.sql
├── mitre.org
│   └── synthea.sql
└── sources.yml

8 directories, 24 files
```
Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_pharmacy_2016.sql
```
-- SQL model for syhdr_medicaid_pharmacy_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    PHMCY_CLM_NUM::NUMERIC AS PHMCY_CLM_NUM,
    CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
    LINE_NBR::VARCHAR,
    FILL_DT::DATE AS FILL_DT,
    SYNTHETIC_DRUG_ID::VARCHAR,
    GENERIC_DRUG_NAME::VARCHAR AS GENERIC_DRUG_NAME,
    replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
    replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
FROM read_csv('~/data/syh_dr/syhdr_medicaid_pharmacy_2016.CSV', header=True, null_padding=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_inpatient_2016.sql
```
-- SQL model for syhdr_medicare_inpatient_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    FACILITY_ID::UBIGINT AS FACILITY_ID,
    CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
    AT_SPCLTY::VARCHAR AS AT_SPCLTY,
    SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
    SRVC_END_DATE::DATE AS SRVC_END_DATE,
    LOS::UINTEGER AS LOS,
    ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
    TOB_CD::VARCHAR AS TOB_CD,
    CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
    DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
    PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
    PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
    ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
    ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
    ICD_DX_CD_3::VARCHAR,
    ICD_DX_CD_4::VARCHAR,
    ICD_DX_CD_5::VARCHAR,
    ICD_DX_CD_6::VARCHAR,
    ICD_DX_CD_7::VARCHAR,
    ICD_DX_CD_8::VARCHAR,
    ICD_DX_CD_9::VARCHAR,
    ICD_DX_CD_10::VARCHAR,
    ICD_DX_CD_11::VARCHAR,
    ICD_DX_CD_12::VARCHAR,
    ICD_DX_CD_13::VARCHAR,
    ICD_DX_CD_14::VARCHAR,
    ICD_DX_CD_15::VARCHAR,
    ICD_DX_CD_16::VARCHAR,
    ICD_DX_CD_17::VARCHAR,
    ICD_DX_CD_18::VARCHAR,
    ICD_DX_CD_19::VARCHAR,
    ICD_DX_CD_20::VARCHAR,
    ICD_DX_CD_21::VARCHAR,
    ICD_DX_CD_22::VARCHAR,
    ICD_DX_CD_23::VARCHAR,
    ICD_DX_CD_24::VARCHAR,
    ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
    ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
    ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
    ICD_PRCDR_CD_3::VARCHAR,
    ICD_PRCDR_CD_4::VARCHAR,
    ICD_PRCDR_CD_5::VARCHAR,
    ICD_PRCDR_CD_6::VARCHAR,
    ICD_PRCDR_CD_7::VARCHAR,
    ICD_PRCDR_CD_8::VARCHAR,
    ICD_PRCDR_CD_9::VARCHAR,
    ICD_PRCDR_CD_10::VARCHAR,
    ICD_PRCDR_CD_11::VARCHAR,
    ICD_PRCDR_CD_12::VARCHAR,
    ICD_PRCDR_CD_13::VARCHAR,
    ICD_PRCDR_CD_14::VARCHAR,
    ICD_PRCDR_CD_15::VARCHAR,
    ICD_PRCDR_CD_16::VARCHAR,
    ICD_PRCDR_CD_17::VARCHAR,
    ICD_PRCDR_CD_18::VARCHAR,
    ICD_PRCDR_CD_19::VARCHAR,
    ICD_PRCDR_CD_20::VARCHAR,
    ICD_PRCDR_CD_21::VARCHAR,
    ICD_PRCDR_CD_22::VARCHAR,
    ICD_PRCDR_CD_23::VARCHAR,
    ICD_PRCDR_CD_24::VARCHAR,
    ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
    CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
    CPT_PRCDR_CD_2::VARCHAR,
    CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
    CPT_PRCDR_CD_4::VARCHAR,
    CPT_PRCDR_CD_5::VARCHAR,
    CPT_PRCDR_CD_6::VARCHAR,
    CPT_PRCDR_CD_7::VARCHAR,
    CPT_PRCDR_CD_8::VARCHAR,
    CPT_PRCDR_CD_9::VARCHAR,
    CPT_PRCDR_CD_10::VARCHAR,
    CPT_PRCDR_CD_11::VARCHAR,
    CPT_PRCDR_CD_12::VARCHAR,
    CPT_PRCDR_CD_13::VARCHAR,
    CPT_PRCDR_CD_14::VARCHAR,
    CPT_PRCDR_CD_15::VARCHAR,
    CPT_PRCDR_CD_16::VARCHAR,
    CPT_PRCDR_CD_17::VARCHAR,
    CPT_PRCDR_CD_18::VARCHAR,
    CPT_PRCDR_CD_19::VARCHAR,
    CPT_PRCDR_CD_20::VARCHAR,
    CPT_PRCDR_CD_21::VARCHAR,
    CPT_PRCDR_CD_22::VARCHAR,
    CPT_PRCDR_CD_23::VARCHAR,
    CPT_PRCDR_CD_24::VARCHAR,
    CPT_PRCDR_CD_25::VARCHAR,
    CPT_PRCDR_CD_26::VARCHAR,
    CPT_PRCDR_CD_27::VARCHAR,
    CPT_PRCDR_CD_28::VARCHAR,
    CPT_PRCDR_CD_29::VARCHAR,
    CPT_PRCDR_CD_30::VARCHAR,
    CPT_PRCDR_CD_31::VARCHAR,
    CPT_PRCDR_CD_32::VARCHAR,
    CPT_PRCDR_CD_33::VARCHAR,
    CPT_PRCDR_CD_34::VARCHAR,
    CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
    replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
    replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
FROM read_csv('~/data/syh_dr/syhdr_medicare_inpatient_2016.CSV', header=True, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_provider_2016.sql
```
-- SQL model for syhdr_medicare_provider_2016.csv
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    Facility_ID::VARCHAR,
    Prvdr_Ctgry_Cd::VARCHAR,
    Prvdr_Ownrshp_Cd::VARCHAR,
    Prvdr_Prtcptn_Cd::VARCHAR
FROM read_csv('~/data/syh_dr/syhdr_medicare_provider_2016.csv', header=True, null_padding=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_outpatient_2016.sql
```
-- SQL model for syhdr_commercial_outpatient_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    FACILITY_ID::UBIGINT AS FACILITY_ID,
    CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
    AT_SPCLTY::VARCHAR AS AT_SPCLTY,
    SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
    SRVC_END_DATE::DATE AS SRVC_END_DATE,
    LOS::UINTEGER AS LOS,
    ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
    TOB_CD::VARCHAR AS TOB_CD,
    CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
    DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
    PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
    PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
    ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
    ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
    ICD_DX_CD_3::VARCHAR,
    ICD_DX_CD_4::VARCHAR,
    ICD_DX_CD_5::VARCHAR,
    ICD_DX_CD_6::VARCHAR,
    ICD_DX_CD_7::VARCHAR,
    ICD_DX_CD_8::VARCHAR,
    ICD_DX_CD_9::VARCHAR,
    ICD_DX_CD_10::VARCHAR,
    ICD_DX_CD_11::VARCHAR,
    ICD_DX_CD_12::VARCHAR,
    ICD_DX_CD_13::VARCHAR,
    ICD_DX_CD_14::VARCHAR,
    ICD_DX_CD_15::VARCHAR,
    ICD_DX_CD_16::VARCHAR,
    ICD_DX_CD_17::VARCHAR,
    ICD_DX_CD_18::VARCHAR,
    ICD_DX_CD_19::VARCHAR,
    ICD_DX_CD_20::VARCHAR,
    ICD_DX_CD_21::VARCHAR,
    ICD_DX_CD_22::VARCHAR,
    ICD_DX_CD_23::VARCHAR,
    ICD_DX_CD_24::VARCHAR,
    ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
    ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
    ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
    ICD_PRCDR_CD_3::VARCHAR,
    ICD_PRCDR_CD_4::VARCHAR,
    ICD_PRCDR_CD_5::VARCHAR,
    ICD_PRCDR_CD_6::VARCHAR,
    ICD_PRCDR_CD_7::VARCHAR,
    ICD_PRCDR_CD_8::VARCHAR,
    ICD_PRCDR_CD_9::VARCHAR,
    ICD_PRCDR_CD_10::VARCHAR,
    ICD_PRCDR_CD_11::VARCHAR,
    ICD_PRCDR_CD_12::VARCHAR,
    ICD_PRCDR_CD_13::VARCHAR,
    ICD_PRCDR_CD_14::VARCHAR,
    ICD_PRCDR_CD_15::VARCHAR,
    ICD_PRCDR_CD_16::VARCHAR,
    ICD_PRCDR_CD_17::VARCHAR,
    ICD_PRCDR_CD_18::VARCHAR,
    ICD_PRCDR_CD_19::VARCHAR,
    ICD_PRCDR_CD_20::VARCHAR,
    ICD_PRCDR_CD_21::VARCHAR,
    ICD_PRCDR_CD_22::VARCHAR,
    ICD_PRCDR_CD_23::VARCHAR,
    ICD_PRCDR_CD_24::VARCHAR,
    ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
    CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
    CPT_PRCDR_CD_2::VARCHAR,
    CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
    CPT_PRCDR_CD_4::VARCHAR,
    CPT_PRCDR_CD_5::VARCHAR,
    CPT_PRCDR_CD_6::VARCHAR,
    CPT_PRCDR_CD_7::VARCHAR,
    CPT_PRCDR_CD_8::VARCHAR,
    CPT_PRCDR_CD_9::VARCHAR,
    CPT_PRCDR_CD_10::VARCHAR,
    CPT_PRCDR_CD_11::VARCHAR,
    CPT_PRCDR_CD_12::VARCHAR,
    CPT_PRCDR_CD_13::VARCHAR,
    CPT_PRCDR_CD_14::VARCHAR,
    CPT_PRCDR_CD_15::VARCHAR,
    CPT_PRCDR_CD_16::VARCHAR,
    CPT_PRCDR_CD_17::VARCHAR,
    CPT_PRCDR_CD_18::VARCHAR,
    CPT_PRCDR_CD_19::VARCHAR,
    CPT_PRCDR_CD_20::VARCHAR,
    CPT_PRCDR_CD_21::VARCHAR,
    CPT_PRCDR_CD_22::VARCHAR,
    CPT_PRCDR_CD_23::VARCHAR,
    CPT_PRCDR_CD_24::VARCHAR,
    CPT_PRCDR_CD_25::VARCHAR,
    CPT_PRCDR_CD_26::VARCHAR,
    CPT_PRCDR_CD_27::VARCHAR,
    CPT_PRCDR_CD_28::VARCHAR,
    CPT_PRCDR_CD_29::VARCHAR,
    CPT_PRCDR_CD_30::VARCHAR,
    CPT_PRCDR_CD_31::VARCHAR,
    CPT_PRCDR_CD_32::VARCHAR,
    CPT_PRCDR_CD_33::VARCHAR,
    CPT_PRCDR_CD_34::VARCHAR,
    CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
    replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
    replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
FROM read_csv('~/data/syh_dr/syhdr_commercial_outpatient_2016.CSV', header=True, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_person_2016.sql
```
-- SQL model for syhdr_commercial_person_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    AGE_LOW::NUMERIC AS AGE_LOW,
    AGE_HIGH::NUMERIC AS AGE_HIGH,
    SEX_IDENT_CD::VARCHAR AS SEX_IDENT_CD,
    STATE_CD::VARCHAR AS STATE_CD,
    COUNTY_FIPS_CD::VARCHAR AS COUNTY_FIPS_CD,
    ZIP_CD::VARCHAR AS ZIP_CD,
    PHRMCY_CVRG_1::NUMERIC AS PHRMCY_CVRG_1,
    PHRMCY_CVRG_2::VARCHAR,
    PHRMCY_CVRG_3::VARCHAR,
    PHRMCY_CVRG_4::VARCHAR,
    PHRMCY_CVRG_5::VARCHAR,
    PHRMCY_CVRG_6::VARCHAR,
    PHRMCY_CVRG_7::VARCHAR,
    PHRMCY_CVRG_8::VARCHAR,
    PHRMCY_CVRG_9::VARCHAR,
    PHRMCY_CVRG_10::VARCHAR,
    PHRMCY_CVRG_11::VARCHAR,
    PHRMCY_CVRG_12::NUMERIC AS PHRMCY_CVRG_12,
    CMRCL_INSRC_1::NUMERIC AS CMRCL_INSRC_1,
    CMRCL_INSRC_2::VARCHAR,
    CMRCL_INSRC_3::VARCHAR,
    CMRCL_INSRC_4::VARCHAR,
    CMRCL_INSRC_5::VARCHAR,
    CMRCL_INSRC_6::VARCHAR,
    CMRCL_INSRC_7::VARCHAR,
    CMRCL_INSRC_8::VARCHAR,
    CMRCL_INSRC_9::VARCHAR,
    CMRCL_INSRC_10::VARCHAR,
    CMRCL_INSRC_11::VARCHAR,
    CMRCL_INSRC_12::NUMERIC AS CMRCL_INSRC_12
FROM read_csv('~/data/syh_dr/syhdr_commercial_person_2016.CSV', header=True, null_padding=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_person_2016.sql
```
-- SQL model for syhdr_medicaid_person_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    MCAID_BENE_ID::UBIGINT AS MCAID_BENE_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    AGE_LOW::NUMERIC AS AGE_LOW,
    AGE_HIGH::NUMERIC AS AGE_HIGH,
    SEX_IDENT_CD::VARCHAR AS SEX_IDENT_CD,
    RACE_CD::VARCHAR AS RACE_CD,
    MCAID_SBMTTG_ST_CD::VARCHAR AS MCAID_SBMTTG_ST_CD,
    STATE_CD::VARCHAR AS STATE_CD,
    COUNTY_FIPS_CD::VARCHAR AS COUNTY_FIPS_CD,
    ZIP_CD::VARCHAR AS ZIP_CD,
    RSN_ENRLMT_CD::VARCHAR AS RSN_ENRLMT_CD,
    MDCD_ENRLMT_1::NUMERIC AS MDCD_ENRLMT_1,
    MDCD_ENRLMT_2::VARCHAR,
    MDCD_ENRLMT_3::VARCHAR,
    MDCD_ENRLMT_4::VARCHAR,
    MDCD_ENRLMT_5::VARCHAR,
    MDCD_ENRLMT_6::VARCHAR,
    MDCD_ENRLMT_7::VARCHAR,
    MDCD_ENRLMT_8::VARCHAR,
    MDCD_ENRLMT_9::VARCHAR,
    MDCD_ENRLMT_10::VARCHAR,
    MDCD_ENRLMT_11::VARCHAR,
    MDCD_ENRLMT_12::NUMERIC AS MDCD_ENRLMT_12,
    MDCD_MCO_ENRLMT_1::NUMERIC AS MDCD_MCO_ENRLMT_1,
    MDCD_MCO_ENRLMT_2::VARCHAR,
    MDCD_MCO_ENRLMT_3::VARCHAR,
    MDCD_MCO_ENRLMT_4::VARCHAR,
    MDCD_MCO_ENRLMT_5::VARCHAR,
    MDCD_MCO_ENRLMT_6::VARCHAR,
    MDCD_MCO_ENRLMT_7::VARCHAR,
    MDCD_MCO_ENRLMT_8::VARCHAR,
    MDCD_MCO_ENRLMT_9::VARCHAR,
    MDCD_MCO_ENRLMT_10::VARCHAR,
    MDCD_MCO_ENRLMT_11::VARCHAR,
    MDCD_MCO_ENRLMT_12::NUMERIC AS MDCD_MCO_ENRLMT_12,
    MDCD_CHIP_ENRLMT::NUMERIC AS MDCD_CHIP_ENRLMT,
    RSTRCTD_BNFTS_IND::VARCHAR,
    DUAL_ELGBL_1::NUMERIC AS DUAL_ELGBL_1,
    DUAL_ELGBL_2::VARCHAR,
    DUAL_ELGBL_3::VARCHAR,
    DUAL_ELGBL_4::VARCHAR,
    DUAL_ELGBL_5::VARCHAR,
    DUAL_ELGBL_6::VARCHAR,
    DUAL_ELGBL_7::VARCHAR,
    DUAL_ELGBL_8::VARCHAR,
    DUAL_ELGBL_9::VARCHAR,
    DUAL_ELGBL_10::VARCHAR,
    DUAL_ELGBL_11::VARCHAR,
    DUAL_ELGBL_12::NUMERIC AS DUAL_ELGBL_12
FROM read_csv('~/data/syh_dr/syhdr_medicaid_person_2016.CSV', header=True, null_padding=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_inpatient_2016.sql
```
-- SQL model for syhdr_commercial_inpatient_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    FACILITY_ID::UBIGINT AS FACILITY_ID,
    CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
    AT_SPCLTY::VARCHAR AS AT_SPCLTY,
    SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
    SRVC_END_DATE::DATE AS SRVC_END_DATE,
    LOS::UINTEGER AS LOS,
    ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
    TOB_CD::VARCHAR AS TOB_CD,
    CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
    DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
    PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
    PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
    ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
    ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
    ICD_DX_CD_3::VARCHAR,
    ICD_DX_CD_4::VARCHAR,
    ICD_DX_CD_5::VARCHAR,
    ICD_DX_CD_6::VARCHAR,
    ICD_DX_CD_7::VARCHAR,
    ICD_DX_CD_8::VARCHAR,
    ICD_DX_CD_9::VARCHAR,
    ICD_DX_CD_10::VARCHAR,
    ICD_DX_CD_11::VARCHAR,
    ICD_DX_CD_12::VARCHAR,
    ICD_DX_CD_13::VARCHAR,
    ICD_DX_CD_14::VARCHAR,
    ICD_DX_CD_15::VARCHAR,
    ICD_DX_CD_16::VARCHAR,
    ICD_DX_CD_17::VARCHAR,
    ICD_DX_CD_18::VARCHAR,
    ICD_DX_CD_19::VARCHAR,
    ICD_DX_CD_20::VARCHAR,
    ICD_DX_CD_21::VARCHAR,
    ICD_DX_CD_22::VARCHAR,
    ICD_DX_CD_23::VARCHAR,
    ICD_DX_CD_24::VARCHAR,
    ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
    ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
    ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
    ICD_PRCDR_CD_3::VARCHAR,
    ICD_PRCDR_CD_4::VARCHAR,
    ICD_PRCDR_CD_5::VARCHAR,
    ICD_PRCDR_CD_6::VARCHAR,
    ICD_PRCDR_CD_7::VARCHAR,
    ICD_PRCDR_CD_8::VARCHAR,
    ICD_PRCDR_CD_9::VARCHAR,
    ICD_PRCDR_CD_10::VARCHAR,
    ICD_PRCDR_CD_11::VARCHAR,
    ICD_PRCDR_CD_12::VARCHAR,
    ICD_PRCDR_CD_13::VARCHAR,
    ICD_PRCDR_CD_14::VARCHAR,
    ICD_PRCDR_CD_15::VARCHAR,
    ICD_PRCDR_CD_16::VARCHAR,
    ICD_PRCDR_CD_17::VARCHAR,
    ICD_PRCDR_CD_18::VARCHAR,
    ICD_PRCDR_CD_19::VARCHAR,
    ICD_PRCDR_CD_20::VARCHAR,
    ICD_PRCDR_CD_21::VARCHAR,
    ICD_PRCDR_CD_22::VARCHAR,
    ICD_PRCDR_CD_23::VARCHAR,
    ICD_PRCDR_CD_24::VARCHAR,
    ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
    CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
    CPT_PRCDR_CD_2::VARCHAR,
    CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
    CPT_PRCDR_CD_4::VARCHAR,
    CPT_PRCDR_CD_5::VARCHAR,
    CPT_PRCDR_CD_6::VARCHAR,
    CPT_PRCDR_CD_7::VARCHAR,
    CPT_PRCDR_CD_8::VARCHAR,
    CPT_PRCDR_CD_9::VARCHAR,
    CPT_PRCDR_CD_10::VARCHAR,
    CPT_PRCDR_CD_11::VARCHAR,
    CPT_PRCDR_CD_12::VARCHAR,
    CPT_PRCDR_CD_13::VARCHAR,
    CPT_PRCDR_CD_14::VARCHAR,
    CPT_PRCDR_CD_15::VARCHAR,
    CPT_PRCDR_CD_16::VARCHAR,
    CPT_PRCDR_CD_17::VARCHAR,
    CPT_PRCDR_CD_18::VARCHAR,
    CPT_PRCDR_CD_19::VARCHAR,
    CPT_PRCDR_CD_20::VARCHAR,
    CPT_PRCDR_CD_21::VARCHAR,
    CPT_PRCDR_CD_22::VARCHAR,
    CPT_PRCDR_CD_23::VARCHAR,
    CPT_PRCDR_CD_24::VARCHAR,
    CPT_PRCDR_CD_25::VARCHAR,
    CPT_PRCDR_CD_26::VARCHAR,
    CPT_PRCDR_CD_27::VARCHAR,
    CPT_PRCDR_CD_28::VARCHAR,
    CPT_PRCDR_CD_29::VARCHAR,
    CPT_PRCDR_CD_30::VARCHAR,
    CPT_PRCDR_CD_31::VARCHAR,
    CPT_PRCDR_CD_32::VARCHAR,
    CPT_PRCDR_CD_33::VARCHAR,
    CPT_PRCDR_CD_34::VARCHAR,
    CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
    replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
    replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
FROM read_csv('~/data/syh_dr/syhdr_commercial_inpatient_2016.CSV', header=True, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_inpatient_2016.sql
```
-- SQL model for syhdr_medicaid_inpatient_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    FACILITY_ID::UBIGINT AS FACILITY_ID,
    CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
    AT_SPCLTY::VARCHAR AS AT_SPCLTY,
    SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
    SRVC_END_DATE::DATE AS SRVC_END_DATE,
    LOS::UINTEGER AS LOS,
    ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
    TOB_CD::VARCHAR AS TOB_CD,
    CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
    DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
    PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
    PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
    ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
    ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
    ICD_DX_CD_3::VARCHAR,
    ICD_DX_CD_4::VARCHAR,
    ICD_DX_CD_5::VARCHAR,
    ICD_DX_CD_6::VARCHAR,
    ICD_DX_CD_7::VARCHAR,
    ICD_DX_CD_8::VARCHAR,
    ICD_DX_CD_9::VARCHAR,
    ICD_DX_CD_10::VARCHAR,
    ICD_DX_CD_11::VARCHAR,
    ICD_DX_CD_12::VARCHAR,
    ICD_DX_CD_13::VARCHAR,
    ICD_DX_CD_14::VARCHAR,
    ICD_DX_CD_15::VARCHAR,
    ICD_DX_CD_16::VARCHAR,
    ICD_DX_CD_17::VARCHAR,
    ICD_DX_CD_18::VARCHAR,
    ICD_DX_CD_19::VARCHAR,
    ICD_DX_CD_20::VARCHAR,
    ICD_DX_CD_21::VARCHAR,
    ICD_DX_CD_22::VARCHAR,
    ICD_DX_CD_23::VARCHAR,
    ICD_DX_CD_24::VARCHAR,
    ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
    ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
    ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
    ICD_PRCDR_CD_3::VARCHAR,
    ICD_PRCDR_CD_4::VARCHAR,
    ICD_PRCDR_CD_5::VARCHAR,
    ICD_PRCDR_CD_6::VARCHAR,
    ICD_PRCDR_CD_7::VARCHAR,
    ICD_PRCDR_CD_8::VARCHAR,
    ICD_PRCDR_CD_9::VARCHAR,
    ICD_PRCDR_CD_10::VARCHAR,
    ICD_PRCDR_CD_11::VARCHAR,
    ICD_PRCDR_CD_12::VARCHAR,
    ICD_PRCDR_CD_13::VARCHAR,
    ICD_PRCDR_CD_14::VARCHAR,
    ICD_PRCDR_CD_15::VARCHAR,
    ICD_PRCDR_CD_16::VARCHAR,
    ICD_PRCDR_CD_17::VARCHAR,
    ICD_PRCDR_CD_18::VARCHAR,
    ICD_PRCDR_CD_19::VARCHAR,
    ICD_PRCDR_CD_20::VARCHAR,
    ICD_PRCDR_CD_21::VARCHAR,
    ICD_PRCDR_CD_22::VARCHAR,
    ICD_PRCDR_CD_23::VARCHAR,
    ICD_PRCDR_CD_24::VARCHAR,
    ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
    CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
    CPT_PRCDR_CD_2::VARCHAR,
    CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
    CPT_PRCDR_CD_4::VARCHAR,
    CPT_PRCDR_CD_5::VARCHAR,
    CPT_PRCDR_CD_6::VARCHAR,
    CPT_PRCDR_CD_7::VARCHAR,
    CPT_PRCDR_CD_8::VARCHAR,
    CPT_PRCDR_CD_9::VARCHAR,
    CPT_PRCDR_CD_10::VARCHAR,
    CPT_PRCDR_CD_11::VARCHAR,
    CPT_PRCDR_CD_12::VARCHAR,
    CPT_PRCDR_CD_13::VARCHAR,
    CPT_PRCDR_CD_14::VARCHAR,
    CPT_PRCDR_CD_15::VARCHAR,
    CPT_PRCDR_CD_16::VARCHAR,
    CPT_PRCDR_CD_17::VARCHAR,
    CPT_PRCDR_CD_18::VARCHAR,
    CPT_PRCDR_CD_19::VARCHAR,
    CPT_PRCDR_CD_20::VARCHAR,
    CPT_PRCDR_CD_21::VARCHAR,
    CPT_PRCDR_CD_22::VARCHAR,
    CPT_PRCDR_CD_23::VARCHAR,
    CPT_PRCDR_CD_24::VARCHAR,
    CPT_PRCDR_CD_25::VARCHAR,
    CPT_PRCDR_CD_26::VARCHAR,
    CPT_PRCDR_CD_27::VARCHAR,
    CPT_PRCDR_CD_28::VARCHAR,
    CPT_PRCDR_CD_29::VARCHAR,
    CPT_PRCDR_CD_30::VARCHAR,
    CPT_PRCDR_CD_31::VARCHAR,
    CPT_PRCDR_CD_32::VARCHAR,
    CPT_PRCDR_CD_33::VARCHAR,
    CPT_PRCDR_CD_34::VARCHAR,
    CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
    replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
    replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
FROM read_csv('~/data/syh_dr/syhdr_medicaid_inpatient_2016.CSV', header=True, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
```
Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_pharmacy_2016.sql
```
-- SQL model for syhdr_commercial_pharmacy_2016.CSV
{{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}

SELECT
    PERSON_ID::UBIGINT AS PERSON_ID,
    PERSON_WGHT::NUMERIC AS PERSON_WGHT,
    PHMCY_CLM_NUM::NUMERIC AS PHMCY_CLM_NUM,
    CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
    LINE_NBR::VARCHAR,
    FILL_DT::DATE AS FILL_DT,
    SYNTHETIC_DRUG_ID::VARCHAR,
    GENERIC_DRUG_NAME::VARCHAR AS GENERIC_DRUG_NAME,
    replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
    replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
FROM read_csv('~/data/syh_dr/syhdr_commercial_pharmacy_2016.CSV', header=True, null_padding=true)
```
  2218. Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_person_2016.sql
  2219. ```
  2220. -- SQL model for syhdr_medicare_person_2016.CSV
  2221. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2222.  
  2223. SELECT
  2224. PERSON_ID::UBIGINT AS PERSON_ID,
  2225. PERSON_WGHT::NUMERIC AS PERSON_WGHT,
  2226. AGE_LOW::NUMERIC AS AGE_LOW,
  2227. AGE_HIGH::NUMERIC AS AGE_HIGH,
  2228. SEX_IDENT_CD::VARCHAR AS SEX_IDENT_CD,
  2229. RACE_CD::VARCHAR AS RACE_CD,
  2230. STATE_CD::VARCHAR AS STATE_CD,
  2231. COUNTY_FIPS_CD::VARCHAR AS COUNTY_FIPS_CD,
  2232. ZIP_CD::VARCHAR AS ZIP_CD,
  2233. RSN_ENRLMT_CD::VARCHAR AS RSN_ENRLMT_CD,
  2234. MDCR_ENTLMT_IND_1::VARCHAR AS MDCR_ENTLMT_IND_1,
  2235. MDCR_ENTLMT_IND_2::VARCHAR,
  2236. MDCR_ENTLMT_IND_3::VARCHAR,
  2237. MDCR_ENTLMT_IND_4::VARCHAR,
  2238. MDCR_ENTLMT_IND_5::VARCHAR,
  2239. MDCR_ENTLMT_IND_6::VARCHAR,
  2240. MDCR_ENTLMT_IND_7::VARCHAR,
  2241. MDCR_ENTLMT_IND_8::VARCHAR,
  2242. MDCR_ENTLMT_IND_9::VARCHAR,
  2243. MDCR_ENTLMT_IND_10::VARCHAR,
  2244. MDCR_ENTLMT_IND_11::VARCHAR,
  2245. MDCR_ENTLMT_IND_12::VARCHAR AS MDCR_ENTLMT_IND_12,
  2246. MDCR_HMO_CVRG_1::NUMERIC AS MDCR_HMO_CVRG_1,
  2247. MDCR_HMO_CVRG_2::VARCHAR,
  2248. MDCR_HMO_CVRG_3::VARCHAR,
  2249. MDCR_HMO_CVRG_4::VARCHAR,
  2250. MDCR_HMO_CVRG_5::VARCHAR,
  2251. MDCR_HMO_CVRG_6::VARCHAR,
  2252. MDCR_HMO_CVRG_7::VARCHAR,
  2253. MDCR_HMO_CVRG_8::VARCHAR,
  2254. MDCR_HMO_CVRG_9::VARCHAR,
  2255. MDCR_HMO_CVRG_10::VARCHAR,
  2256. MDCR_HMO_CVRG_11::VARCHAR,
  2257. MDCR_HMO_CVRG_12::NUMERIC AS MDCR_HMO_CVRG_12,
  2258. PHRMCY_CVRG_1::NUMERIC AS PHRMCY_CVRG_1,
  2259. PHRMCY_CVRG_2::VARCHAR,
  2260. PHRMCY_CVRG_3::VARCHAR,
  2261. PHRMCY_CVRG_4::VARCHAR,
  2262. PHRMCY_CVRG_5::VARCHAR,
  2263. PHRMCY_CVRG_6::VARCHAR,
  2264. PHRMCY_CVRG_7::VARCHAR,
  2265. PHRMCY_CVRG_8::VARCHAR,
  2266. PHRMCY_CVRG_9::VARCHAR,
  2267. PHRMCY_CVRG_10::VARCHAR,
  2268. PHRMCY_CVRG_11::VARCHAR,
  2269. PHRMCY_CVRG_12::NUMERIC AS PHRMCY_CVRG_12,
  2270. DUAL_ELGBL_1::NUMERIC AS DUAL_ELGBL_1,
  2271. DUAL_ELGBL_2::VARCHAR,
  2272. DUAL_ELGBL_3::VARCHAR,
  2273. DUAL_ELGBL_4::VARCHAR,
  2274. DUAL_ELGBL_5::VARCHAR,
  2275. DUAL_ELGBL_6::VARCHAR,
  2276. DUAL_ELGBL_7::VARCHAR,
  2277. DUAL_ELGBL_8::VARCHAR,
  2278. DUAL_ELGBL_9::VARCHAR,
  2279. DUAL_ELGBL_10::VARCHAR,
  2280. DUAL_ELGBL_11::VARCHAR,
  2281. DUAL_ELGBL_12::NUMERIC AS DUAL_ELGBL_12
2282. FROM read_csv('~/data/syh_dr/syhdr_medicare_person_2016.CSV', header=True, null_padding=true)
```
  2283. Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_pharmacy_2016.sql
  2284. ```
  2285. -- SQL model for syhdr_medicare_pharmacy_2016.CSV
  2286. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2287.  
  2288. SELECT
  2289. PERSON_ID::UBIGINT AS PERSON_ID,
  2290. PERSON_WGHT::NUMERIC AS PERSON_WGHT,
  2291. PHMCY_CLM_NUM::NUMERIC AS PHMCY_CLM_NUM,
  2292. CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
  2293. LINE_NBR::VARCHAR,
  2294. FILL_DT::DATE AS FILL_DT,
  2295. SYNTHETIC_DRUG_ID::VARCHAR,
  2296. GENERIC_DRUG_NAME::VARCHAR AS GENERIC_DRUG_NAME,
  2297. replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
  2298. replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
2299. FROM read_csv('~/data/syh_dr/syhdr_medicare_pharmacy_2016.CSV', header=True, null_padding=true)
```
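The pharmacy and outpatient models above strip `$` and `,` from the payment columns before casting to FLOAT, because the raw CSVs store amounts as formatted currency strings. A minimal Python sketch of the same cleanup (the function name `clean_currency` is illustrative, not part of the project):

```python
def clean_currency(raw: str) -> float:
    """Mirror the SQL replace(replace(x, '$', ''), ',', '')::FLOAT cast."""
    return float(raw.replace("$", "").replace(",", ""))

# "$1,234.56" -> 1234.56
```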
  2300. Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_outpatient_2016.sql
  2301. ```
  2302. -- SQL model for syhdr_medicare_outpatient_2016.CSV
  2303. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2304.  
  2305. SELECT
  2306. PERSON_ID::UBIGINT AS PERSON_ID,
  2307. PERSON_WGHT::NUMERIC AS PERSON_WGHT,
  2308. FACILITY_ID::UBIGINT AS FACILITY_ID,
  2309. CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
  2310. AT_SPCLTY::VARCHAR AS AT_SPCLTY,
  2311. SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
  2312. SRVC_END_DATE::DATE AS SRVC_END_DATE,
  2313. LOS::UINTEGER AS LOS,
  2314. ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
  2315. TOB_CD::VARCHAR AS TOB_CD,
  2316. CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
  2317. DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
  2318. PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
  2319. PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
  2320. ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
  2321. ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
  2322. ICD_DX_CD_3::VARCHAR,
  2323. ICD_DX_CD_4::VARCHAR,
  2324. ICD_DX_CD_5::VARCHAR,
  2325. ICD_DX_CD_6::VARCHAR,
  2326. ICD_DX_CD_7::VARCHAR,
  2327. ICD_DX_CD_8::VARCHAR,
  2328. ICD_DX_CD_9::VARCHAR,
  2329. ICD_DX_CD_10::VARCHAR,
  2330. ICD_DX_CD_11::VARCHAR,
  2331. ICD_DX_CD_12::VARCHAR,
  2332. ICD_DX_CD_13::VARCHAR,
  2333. ICD_DX_CD_14::VARCHAR,
  2334. ICD_DX_CD_15::VARCHAR,
  2335. ICD_DX_CD_16::VARCHAR,
  2336. ICD_DX_CD_17::VARCHAR,
  2337. ICD_DX_CD_18::VARCHAR,
  2338. ICD_DX_CD_19::VARCHAR,
  2339. ICD_DX_CD_20::VARCHAR,
  2340. ICD_DX_CD_21::VARCHAR,
  2341. ICD_DX_CD_22::VARCHAR,
  2342. ICD_DX_CD_23::VARCHAR,
  2343. ICD_DX_CD_24::VARCHAR,
  2344. ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
  2345. ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
  2346. ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
  2347. ICD_PRCDR_CD_3::VARCHAR,
  2348. ICD_PRCDR_CD_4::VARCHAR,
  2349. ICD_PRCDR_CD_5::VARCHAR,
  2350. ICD_PRCDR_CD_6::VARCHAR,
  2351. ICD_PRCDR_CD_7::VARCHAR,
  2352. ICD_PRCDR_CD_8::VARCHAR,
  2353. ICD_PRCDR_CD_9::VARCHAR,
  2354. ICD_PRCDR_CD_10::VARCHAR,
  2355. ICD_PRCDR_CD_11::VARCHAR,
  2356. ICD_PRCDR_CD_12::VARCHAR,
  2357. ICD_PRCDR_CD_13::VARCHAR,
  2358. ICD_PRCDR_CD_14::VARCHAR,
  2359. ICD_PRCDR_CD_15::VARCHAR,
  2360. ICD_PRCDR_CD_16::VARCHAR,
  2361. ICD_PRCDR_CD_17::VARCHAR,
  2362. ICD_PRCDR_CD_18::VARCHAR,
  2363. ICD_PRCDR_CD_19::VARCHAR,
  2364. ICD_PRCDR_CD_20::VARCHAR,
  2365. ICD_PRCDR_CD_21::VARCHAR,
  2366. ICD_PRCDR_CD_22::VARCHAR,
  2367. ICD_PRCDR_CD_23::VARCHAR,
  2368. ICD_PRCDR_CD_24::VARCHAR,
  2369. ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
  2370. CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
  2371. CPT_PRCDR_CD_2::VARCHAR,
  2372. CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
  2373. CPT_PRCDR_CD_4::VARCHAR,
  2374. CPT_PRCDR_CD_5::VARCHAR,
  2375. CPT_PRCDR_CD_6::VARCHAR,
  2376. CPT_PRCDR_CD_7::VARCHAR,
  2377. CPT_PRCDR_CD_8::VARCHAR,
  2378. CPT_PRCDR_CD_9::VARCHAR,
  2379. CPT_PRCDR_CD_10::VARCHAR,
  2380. CPT_PRCDR_CD_11::VARCHAR,
  2381. CPT_PRCDR_CD_12::VARCHAR,
  2382. CPT_PRCDR_CD_13::VARCHAR,
  2383. CPT_PRCDR_CD_14::VARCHAR,
  2384. CPT_PRCDR_CD_15::VARCHAR,
  2385. CPT_PRCDR_CD_16::VARCHAR,
  2386. CPT_PRCDR_CD_17::VARCHAR,
  2387. CPT_PRCDR_CD_18::VARCHAR,
  2388. CPT_PRCDR_CD_19::VARCHAR,
  2389. CPT_PRCDR_CD_20::VARCHAR,
  2390. CPT_PRCDR_CD_21::VARCHAR,
  2391. CPT_PRCDR_CD_22::VARCHAR,
  2392. CPT_PRCDR_CD_23::VARCHAR,
  2393. CPT_PRCDR_CD_24::VARCHAR,
  2394. CPT_PRCDR_CD_25::VARCHAR,
  2395. CPT_PRCDR_CD_26::VARCHAR,
  2396. CPT_PRCDR_CD_27::VARCHAR,
  2397. CPT_PRCDR_CD_28::VARCHAR,
  2398. CPT_PRCDR_CD_29::VARCHAR,
  2399. CPT_PRCDR_CD_30::VARCHAR,
  2400. CPT_PRCDR_CD_31::VARCHAR,
  2401. CPT_PRCDR_CD_32::VARCHAR,
  2402. CPT_PRCDR_CD_33::VARCHAR,
  2403. CPT_PRCDR_CD_34::VARCHAR,
  2404. CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
  2405. replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
  2406. replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
2407. FROM read_csv('~/data/syh_dr/syhdr_medicare_outpatient_2016.CSV', header=True, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
```
  2408. Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_outpatient_2016.sql
  2409. ```
  2410. -- SQL model for syhdr_medicaid_outpatient_2016.CSV
  2411. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2412.  
  2413. SELECT
  2414. PERSON_ID::UBIGINT AS PERSON_ID,
  2415. PERSON_WGHT::NUMERIC AS PERSON_WGHT,
  2416. FACILITY_ID::UBIGINT AS FACILITY_ID,
  2417. CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
  2418. AT_SPCLTY::VARCHAR AS AT_SPCLTY,
  2419. SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
  2420. SRVC_END_DATE::DATE AS SRVC_END_DATE,
  2421. LOS::UINTEGER AS LOS,
  2422. ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
  2423. TOB_CD::VARCHAR AS TOB_CD,
  2424. CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
  2425. DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
  2426. PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
  2427. PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
  2428. ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
  2429. ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
  2430. ICD_DX_CD_3::VARCHAR,
  2431. ICD_DX_CD_4::VARCHAR,
  2432. ICD_DX_CD_5::VARCHAR,
  2433. ICD_DX_CD_6::VARCHAR,
  2434. ICD_DX_CD_7::VARCHAR,
  2435. ICD_DX_CD_8::VARCHAR,
  2436. ICD_DX_CD_9::VARCHAR,
  2437. ICD_DX_CD_10::VARCHAR,
  2438. ICD_DX_CD_11::VARCHAR,
  2439. ICD_DX_CD_12::VARCHAR,
  2440. ICD_DX_CD_13::VARCHAR,
  2441. ICD_DX_CD_14::VARCHAR,
  2442. ICD_DX_CD_15::VARCHAR,
  2443. ICD_DX_CD_16::VARCHAR,
  2444. ICD_DX_CD_17::VARCHAR,
  2445. ICD_DX_CD_18::VARCHAR,
  2446. ICD_DX_CD_19::VARCHAR,
  2447. ICD_DX_CD_20::VARCHAR,
  2448. ICD_DX_CD_21::VARCHAR,
  2449. ICD_DX_CD_22::VARCHAR,
  2450. ICD_DX_CD_23::VARCHAR,
  2451. ICD_DX_CD_24::VARCHAR,
  2452. ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
  2453. ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
  2454. ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
  2455. ICD_PRCDR_CD_3::VARCHAR,
  2456. ICD_PRCDR_CD_4::VARCHAR,
  2457. ICD_PRCDR_CD_5::VARCHAR,
  2458. ICD_PRCDR_CD_6::VARCHAR,
  2459. ICD_PRCDR_CD_7::VARCHAR,
  2460. ICD_PRCDR_CD_8::VARCHAR,
  2461. ICD_PRCDR_CD_9::VARCHAR,
  2462. ICD_PRCDR_CD_10::VARCHAR,
  2463. ICD_PRCDR_CD_11::VARCHAR,
  2464. ICD_PRCDR_CD_12::VARCHAR,
  2465. ICD_PRCDR_CD_13::VARCHAR,
  2466. ICD_PRCDR_CD_14::VARCHAR,
  2467. ICD_PRCDR_CD_15::VARCHAR,
  2468. ICD_PRCDR_CD_16::VARCHAR,
  2469. ICD_PRCDR_CD_17::VARCHAR,
  2470. ICD_PRCDR_CD_18::VARCHAR,
  2471. ICD_PRCDR_CD_19::VARCHAR,
  2472. ICD_PRCDR_CD_20::VARCHAR,
  2473. ICD_PRCDR_CD_21::VARCHAR,
  2474. ICD_PRCDR_CD_22::VARCHAR,
  2475. ICD_PRCDR_CD_23::VARCHAR,
  2476. ICD_PRCDR_CD_24::VARCHAR,
  2477. ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
  2478. CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
  2479. CPT_PRCDR_CD_2::VARCHAR,
  2480. CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
  2481. CPT_PRCDR_CD_4::VARCHAR,
  2482. CPT_PRCDR_CD_5::VARCHAR,
  2483. CPT_PRCDR_CD_6::VARCHAR,
  2484. CPT_PRCDR_CD_7::VARCHAR,
  2485. CPT_PRCDR_CD_8::VARCHAR,
  2486. CPT_PRCDR_CD_9::VARCHAR,
  2487. CPT_PRCDR_CD_10::VARCHAR,
  2488. CPT_PRCDR_CD_11::VARCHAR,
  2489. CPT_PRCDR_CD_12::VARCHAR,
  2490. CPT_PRCDR_CD_13::VARCHAR,
  2491. CPT_PRCDR_CD_14::VARCHAR,
  2492. CPT_PRCDR_CD_15::VARCHAR,
  2493. CPT_PRCDR_CD_16::VARCHAR,
  2494. CPT_PRCDR_CD_17::VARCHAR,
  2495. CPT_PRCDR_CD_18::VARCHAR,
  2496. CPT_PRCDR_CD_19::VARCHAR,
  2497. CPT_PRCDR_CD_20::VARCHAR,
  2498. CPT_PRCDR_CD_21::VARCHAR,
  2499. CPT_PRCDR_CD_22::VARCHAR,
  2500. CPT_PRCDR_CD_23::VARCHAR,
  2501. CPT_PRCDR_CD_24::VARCHAR,
  2502. CPT_PRCDR_CD_25::VARCHAR,
  2503. CPT_PRCDR_CD_26::VARCHAR,
  2504. CPT_PRCDR_CD_27::VARCHAR,
  2505. CPT_PRCDR_CD_28::VARCHAR,
  2506. CPT_PRCDR_CD_29::VARCHAR,
  2507. CPT_PRCDR_CD_30::VARCHAR,
  2508. CPT_PRCDR_CD_31::VARCHAR,
  2509. CPT_PRCDR_CD_32::VARCHAR,
  2510. CPT_PRCDR_CD_33::VARCHAR,
  2511. CPT_PRCDR_CD_34::VARCHAR,
  2512. CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
  2513. replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
  2514. replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
2515. FROM read_csv('~/data/syh_dr/syhdr_medicaid_outpatient_2016.CSV', header=True, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
```
  2516. Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_provider_2016.sql
  2517. ```
  2518. -- SQL model for syhdr_medicaid_provider_2016.csv
  2519. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2520.  
  2521. SELECT
  2522. Facility_ID::VARCHAR,
  2523. Prvdr_Ctgry_Cd::VARCHAR,
  2524. Prvdr_Ownrshp_Cd::VARCHAR,
  2525. Prvdr_Prtcptn_Cd::VARCHAR
2526. FROM read_csv('~/data/syh_dr/syhdr_medicaid_provider_2016.csv', header=True, null_padding=true)
```
  2527. Relative file path: ahrq.gov/sdoh/sdoh_zipcode.sql
  2528. ```
  2529. {{ config(
  2530. materialized = 'external',
  2531. location = '../data/sdoh_zipcode.parquet'
  2532. )
  2533. }}
  2534.  
  2535. {% set sdoh_zipcode_years = [
  2536. '2011', '2012', '2013', '2014', '2015',
  2537. '2016', '2017', '2018', '2019', '2020'] %}
  2538.  
  2539. {% set relations = [] %}
  2540. {% for sdoh_zipcode_year in sdoh_zipcode_years -%}
  2541. {% do relations.append(source('social_determinants_of_health', 'sdoh_zipcode_' ~ sdoh_zipcode_year)) %}
  2542. {% endfor %}
  2543.  
  2544. with union_unpivot as (
  2545.  
  2546. {% for relation in relations %}
  2547. unpivot {{ relation }}
  2548. on columns(* exclude (year, statefips, zipcode, zcta, state, region, territory, point_zip))
  2549. into
  2550. name survey_variable_name
  2551. value survey_score
  2552.  
  2553. {%- if not loop.last %} union all {% endif -%}
  2554. {% endfor %}
  2555. )
  2556.  
  2557. select
  2558. {{ dbt_utils.generate_surrogate_key(
  2559. ['year', 'zcta', 'zipcode']
  2560. ) }} as sdoh_zipcode_key,
  2561. *
  2562. from union_unpivot
  2563. ```
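The `sdoh_*` models key each long row with `dbt_utils.generate_surrogate_key`, which (roughly) casts each column to text, substitutes a sentinel for NULLs, joins the parts, and MD5-hashes the result. A simplified pure-Python sketch of that idea (the exact dbt_utils null sentinel and separator differ; `surrogate_key` is an illustrative name):

```python
import hashlib

def surrogate_key(*fields) -> str:
    # Approximation of dbt_utils.generate_surrogate_key: stringify each
    # field (empty string for None here), join with '-', then MD5-hash.
    parts = ["" if f is None else str(f) for f in fields]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()
```

The hash is deterministic, so the same (year, zcta, zipcode) tuple always maps to the same key across runs.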
  2564. Relative file path: ahrq.gov/sdoh/sdoh_county.sql
  2565. ```
  2566. {{ config(
  2567. materialized = 'external',
  2568. location = '../data/sdoh_county.parquet'
  2569. )
  2570. }}
  2571.  
  2572. {% set sdoh_county_years = [
  2573. '2009', '2010', '2011', '2012', '2013', '2014',
  2574. '2015', '2016', '2017', '2018', '2019', '2020'] %}
  2575.  
  2576. {% set relations = [] %}
  2577. {% for sdoh_county_year in sdoh_county_years -%}
  2578. {% do relations.append(source('social_determinants_of_health', 'sdoh_county_' ~ sdoh_county_year)) %}
  2579. {% endfor %}
  2580.  
  2581. with union_unpivot as (
  2582.  
  2583. {% for relation in relations %}
  2584. unpivot {{ relation }}
  2585. on columns(* exclude (year, countyfips, statefips, state, county, region, territory))
  2586. into
  2587. name survey_variable_name
  2588. value survey_score
  2589.  
  2590. {% if not loop.last %} union all {% endif -%}
  2591. {% endfor %}
  2592. )
  2593.  
  2594. select
  2595. {{ dbt_utils.generate_surrogate_key(
  2596. ['year', 'countyfips', 'county']
  2597. ) }} as sdoh_county_key,
  2598. *
  2599. from union_unpivot
  2600. ```
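The county model (like the zipcode and tract ones) relies on DuckDB's `UNPIVOT ... INTO name ... value ...` to melt each wide survey table into (variable, score) pairs while keeping the geography columns. A small pure-Python sketch of the same reshaping on one row (function name `unpivot` is illustrative):

```python
def unpivot(row: dict, id_cols: set) -> list:
    # Melt one wide row into long rows: every non-id column becomes a
    # (survey_variable_name, survey_score) pair, ids are carried through.
    return [
        {**{k: row[k] for k in id_cols},
         "survey_variable_name": col,
         "survey_score": val}
        for col, val in row.items() if col not in id_cols
    ]
```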
  2601. Relative file path: ahrq.gov/sdoh/sdoh_tract.sql
  2602. ```
  2603. {{ config(
  2604. materialized = 'external',
  2605. location = '../data/sdoh_tract.parquet') }}
  2606.  
  2607. {% set sdoh_tract_years = [
  2608. '2009', '2010', '2011', '2012', '2013', '2014',
  2609. '2015', '2016', '2017', '2018', '2019', '2020'] %}
  2610.  
  2611. {% set relations = [] %}
  2612. {% for sdoh_tract_year in sdoh_tract_years -%}
  2613. {% do relations.append(source('social_determinants_of_health', 'sdoh_tract_' ~ sdoh_tract_year)) %}
  2614. {% endfor %}
  2615.  
  2616. with union_unpivot as (
  2617.  
  2618. {% for relation in relations %}
  2619. unpivot {{ relation }}
  2620. on columns(* exclude (year, tractfips, countyfips, statefips, state, county, region, territory))
  2621. into
  2622. name survey_variable_name
  2623. value survey_score
  2624.  
  2625. {% if not loop.last %} union all {% endif -%}
  2626. {% endfor %}
  2627. )
  2628.  
  2629. select
  2630. {{ dbt_utils.generate_surrogate_key(
  2631. ['year', 'tractfips', 'countyfips', 'county']
2632. ) }} as sdoh_tract_key,
  2633. *
2634. from union_unpivot
```
  2635. Relative file path: bls.gov/download_consumer_price_index.py
  2636. ```
import pandas as pd
import os
import requests

def model(dbt, session):
    # Locate (or fetch) the BLS Consumer Price Index workbook under output_path
    output_path = dbt.config.get('output_path')
    base_path = os.path.expanduser(output_path)
    excel_path = os.path.join(base_path, "r-cpi-u-rs-allitems.xlsx")
    cpi_url = "https://www.bls.gov/cpi/research-series/r-cpi-u-rs-allitems.xlsx"
    # Download once and cache locally as r-cpi-u-rs-allitems.xlsx
    if not os.path.exists(excel_path):
        response = requests.get(cpi_url, timeout=60)
        response.raise_for_status()
        with open(excel_path, "wb") as f:
            f.write(response.content)
    # The first five rows are title boilerplate; keep only YEAR and AVG
    consumer_price_index_df = pd.read_excel(excel_path, skiprows=5, usecols=['YEAR', 'AVG'])
    return consumer_price_index_df
```
  2650. Relative file path: bls.gov/consumer_price_index.sql
  2651. ```
  2652. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2653.  
  2654. SELECT
  2655. YEAR AS year,
  2656. "AVG" AS consumer_price_index
2657. FROM {{ ref('download_consumer_price_index') }}
```
  2658. Relative file path: mitre.org/synthea.sql
  2659. ```
  2660. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2661.  
  2662. WITH patient_data AS (
  2663. SELECT
  2664. json_extract_string(resource, '$.id') AS patient_id,
  2665. json_transform(json_extract(resource, '$.identifier'), '[{"type":{"text":"VARCHAR"},"value":"VARCHAR"}]') AS identifiers,
  2666. json_transform(json_extract(resource, '$.extension'), '[{"url":"VARCHAR","extension":[{"valueCoding":{"display":"VARCHAR"}}]}]') AS extensions,
  2667. json_transform(json_extract(resource, '$.address'), '[{"city":"VARCHAR","state":"VARCHAR","postalCode":"VARCHAR","country":"VARCHAR","line":["VARCHAR"]}]') AS addresses,
  2668. json_extract_string(resource, '$.name[0].given[0]') AS first_name,
  2669. json_extract_string(resource, '$.name[0].family') AS last_name,
  2670. json_extract_string(resource, '$.name[0].given[1]') AS first_name_alt,
  2671. json_extract_string(resource, '$.name[0].prefix[0]') AS patient_name_prefix,
  2672. json_extract_string(resource, '$.name[0].use') AS patient_name_use,
  2673. json_extract_string(resource, '$.name[1].given[0]') AS first_name2,
  2674. json_extract_string(resource, '$.name[1].given[1]') AS first_name_alt2,
  2675. json_extract_string(resource, '$.name[1].family') AS last_name2,
  2676. json_extract_string(resource, '$.name[1].prefix[0]') AS patient_name_prefix2,
  2677. json_extract_string(resource, '$.gender') AS gender,
  2678. json_extract_string(resource, '$.birthDate') AS birth_date,
  2679. json_extract_string(resource, '$.telecom[0].system') AS telecom_system,
  2680. json_extract_string(resource, '$.telecom[0].value') AS telecom_value,
  2681. json_extract_string(resource, '$.telecom[0].use') AS telecom_use,
  2682. json_extract_string(resource, '$.maritalStatus.text') AS patient_marital_status,
  2683. json_extract_string(resource, '$.deceasedDateTime') AS patient_deceased_date,
  2684. json_extract_string(resource, '$.multipleBirthBoolean') AS patient_multiple_birth_ind
  2685. FROM (
  2686. SELECT json_extract(json(unnested_entry.entry), '$.resource') AS resource
  2687. FROM read_json_auto('./../data/Gudrun69_Shaunna800_Goyette777_d5e33bd1-960e-bcf4-e5f9-9a4afc6d5a5e.json', columns={'entry': 'JSON[]'}) AS entries,
  2688. unnest(entries.entry) AS unnested_entry
  2689. WHERE json_extract_string(json(unnested_entry.entry), '$.resource.resourceType') = 'Patient'
  2690. ) AS patient_resource
  2691. ),
  2692. identifier_unnested AS (
  2693. SELECT
  2694. patient_id,
  2695. unnest(identifiers) AS identifier
  2696. FROM patient_data
  2697. ),
  2698. identifier_aggregated AS (
  2699. SELECT
  2700. patient_id,
  2701. MIN(CASE WHEN identifier.type.text = 'Medical Record Number' THEN identifier.value END) AS patient_mrn,
  2702. MIN(CASE WHEN identifier.type.text = 'Social Security Number' THEN identifier.value END) AS patient_ssn,
  2703. MIN(CASE WHEN identifier.type.text = 'Driver''s license number' THEN identifier.value END) AS patient_drivers_license_num,
  2704. MIN(CASE WHEN identifier.type.text = 'Passport Number' THEN identifier.value END) AS patient_passport_num
  2705. FROM identifier_unnested
  2706. GROUP BY patient_id
  2707. ),
  2708. extension_unnested AS (
  2709. SELECT
  2710. patient_id,
  2711. unnest(extensions) AS extension
  2712. FROM patient_data
  2713. ),
  2714. extension_aggregated AS (
  2715. SELECT
  2716. patient_id,
  2717. MIN(CASE WHEN extension.url = 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-race' THEN extension.extension[1].valueCoding.display END) AS patient_core_race,
  2718. MIN(CASE WHEN extension.url = 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity' THEN extension.extension[1].valueCoding.display END) AS patient_core_ethnicity
  2719. FROM extension_unnested
  2720. GROUP BY patient_id
  2721. ),
  2722. addresses_unnested AS (
  2723. SELECT
  2724. patient_id,
  2725. unnest(addresses) AS address
  2726. FROM patient_data
  2727. ),
  2728. combined_data AS (
  2729. SELECT
  2730. p.patient_id,
  2731. p.first_name,
  2732. p.last_name,
  2733. i.patient_mrn,
  2734. i.patient_ssn,
  2735. e.patient_core_race,
  2736. e.patient_core_ethnicity,
  2737. a.address.line[1],
  2738. a.address.city,
  2739. a.address.state,
  2740. a.address.postalCode,
  2741. a.address.country
  2742. FROM patient_data p
  2743. JOIN identifier_aggregated i on i.patient_id = p.patient_id
  2744. JOIN extension_aggregated e on e.patient_id = p.patient_id
  2745. JOIN addresses_unnested a on a.patient_id = p.patient_id
  2746. )
  2747.  
  2748. select * from combined_data
  2749.  
  2750. ```
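`synthea.sql` walks a Synthea FHIR Bundle with DuckDB's `json_extract_string` path expressions to pull patient demographics out of each `Patient` resource. The same traversal on an in-memory Bundle dict, as a hedged Python sketch (function name `extract_patients` and the field subset are illustrative):

```python
def extract_patients(bundle: dict) -> list:
    # Mirror the SQL: filter entries to resourceType == 'Patient', then read
    # $.id, $.name[0].given[0], $.name[0].family, $.gender, $.birthDate.
    patients = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Patient":
            continue
        name = (resource.get("name") or [{}])[0]
        patients.append({
            "patient_id": resource.get("id"),
            "first_name": (name.get("given") or [None])[0],
            "last_name": name.get("family"),
            "gender": resource.get("gender"),
            "birth_date": resource.get("birthDate"),
        })
    return patients
```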
  2751. Relative file path: figures/insurance_plan_payment_histogram_inflation_adjusted.sql
  2752. ```
  2753. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2754.  
  2755. WITH cpi_adjustment AS (
  2756. SELECT
  2757. year,
  2758. consumer_price_index
  2759. FROM {{ ref('consumer_price_index') }}
  2760. ),
  2761. latest_cpi AS (
  2762. SELECT
  2763. MAX(consumer_price_index) AS cpi_2022
  2764. FROM cpi_adjustment
  2765. WHERE year = 2022
  2766. ),
  2767. inflation_adjustment_factors AS (
  2768. SELECT
  2769. 2016 AS year,
  2770. (lc.cpi_2022 / ca.consumer_price_index) AS adjustment_factor_to_2022
  2771. FROM cpi_adjustment ca
  2772. CROSS JOIN latest_cpi lc
  2773. WHERE ca.year = 2016
  2774. ),
  2775. commercial_data AS (
  2776. SELECT
  2777. PLAN_PMT_AMT * iaf.adjustment_factor_to_2022 AS Payment,
  2778. COUNT(*) AS count,
  2779. 'Commercial' AS Insurance
  2780. FROM read_parquet('/Users/me/data/syh_dr/syhdr_commercial_inpatient_2016.parquet') cd
  2781. JOIN inflation_adjustment_factors iaf ON 1 = 1
  2782. GROUP BY PLAN_PMT_AMT, iaf.adjustment_factor_to_2022
  2783. ),
  2784. medicaid_data AS (
  2785. SELECT
  2786. PLAN_PMT_AMT * iaf.adjustment_factor_to_2022 AS Payment,
  2787. COUNT(*) AS count,
  2788. 'Medicaid' AS Insurance
  2789. FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicaid_inpatient_2016.parquet') md
  2790. JOIN inflation_adjustment_factors iaf ON 1 = 1
  2791. GROUP BY PLAN_PMT_AMT, iaf.adjustment_factor_to_2022
  2792. ),
  2793. medicare_data AS (
  2794. SELECT
  2795. PLAN_PMT_AMT * iaf.adjustment_factor_to_2022 AS Payment,
  2796. COUNT(*) AS count,
  2797. 'Medicare' AS Insurance
  2798. FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicare_inpatient_2016.parquet') mcd
  2799. JOIN inflation_adjustment_factors iaf ON 1 = 1
  2800. GROUP BY PLAN_PMT_AMT, iaf.adjustment_factor_to_2022
  2801. ),
  2802. combined_data AS (
  2803. SELECT * FROM commercial_data
  2804. UNION ALL
  2805. SELECT * FROM medicaid_data
  2806. UNION ALL
  2807. SELECT * FROM medicare_data
  2808. )
  2809. SELECT
  2810. Payment,
  2811. count,
  2812. Insurance
  2813. FROM combined_data
2814. ORDER BY Insurance, Payment
```
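The `inflation_adjustment_factors` CTE above restates 2016 payments in 2022 dollars by multiplying each amount by the ratio of the two years' CPI values. The arithmetic in isolation, as a small sketch (CPI values in the test are made up for illustration, not real BLS figures):

```python
def adjustment_factor(cpi_by_year: dict, from_year: int, to_year: int) -> float:
    # Same ratio as the SQL (lc.cpi_2022 / ca.consumer_price_index):
    # amount_from_year * factor restates the amount in to_year dollars.
    return cpi_by_year[to_year] / cpi_by_year[from_year]
```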
  2815. Relative file path: figures/insurance_plan_payment_histogram.sql
  2816. ```
  2817. {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
  2818.  
  2819. WITH commercial_data AS (
  2820. SELECT
  2821. PLAN_PMT_AMT AS Payment,
  2822. COUNT(*) AS count,
  2823. 'Commercial' AS Insurance
  2824. FROM read_parquet('/Users/me/data/syh_dr/syhdr_commercial_inpatient_2016.parquet')
  2825. GROUP BY PLAN_PMT_AMT
  2826. ),
  2827. medicaid_data AS (
  2828. SELECT
  2829. PLAN_PMT_AMT AS Payment,
  2830. COUNT(*) AS count,
  2831. 'Medicaid' AS Insurance
  2832. FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicaid_inpatient_2016.parquet')
  2833. GROUP BY PLAN_PMT_AMT
  2834. ),
  2835. medicare_data AS (
  2836. SELECT
  2837. PLAN_PMT_AMT AS Payment,
  2838. COUNT(*) AS count,
  2839. 'Medicare' AS Insurance
  2840. FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicare_inpatient_2016.parquet')
  2841. GROUP BY PLAN_PMT_AMT
  2842. ),
  2843. combined_data AS (
  2844. SELECT * FROM commercial_data
  2845. UNION ALL
  2846. SELECT * FROM medicaid_data
  2847. UNION ALL
  2848. SELECT * FROM medicare_data
  2849. )
  2850. SELECT
  2851. Payment,
  2852. count,
  2853. Insurance
  2854. FROM combined_data
2855. ORDER BY Insurance, Payment
```
  2856. Relative file path: config.yml
  2857. ```
  2858. version: 2
  2859.  
  2860. models:
  2861. - name: download_consumer_price_index
  2862. config:
  2863. data_path: "{{ var('data_path') }}"
  2864. output_path: "{{ var('output_path') }}"
  2865. - name: ahrq.gov
  2866. config:
  2867. data_path: "{{ var('data_path') }}"
  2868. output_path: "{{ var('output_path') }}"
  2869. - name: mitre.org
  2870. config:
  2871. data_path: "{{ var('data_path') }}"
  2872. output_path: "{{ var('output_path') }}"
  2873. ```
  2874. Relative file path: sources.yml
  2875. ```
  2876. version: 2
  2877.  
  2878. sources:
  2879. - name: syh_dr
  2880. description: Synthetic Healthcare data from AHRQ
  2881. tables:
  2882. - name: inpatient
  2883. description: Inpatient files from AHRQ.
  2884. meta:
  2885. external_location: >
  2886. read_csv('../data/syh_dr/syhdr_*_inpatient_2016.csv',
  2887. types = {
  2888. PERSON_ID: 'UBIGINT',
  2889. PERSON_WGHT: 'NUMERIC',
  2890. FACILITY_ID: 'UBIGINT',
  2891. CLM_CNTL_NUM: 'NUMERIC',
  2892. AT_SPCLTY: 'VARCHAR',
  2893. SRVC_BEG_DATE: 'DATE',
  2894. SRVC_END_DATE: 'DATE',
  2895. LOS: 'UINTEGER',
  2896. ADMSN_TYPE: 'VARCHAR',
  2897. TOB_CD: 'VARCHAR',
  2898. CLM_TYPE_CD: 'VARCHAR',
  2899. DSCHRG_STUS: 'VARCHAR',
  2900. PRMRY_DX_IMPUTED: 'VARCHAR',
  2901. PRMRY_DX_CD: 'VARCHAR',
  2902. ICD_DX_CD_1: 'VARCHAR',
  2903. ICD_DX_CD_2: 'VARCHAR',
  2904. ICD_DX_CD_3: 'VARCHAR',
  2905. ICD_DX_CD_4: 'VARCHAR',
  2906. ICD_DX_CD_5: 'VARCHAR',
  2907. ICD_DX_CD_6: 'VARCHAR',
  2908. ICD_DX_CD_7: 'VARCHAR',
  2909. ICD_DX_CD_8: 'VARCHAR',
  2910. ICD_DX_CD_9: 'VARCHAR',
  2911. ICD_DX_CD_10: 'VARCHAR',
  2912. ICD_DX_CD_11: 'VARCHAR',
  2913. ICD_DX_CD_12: 'VARCHAR',
  2914. ICD_DX_CD_13: 'VARCHAR',
  2915. ICD_DX_CD_14: 'VARCHAR',
  2916. ICD_DX_CD_15: 'VARCHAR',
  2917. ICD_DX_CD_16: 'VARCHAR',
  2918. ICD_DX_CD_17: 'VARCHAR',
  2919. ICD_DX_CD_18: 'VARCHAR',
  2920. ICD_DX_CD_19: 'VARCHAR',
  2921. ICD_DX_CD_20: 'VARCHAR',
  2922. ICD_DX_CD_21: 'VARCHAR',
  2923. ICD_DX_CD_22: 'VARCHAR',
  2924. ICD_DX_CD_23: 'VARCHAR',
  2925. ICD_DX_CD_24: 'VARCHAR',
  2926. ICD_DX_CD_25: 'VARCHAR',
  2927. ICD_PRCDR_CD_1: 'VARCHAR',
  2928. ICD_PRCDR_CD_2: 'VARCHAR',
  2929. ICD_PRCDR_CD_3: 'VARCHAR',
  2930. ICD_PRCDR_CD_4: 'VARCHAR',
  2931. ICD_PRCDR_CD_5: 'VARCHAR',
  2932. ICD_PRCDR_CD_6: 'VARCHAR',
  2933. ICD_PRCDR_CD_7: 'VARCHAR',
  2934. ICD_PRCDR_CD_8: 'VARCHAR',
  2935. ICD_PRCDR_CD_9: 'VARCHAR',
  2936. ICD_PRCDR_CD_10: 'VARCHAR',
  2937. ICD_PRCDR_CD_11: 'VARCHAR',
  2938. ICD_PRCDR_CD_12: 'VARCHAR',
  2939. ICD_PRCDR_CD_13: 'VARCHAR',
  2940. ICD_PRCDR_CD_14: 'VARCHAR',
  2941. ICD_PRCDR_CD_15: 'VARCHAR',
  2942. ICD_PRCDR_CD_16: 'VARCHAR',
  2943. ICD_PRCDR_CD_17: 'VARCHAR',
  2944. ICD_PRCDR_CD_18: 'VARCHAR',
  2945. ICD_PRCDR_CD_19: 'VARCHAR',
  2946. ICD_PRCDR_CD_20: 'VARCHAR',
  2947. ICD_PRCDR_CD_21: 'VARCHAR',
  2948. ICD_PRCDR_CD_22: 'VARCHAR',
  2949. ICD_PRCDR_CD_23: 'VARCHAR',
  2950. ICD_PRCDR_CD_24: 'VARCHAR',
  2951. ICD_PRCDR_CD_25: 'VARCHAR',
  2952. CPT_PRCDR_CD_1: 'VARCHAR',
  2953. CPT_PRCDR_CD_2: 'VARCHAR',
  2954. CPT_PRCDR_CD_3: 'VARCHAR',
  2955. CPT_PRCDR_CD_4: 'VARCHAR',
  2956. CPT_PRCDR_CD_5: 'VARCHAR',
  2957. CPT_PRCDR_CD_6: 'VARCHAR',
  2958. CPT_PRCDR_CD_7: 'VARCHAR',
  2959. CPT_PRCDR_CD_8: 'VARCHAR',
  2960. CPT_PRCDR_CD_9: 'VARCHAR',
  2961. CPT_PRCDR_CD_10: 'VARCHAR',
  2962. CPT_PRCDR_CD_11: 'VARCHAR',
  2963. CPT_PRCDR_CD_12: 'VARCHAR',
  2964. CPT_PRCDR_CD_13: 'VARCHAR',
  2965. CPT_PRCDR_CD_14: 'VARCHAR',
  2966. CPT_PRCDR_CD_15: 'VARCHAR',
  2967. CPT_PRCDR_CD_16: 'VARCHAR',
  2968. CPT_PRCDR_CD_17: 'VARCHAR',
  2969. CPT_PRCDR_CD_18: 'VARCHAR',
  2970. CPT_PRCDR_CD_19: 'VARCHAR',
  2971. CPT_PRCDR_CD_20: 'VARCHAR',
  2972. CPT_PRCDR_CD_21: 'VARCHAR',
  2973. CPT_PRCDR_CD_22: 'VARCHAR',
  2974. CPT_PRCDR_CD_23: 'VARCHAR',
  2975. CPT_PRCDR_CD_24: 'VARCHAR',
  2976. CPT_PRCDR_CD_25: 'VARCHAR',
  2977. CPT_PRCDR_CD_26: 'VARCHAR',
  2978. CPT_PRCDR_CD_27: 'VARCHAR',
  2979. CPT_PRCDR_CD_28: 'VARCHAR',
  2980. CPT_PRCDR_CD_29: 'VARCHAR',
  2981. CPT_PRCDR_CD_30: 'VARCHAR',
  2982. CPT_PRCDR_CD_31: 'VARCHAR',
  2983. CPT_PRCDR_CD_32: 'VARCHAR',
  2984. CPT_PRCDR_CD_33: 'VARCHAR',
  2985. CPT_PRCDR_CD_34: 'VARCHAR',
  2986. CPT_PRCDR_CD_35: 'VARCHAR',
2987. PLAN_PMT_AMT: 'FLOAT',
  2988. TOT_CHRG_AMT: 'FLOAT'
  2989. })
  2990. formatter: oldstyle
  2991. - name: outpatient
  2992. description: Outpatient files from AHRQ.
  2993. meta:
  2994. external_location: >
  2995. read_csv('../data/syh_dr/syhdr_*_outpatient_2016.csv',
  2996. types = {
  2997. PERSON_ID: 'UBIGINT',
  2998. PERSON_WGHT: 'NUMERIC',
  2999. FACILITY_ID: 'UBIGINT',
  3000. CLM_CNTL_NUM: 'NUMERIC',
  3001. AT_SPCLTY: 'VARCHAR',
  3002. SRVC_BEG_DATE: 'DATE',
  3003. SRVC_END_DATE: 'DATE',
  3004. LOS: 'UINTEGER',
  3005. ADMSN_TYPE: 'VARCHAR',
3006. TOB_CD: 'VARCHAR',
                CLM_TYPE_CD: 'VARCHAR',
                DSCHRG_STUS: 'VARCHAR',
                PRMRY_DX_IMPUTED: 'VARCHAR',
                PRMRY_DX_CD: 'VARCHAR',
                ICD_DX_CD_1: 'VARCHAR',
                ICD_DX_CD_2: 'VARCHAR',
                ICD_DX_CD_3: 'VARCHAR',
                ICD_DX_CD_4: 'VARCHAR',
                ICD_DX_CD_5: 'VARCHAR',
                ICD_DX_CD_6: 'VARCHAR',
                ICD_DX_CD_7: 'VARCHAR',
                ICD_DX_CD_8: 'VARCHAR',
                ICD_DX_CD_9: 'VARCHAR',
                ICD_DX_CD_10: 'VARCHAR',
                ICD_DX_CD_11: 'VARCHAR',
                ICD_DX_CD_12: 'VARCHAR',
                ICD_DX_CD_13: 'VARCHAR',
                ICD_DX_CD_14: 'VARCHAR',
                ICD_DX_CD_15: 'VARCHAR',
                ICD_DX_CD_16: 'VARCHAR',
                ICD_DX_CD_17: 'VARCHAR',
                ICD_DX_CD_18: 'VARCHAR',
                ICD_DX_CD_19: 'VARCHAR',
                ICD_DX_CD_20: 'VARCHAR',
                ICD_DX_CD_21: 'VARCHAR',
                ICD_DX_CD_22: 'VARCHAR',
                ICD_DX_CD_23: 'VARCHAR',
                ICD_DX_CD_24: 'VARCHAR',
                ICD_DX_CD_25: 'VARCHAR',
                ICD_PRCDR_CD_1: 'VARCHAR',
                ICD_PRCDR_CD_2: 'VARCHAR',
                ICD_PRCDR_CD_3: 'VARCHAR',
                ICD_PRCDR_CD_4: 'VARCHAR',
                ICD_PRCDR_CD_5: 'VARCHAR',
                ICD_PRCDR_CD_6: 'VARCHAR',
                ICD_PRCDR_CD_7: 'VARCHAR',
                ICD_PRCDR_CD_8: 'VARCHAR',
                ICD_PRCDR_CD_9: 'VARCHAR',
                ICD_PRCDR_CD_10: 'VARCHAR',
                ICD_PRCDR_CD_11: 'VARCHAR',
                ICD_PRCDR_CD_12: 'VARCHAR',
                ICD_PRCDR_CD_13: 'VARCHAR',
                ICD_PRCDR_CD_14: 'VARCHAR',
                ICD_PRCDR_CD_15: 'VARCHAR',
                ICD_PRCDR_CD_16: 'VARCHAR',
                ICD_PRCDR_CD_17: 'VARCHAR',
                ICD_PRCDR_CD_18: 'VARCHAR',
                ICD_PRCDR_CD_19: 'VARCHAR',
                ICD_PRCDR_CD_20: 'VARCHAR',
                ICD_PRCDR_CD_21: 'VARCHAR',
                ICD_PRCDR_CD_22: 'VARCHAR',
                ICD_PRCDR_CD_23: 'VARCHAR',
                ICD_PRCDR_CD_24: 'VARCHAR',
                ICD_PRCDR_CD_25: 'VARCHAR',
                CPT_PRCDR_CD_1: 'VARCHAR',
                CPT_PRCDR_CD_2: 'VARCHAR',
                CPT_PRCDR_CD_3: 'VARCHAR',
                CPT_PRCDR_CD_4: 'VARCHAR',
                CPT_PRCDR_CD_5: 'VARCHAR',
                CPT_PRCDR_CD_6: 'VARCHAR',
                CPT_PRCDR_CD_7: 'VARCHAR',
                CPT_PRCDR_CD_8: 'VARCHAR',
                CPT_PRCDR_CD_9: 'VARCHAR',
                CPT_PRCDR_CD_10: 'VARCHAR',
                CPT_PRCDR_CD_11: 'VARCHAR',
                CPT_PRCDR_CD_12: 'VARCHAR',
                CPT_PRCDR_CD_13: 'VARCHAR',
                CPT_PRCDR_CD_14: 'VARCHAR',
                CPT_PRCDR_CD_15: 'VARCHAR',
                CPT_PRCDR_CD_16: 'VARCHAR',
                CPT_PRCDR_CD_17: 'VARCHAR',
                CPT_PRCDR_CD_18: 'VARCHAR',
                CPT_PRCDR_CD_19: 'VARCHAR',
                CPT_PRCDR_CD_20: 'VARCHAR',
                CPT_PRCDR_CD_21: 'VARCHAR',
                CPT_PRCDR_CD_22: 'VARCHAR',
                CPT_PRCDR_CD_23: 'VARCHAR',
                CPT_PRCDR_CD_24: 'VARCHAR',
                CPT_PRCDR_CD_25: 'VARCHAR',
                CPT_PRCDR_CD_26: 'VARCHAR',
                CPT_PRCDR_CD_27: 'VARCHAR',
                CPT_PRCDR_CD_28: 'VARCHAR',
                CPT_PRCDR_CD_29: 'VARCHAR',
                CPT_PRCDR_CD_30: 'VARCHAR',
                CPT_PRCDR_CD_31: 'VARCHAR',
                CPT_PRCDR_CD_32: 'VARCHAR',
                CPT_PRCDR_CD_33: 'VARCHAR',
                CPT_PRCDR_CD_34: 'VARCHAR',
                CPT_PRCDR_CD_35: 'VARCHAR',
                PLAN_PMT_AMT: 'FLOAT',
                TOT_CHRG_AMT: 'FLOAT'
              })
          formatter: oldstyle
      - name: person
        description: Person-Level files from AHRQ.
        meta:
          external_location: >
            read_csv('../data/syh_dr/syhdr_*_person_2016.csv',
              types = {
                PERSON_ID: 'UBIGINT',
                MCAID_BENE_ID: 'UBIGINT',
                PERSON_WGHT: 'NUMERIC',
                AGE_LOW: 'NUMERIC',
                AGE_HIGH: 'NUMERIC',
                SEX_IDENT_CD: 'VARCHAR',
                RACE_CD: 'VARCHAR',
                MCAID_SBMTTG_ST_CD: 'VARCHAR',
                STATE_CD: 'VARCHAR',
                COUNTY_FIPS_CD: 'VARCHAR',
                ZIP_CD: 'VARCHAR',
                RSN_ENRLMT_CD: 'VARCHAR',
                MDCD_ENRLMT_1: 'NUMERIC',
                MDCD_ENRLMT_2: 'NUMERIC',
                MDCD_ENRLMT_3: 'NUMERIC',
                MDCD_ENRLMT_4: 'NUMERIC',
                MDCD_ENRLMT_5: 'NUMERIC',
                MDCD_ENRLMT_6: 'NUMERIC',
                MDCD_ENRLMT_7: 'NUMERIC',
                MDCD_ENRLMT_8: 'NUMERIC',
                MDCD_ENRLMT_9: 'NUMERIC',
                MDCD_ENRLMT_10: 'NUMERIC',
                MDCD_ENRLMT_11: 'NUMERIC',
                MDCD_ENRLMT_12: 'NUMERIC',
                MDCD_MCO_ENRLMT_1: 'NUMERIC',
                MDCD_MCO_ENRLMT_2: 'NUMERIC',
                MDCD_MCO_ENRLMT_3: 'NUMERIC',
                MDCD_MCO_ENRLMT_4: 'NUMERIC',
                MDCD_MCO_ENRLMT_5: 'NUMERIC',
                MDCD_MCO_ENRLMT_6: 'NUMERIC',
                MDCD_MCO_ENRLMT_7: 'NUMERIC',
                MDCD_MCO_ENRLMT_8: 'NUMERIC',
                MDCD_MCO_ENRLMT_9: 'NUMERIC',
                MDCD_MCO_ENRLMT_10: 'NUMERIC',
                MDCD_MCO_ENRLMT_11: 'NUMERIC',
                MDCD_MCO_ENRLMT_12: 'NUMERIC',
                MDCD_CHIP_ENRLMT: 'NUMERIC',
                RSTCTD_BNFTS_IND: 'NUMERIC',
                DUAL_ELGBL_1: 'NUMERIC',
                DUAL_ELGBL_2: 'NUMERIC',
                DUAL_ELGBL_3: 'NUMERIC',
                DUAL_ELGBL_4: 'NUMERIC',
                DUAL_ELGBL_5: 'NUMERIC',
                DUAL_ELGBL_6: 'NUMERIC',
                DUAL_ELGBL_7: 'NUMERIC',
                DUAL_ELGBL_8: 'NUMERIC',
                DUAL_ELGBL_9: 'NUMERIC',
                DUAL_ELGBL_10: 'NUMERIC',
                DUAL_ELGBL_11: 'NUMERIC',
                DUAL_ELGBL_12: 'NUMERIC'
              })
          formatter: oldstyle
      - name: pharmacy
        description: Pharmacy files from AHRQ.
        meta:
          external_location: >
            read_csv('../data/syh_dr/syhdr_*_pharmacy_2016.csv',
              types = {
                PERSON_ID: 'UBIGINT',
                PERSON_WGHT: 'NUMERIC',
                PHMCY_CLM_NUM: 'NUMERIC',
                CLM_CNTL_NUM: 'NUMERIC',
                LINE_NBR: 'VARCHAR',
                FILL_DT: 'DATE',
                SYNTHETIC_DRUG_ID: 'VARCHAR',
                GENERIC_DRUG_NAME: 'VARCHAR',
                PLAN_PMT_AMT: 'FLOAT',
                TOT_CHRG_AMT: 'FLOAT'
              })
          formatter: oldstyle
      - name: provider
        description: Provider files from AHRQ.
        meta:
          external_location: >
            read_csv('../data/syh_dr/syhdr_*_provider_2016.csv',
              types = {
                Facility_ID: 'VARCHAR',
                Prvdr_Ctgry_Cd: 'VARCHAR',
                Prvdr_Ownrshp_Cd: 'VARCHAR',
                Prvdr_Prtcptn_Cd: 'VARCHAR'
              })
          formatter: oldstyle
  - name: bls_gov
    config:
      plugin: excel
    tables:
      - name: consumer_price_index
        meta:
          external_location: ../data/r-cpi-u-rs-allitems.xlsx
          sheet_name: 'All items'
  - name: mitre_org
    tables:
      - name: synthea
        meta:
          external_location: >
            read_json('../data/Gudrun69_Shaunna800_Goyette777_d5e33bd1-960e-bcf4-e5f9-9a4afc6d5a5e.json',
              columns = {
                entry: 'JSON[]'
              })
          formatter: oldstyle
  - name: social_determinants_of_health
    description: "These files contain [version 1 of AHRQ’s database on Social Determinants of Health (SDOH)](https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html), which was created under a project funded by the Patient Centered Outcomes Research (PCOR) Trust Fund. The database is curated from existing Federal datasets and other publicly available data sources. The purpose of these files is to make it easier to find a range of well-documented, readily linkable SDOH variables across domains without having to access multiple source files, facilitating SDOH research and analysis. Variables in the files correspond to five key SDOH domains: social context, economic context, education, physical infrastructure, and healthcare context. The files can be linked to other data by geography. Data are available at the county, ZIP Code, and census tract levels."
    config:
      plugin: excel
    tables:
      - name: sdoh_zipcode_2011
        description: Social Determinants of Health Zipcode data from 2011.
        meta:
          external_location: ../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2012
        description: Social Determinants of Health Zipcode data from 2012.
        meta:
          external_location: ../data/SDOH/SDOH_2012_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2013
        description: Social Determinants of Health Zipcode data from 2013.
        meta:
          external_location: ../data/SDOH/SDOH_2013_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2014
        description: Social Determinants of Health Zipcode data from 2014.
        meta:
          external_location: ../data/SDOH/SDOH_2014_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2015
        description: Social Determinants of Health Zipcode data from 2015.
        meta:
          external_location: ../data/SDOH/SDOH_2015_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2016
        description: Social Determinants of Health Zipcode data from 2016.
        meta:
          external_location: ../data/SDOH/SDOH_2016_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2017
        description: Social Determinants of Health Zipcode data from 2017.
        meta:
          external_location: ../data/SDOH/SDOH_2017_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2018
        description: Social Determinants of Health Zipcode data from 2018.
        meta:
          external_location: ../data/SDOH/SDOH_2018_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2019
        description: Social Determinants of Health Zipcode data from 2019.
        meta:
          external_location: ../data/SDOH/SDOH_2019_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_zipcode_2020
        description: Social Determinants of Health Zipcode data from 2020.
        meta:
          external_location: ../data/SDOH/SDOH_2020_ZIPCODE_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2009
        description: Social Determinants of Health County data from 2009.
        meta:
          external_location: ../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2010
        description: Social Determinants of Health County data from 2010.
        meta:
          external_location: ../data/SDOH/SDOH_2010_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2011
        description: Social Determinants of Health County data from 2011.
        meta:
          external_location: ../data/SDOH/SDOH_2011_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2012
        description: Social Determinants of Health County data from 2012.
        meta:
          external_location: ../data/SDOH/SDOH_2012_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2013
        description: Social Determinants of Health County data from 2013.
        meta:
          external_location: ../data/SDOH/SDOH_2013_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2014
        description: Social Determinants of Health County data from 2014.
        meta:
          external_location: ../data/SDOH/SDOH_2014_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2015
        description: Social Determinants of Health County data from 2015.
        meta:
          external_location: ../data/SDOH/SDOH_2015_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2016
        description: Social Determinants of Health County data from 2016.
        meta:
          external_location: ../data/SDOH/SDOH_2016_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2017
        description: Social Determinants of Health County data from 2017.
        meta:
          external_location: ../data/SDOH/SDOH_2017_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2018
        description: Social Determinants of Health County data from 2018.
        meta:
          external_location: ../data/SDOH/SDOH_2018_COUNTY_1_1.xlsx
          sheet_name: Data
      - name: sdoh_county_2019
        description: Social Determinants of Health County data from 2019.
        meta:
          external_location: ../data/SDOH/SDOH_2019_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_county_2020
        description: Social Determinants of Health County data from 2020.
        meta:
          external_location: ../data/SDOH/SDOH_2020_COUNTY_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2009
        description: Social Determinants of Health tract data from 2009.
        meta:
          external_location: ../data/SDOH/sdoh_2009_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2010
        description: Social Determinants of Health tract data from 2010.
        meta:
          external_location: ../data/SDOH/sdoh_2010_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2011
        description: Social Determinants of Health tract data from 2011.
        meta:
          external_location: ../data/SDOH/sdoh_2011_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2012
        description: Social Determinants of Health tract data from 2012.
        meta:
          external_location: ../data/SDOH/sdoh_2012_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2013
        description: Social Determinants of Health tract data from 2013.
        meta:
          external_location: ../data/SDOH/sdoh_2013_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2014
        description: Social Determinants of Health tract data from 2014.
        meta:
          external_location: ../data/SDOH/sdoh_2014_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2015
        description: Social Determinants of Health tract data from 2015.
        meta:
          external_location: ../data/SDOH/sdoh_2015_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2016
        description: Social Determinants of Health tract data from 2016.
        meta:
          external_location: ../data/SDOH/sdoh_2016_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2017
        description: Social Determinants of Health tract data from 2017.
        meta:
          external_location: ../data/SDOH/sdoh_2017_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2018
        description: Social Determinants of Health tract data from 2018.
        meta:
          external_location: ../data/SDOH/sdoh_2018_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2019
        description: Social Determinants of Health tract data from 2019.
        meta:
          external_location: ../data/SDOH/sdoh_2019_tract_1_0.xlsx
          sheet_name: Data
      - name: sdoh_tract_2020
        description: Social Determinants of Health tract data from 2020.
        meta:
          external_location: ../data/SDOH/sdoh_2020_tract_1_0.xlsx
          sheet_name: Data
```

(This file lives in the `data_processing/models` location.)

Now write a short Python script, as a dbt model, that downloads every file listed here: https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html#download

That should fix these errors.

Debug it. Proceed step-by-step as an elite site reliability/DevOps/L20 principal warez engineer at Google, returning as few tokens as possible, to debug this dbt/ELT/analytics engineering code. Give the complete corrected code!
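
A minimal sketch of such a downloader (plain Python, not yet wrapped as a dbt model). The filenames are derived from the `external_location` paths in the YAML above; `BASE_URL` is a placeholder, not a confirmed AHRQ endpoint — take the real links from the download page before running it:

```python
"""Sketch: fetch the AHRQ SDOH workbooks referenced in sources.yml.

Assumption: BASE_URL is a placeholder, NOT a verified AHRQ endpoint;
pull the real links from the SDOH download page. Filename patterns
mirror the external_location paths in the YAML above.
"""
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://example.invalid/sdoh"  # placeholder -- replace with the real host


def sdoh_filenames() -> list[str]:
    """Every workbook name referenced in sources.yml."""
    names = [f"SDOH_{y}_ZIPCODE_1_0.xlsx" for y in range(2011, 2021)]
    for y in range(2009, 2021):
        version = "1_1" if y == 2018 else "1_0"  # the 2018 county file is v1.1
        names.append(f"SDOH_{y}_COUNTY_{version}.xlsx")
    names += [f"sdoh_{y}_tract_1_0.xlsx" for y in range(2009, 2021)]
    return names


def download_all(dest: str = "../data/SDOH") -> None:
    """Download any workbook not already on disk into dest."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    for name in sdoh_filenames():
        target = dest_dir / name
        if not target.exists():  # idempotent: skip files already fetched
            urlretrieve(f"{BASE_URL}/{name}", str(target))

# download_all()  # uncomment once BASE_URL points at the real files
```

To make this a dbt Python model it would need to be exposed via a `def model(dbt, session)` entry point; alternatively it could run as a standalone pre-`dbt run` step, which keeps the sources purely declarative.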