- take these errors:
- ```
- healthcare-data/data_processing on HEAD (aa330b9) [?] via 🐍 v3.10.14 (.venv) took 7s
- ❯ dbt run --select source:social_determinants_of_health+
- 12:02:17 Running with dbt=1.7.11
- 12:02:18 Registered adapter: duckdb=1.7.3
- 12:02:18 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
- There are 1 unused configuration paths:
- - models.healthcare_data.example
- 12:02:18 Found 22 models, 41 sources, 0 exposures, 0 metrics, 527 macros, 0 groups, 0 semantic models
- 12:02:18
- 12:02:38 Concurrency: 1 threads (target='dev')
- 12:02:38
- 12:02:38 1 of 3 START sql external model main.sdoh_county ............................... [RUN]
- 12:02:38 Unhandled error while executing
- [Errno 2] No such file or directory: '../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx'
- 12:02:38 1 of 3 ERROR creating sql external model main.sdoh_county ...................... [ERROR in 0.02s]
- 12:02:38 2 of 3 START sql external model main.sdoh_tract ................................ [RUN]
- 12:02:38 Unhandled error while executing
- [Errno 2] No such file or directory: '../data/SDOH/sdoh_2009_tract_1_0.xlsx'
- 12:02:38 2 of 3 ERROR creating sql external model main.sdoh_tract ....................... [ERROR in 0.01s]
- 12:02:38 3 of 3 START sql external model main.sdoh_zipcode .............................. [RUN]
- 12:02:38 Unhandled error while executing
- [Errno 2] No such file or directory: '../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx'
- 12:02:38 3 of 3 ERROR creating sql external model main.sdoh_zipcode ..................... [ERROR in 0.01s]
- 12:02:38
- 12:02:38 Finished running 3 external models in 0 hours 0 minutes and 20.56 seconds (20.56s).
- 12:02:38
- 12:02:38 Completed with 3 errors and 0 warnings:
- 12:02:38
- 12:02:38 [Errno 2] No such file or directory: '../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx'
- 12:02:38
- 12:02:38 [Errno 2] No such file or directory: '../data/SDOH/sdoh_2009_tract_1_0.xlsx'
- 12:02:38
- 12:02:38 [Errno 2] No such file or directory: '../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx'
- 12:02:38
- 12:02:38 Done. PASS=0 WARN=0 ERROR=3 SKIP=0 TOTAL=3
- ```
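All three failures above are plain missing-file errors: the external models expect SDOH spreadsheets at relative paths that don't exist. A minimal pre-flight check (a sketch, not part of the project; file names are copied from the error log, and the base directory is an assumption you should adjust) can surface this before invoking `dbt run`:

```python
# Hypothetical pre-flight check for the SDOH source spreadsheets referenced by
# the failing models. The file names come from the dbt error log above.
from pathlib import Path

EXPECTED_FILES = [
    "SDOH_2009_COUNTY_1_0.xlsx",
    "sdoh_2009_tract_1_0.xlsx",
    "SDOH_2011_ZIPCODE_1_0.xlsx",
]

def missing_sdoh_files(base_dir="../data/SDOH"):
    """Return the expected SDOH files that are absent from base_dir."""
    base = Path(base_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_sdoh_files()
    if missing:
        print("Missing files:", ", ".join(missing))
```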
- and take this code:
- ```
- .
- ├── download_industry_codes.ipynb
- ├── generate_sql_schemas_for_extracted_csv_files.py
- ├── generate_sql_with_enum_types.py
- ├── generate_sql_with_enum_types_and_mapped_values.py
- ├── generate_sql_with_enum_types_and_mapped_values_renamed.py
- ├── generate_sql_with_mapped_values.py
- ├── generate_sql_with_renamed_columns.py
- ├── generate_sql_with_types.py
- ├── parse_data_dictionary.py
- └── process_one_year_of_american_community_survey_data.sh
- 1 directory, 10 files
- ```
- Relative file path: process_one_year_of_american_community_survey_data.sh
- ```
- #!/bin/bash
- # Function to print time taken for each step in minutes and seconds
- print_time_taken() {
- local start=$1
- local end=$2
- local step=$3
- local duration=$((end-start))
- local minutes=$((duration / 60))
- local seconds=$((duration % 60))
- echo "Time taken for $step: ${minutes} minutes and ${seconds} seconds."
- }
- # Check if the user has provided a year
- if [ -z "$1" ]
- then
- echo "You must provide a four-digit year as an argument."
- exit 1
- fi
- YEAR=$1
- # Capture start time
- start=$(date +%s)
- echo "Starting process for the year: $YEAR"
- start_step=$(date +%s)
- echo "Step 1: Listing URLs of the $YEAR 1-Year ACS PUMS data"
- dbt run --select "public_use_microdata_sample.list_urls" \
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- --threads 8
- echo "Step 1: Saving the database of the $YEAR 1-Year ACS PUMS data"
- dbt run --select "public_use_microdata_sample.urls" \
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- --threads 8
- # echo "Step 1: Listing URLs of the $YEAR 1-Year ACS PUMS data"
- # dbt run --select "public_use_microdata_sample.list_urls" \
- # --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- # --threads 8
- # echo "Step 1: Saving the database of the $YEAR 1-Year ACS PUMS data"
- # dbt run --select "public_use_microdata_sample.urls" \
- # --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- # --threads 8
- echo "Checking URLs..."
- duckdb -c "SELECT * FROM '~/data/american_community_survey/urls.parquet'"
- end_step=$(date +%s)
- print_time_taken $start_step $end_step "Step 1"
- start_step=$(date +%s)
- echo "Step 2: Downloading and extracting the archives for all of the 50 states' PUMS files"
- dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- --threads 8
- echo "Saving paths to the CSV files..."
- dbt run --select "public_use_microdata_sample.csv_paths" \
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- --threads 8
- echo "Checking presence of CSV files..."
- duckdb -c "SELECT * FROM '~/data/american_community_survey/csv_paths.parquet'"
- end_step=$(date +%s)
- print_time_taken $start_step $end_step "Step 2"
- echo "Step 3: Parsing the data dictionary"
- # dbt run --select "public_use_microdata_sample.parse_data_dictionary" \
- # --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- # --threads 8
- # dbt run --select "public_use_microdata_sample.data_dictionary_path" \
- # --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- # --threads 8
- # dbt run --select "public_use_microdata_sample.parse_data_dictionary" \
- # --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- # --threads 8
- # dbt run --select "public_use_microdata_sample.data_dictionary_path" \
- # --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- # --threads 8
- # echo "Checking data dictionary path..."
- # duckdb -c "SELECT * FROM '~/data/american_community_survey/data_dictionary_path.parquet'"
- # echo "Step 4: Generating SQL commands for mapping variables"
- # python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py \
- # ~/data/american_community_survey/csv_paths.parquet \
- # ~/data/american_community_survey/PUMS_Data_Dictionary_$YEAR.json
- # echo "Step 5: Executing generated SQL queries"
- # dbt run --select "public_use_microdata_sample.generated.$YEAR.enum_types_mapped_renamed+" \
- # --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/'$YEAR'/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_'$YEAR'.csv", "output_path": "~/data/american_community_survey"}' \
- # --threads 8
- # echo "Step 6: Testing presence and size of compressed parquet files"
- # du -sh ~/data/american_community_survey/$YEAR
- # du -hc ~/data/american_community_survey/*$YEAR.parquet
- # echo "Checking SQL query execution..."
- # duckdb -c "SELECT COUNT(*) FROM '~/data/american_community_survey/*individual_people_united_states*$YEAR.parquet'"
- # Capture end time
- end_step=$(date +%s)
- print_time_taken $start_step $end_step "Step 3"
- # Calculate and report the total time taken
- print_time_taken $start $end_step "Total"
- ```
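The script passes the same `--vars` JSON blob to every `dbt run`, differing only in the year. A small helper (a sketch, not in the original script; URL patterns are copied from the commands above) would build it once:

```shell
# Sketch: build the repeated dbt --vars JSON for a given year in one place,
# so each dbt invocation stays short and the URLs can't drift apart.
make_dbt_vars() {
  local year=$1
  printf '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/%s/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_%s.csv", "output_path": "~/data/american_community_survey"}' "$year" "$year"
}

# Usage:
# dbt run --select "public_use_microdata_sample.list_urls" \
#   --vars "$(make_dbt_vars 2022)" --threads 8
```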
- Relative file path: generate_sql_with_enum_types_and_mapped_values.py
- ```
- import pandas as pd
- import sys
- import os
- import json
- def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
- type_char = folder_name.split("_")[1][0].lower()
- folder_code = folder_name.split("_")[1][1:].upper()
- human_readable_name = "individual_people" if type_char == "p" else "housing_units"
- if folder_code == "US":
- csv_code = csv_name.split("_")[1][1:].upper()
- name = national_lookup.get(csv_code, "Unknown national code")
- elif len(folder_code) == 2:
- name = state_lookup.get(folder_code, "Unknown state code")
- else:
- raise ValueError(f"Invalid code: {folder_code}")
- return f"{human_readable_name}_{name.replace(' ', '_')}".lower()
- def clean_enum_value(value):
- value = value.replace("'", "")
- value = value.replace("N/A", "Not applicable")
- value = value.replace("/", " or ")
- value = value.replace("(", "- ")
- value = value.replace(")", "")
- return value
- if len(sys.argv) < 3:
- print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
- sys.exit(1)
- parquet_database_path, data_dictionary_path = sys.argv[1:3]
- with open(data_dictionary_path, "r") as json_file:
- data_dict = json.load(json_file)
- state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
- national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}
- df_csv_paths = pd.read_parquet(parquet_database_path)
- models_dir = "models/public_use_microdata_sample/generated/enum_types_mapped"
- os.makedirs(models_dir, exist_ok=True)
- def should_include_key(description):
- exclude_criteria = ["weight", "identifier", "number", "age", "income", "time", "hours", "weeks", "puma", "total", "fee", "cost", "amount", "rent", "value", "taxes"]
- # Exclude the column if any exclusion keyword appears in its description, unless the description marks a flag column.
- if any(criterion in description.lower() and "flag" not in description.lower() for criterion in exclude_criteria):
- return False
- return True
- for csv_path in df_csv_paths["csv_path"]:
- folder_name = os.path.basename(os.path.dirname(csv_path))
- csv_name = os.path.basename(csv_path).split(".")[0]
- materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)
- df_headers = pd.read_csv(csv_path, nrows=0)
- column_types = {column: 'VARCHAR' for column in df_headers.columns}
- columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
- sql_select_parts = ["SELECT"]
- enum_creation_statements = []
- table_creation_statement = f"CREATE TABLE {materialized_name} ("
- column_definitions = []
- newline = "\n"
- for header, details in data_dict.items():
- if "Values" in details:
- if header in df_headers.columns:
- enum_values = [f"'{key.strip()}'" for key, value in details["Values"].items()]
- if should_include_key(details["Description"]) and len(enum_values) > 0:
- enum_name = f"{header}_enum"
- value_mapping = "\n\t\t".join([
- f"WHEN '{clean_enum_value(code)}' THEN '{clean_enum_value(label)}'"
- for code, label in data_dict[header]["Values"].items()
- ])
- enum_labels = [f"'{clean_enum_value(label)}'" for code, label in data_dict[header]["Values"].items()]
- mapped_column = f"""CASE {header}\n\t\t{value_mapping}\n\tEND::ENUM ({','.join(enum_labels)}) AS {header},"""
- column_definitions.append(mapped_column)
- else:
- column_definitions.append(f" {header}::VARCHAR,")
- else:
- print(f"Column {header} not found in {csv_name}.csv")
- # Strip the trailing comma from the last generated column definition
- column_definitions[-1] = column_definitions[-1].rstrip(',')
- sql_select_statement = "\n".join(sql_select_parts)
- newline = "\n"
- newline_with_comma = ",\n"
- # Combine ENUM creation, table creation, and COPY command in SQL content
- sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
- {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}_enum_types_mapped.parquet') }}}}
- SELECT
- {newline.join(column_definitions)}
- FROM read_csv('{csv_path}',
- parallel=False,
- all_varchar=True,
- auto_detect=True)"""
- sql_file_path = os.path.join(models_dir, f"{materialized_name}_enum_mapped.sql")
- with open(sql_file_path, "w") as sql_file:
- sql_file.write(sql_content)
- ```
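The `clean_enum_value` helper above sanitizes data-dictionary labels before they become DuckDB ENUM members. A quick sanity check (logic copied verbatim from the script so the snippet is self-contained):

```python
# Sanity check of clean_enum_value: quotes are dropped, "N/A" expanded,
# slashes and parentheses rewritten so labels are safe inside ENUM ('...').
def clean_enum_value(value):
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value

print(clean_enum_value("N/A (GQ/vacant)"))  # → Not applicable - GQ or vacant
```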
- Relative file path: generate_sql_with_enum_types.py
- ```
- import pandas as pd
- import sys
- import os
- import json
- def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
- type_char = folder_name.split("_")[1][0].lower()
- folder_code = folder_name.split("_")[1][1:].upper()
- human_readable_name = "individual_people" if type_char == "p" else "housing_units"
- if folder_code == "US":
- csv_code = csv_name.split("_")[1][1:].upper()
- name = national_lookup.get(csv_code, "Unknown national code")
- elif len(folder_code) == 2:
- name = state_lookup.get(folder_code, "Unknown state code")
- else:
- raise ValueError(f"Invalid code: {folder_code}")
- return f"{human_readable_name}_{name.replace(' ', '_')}".lower()
- def clean_enum_value(value):
- value = value.replace("'", "")
- value = value.replace("N/A", "Not applicable")
- value = value.replace("/", " or ")
- value = value.replace("(", "- ")
- value = value.replace(")", "")
- return value
- if len(sys.argv) < 3:
- print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
- sys.exit(1)
- parquet_database_path, data_dictionary_path = sys.argv[1:3]
- with open(data_dictionary_path, "r") as json_file:
- data_dict = json.load(json_file)
- state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
- national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}
- df_csv_paths = pd.read_parquet(parquet_database_path)
- models_dir = "models/public_use_microdata_sample/generated/enum_types"
- os.makedirs(models_dir, exist_ok=True)
- def should_include_key(description):
- exclude_criteria = ["weight", "identifier", "number", "age", "income", "time", "hours", "weeks", "puma", "total", "fee", "cost", "amount", "rent", "value", "taxes"]
- # Exclude the column if any exclusion keyword appears in its description, unless the description marks a flag column.
- if any(criterion in description.lower() and "flag" not in description.lower() for criterion in exclude_criteria):
- return False
- return True
- for csv_path in df_csv_paths["csv_path"]:
- folder_name = os.path.basename(os.path.dirname(csv_path))
- csv_name = os.path.basename(csv_path).split(".")[0]
- materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)
- df_headers = pd.read_csv(csv_path, nrows=0)
- column_types = {column: 'VARCHAR' for column in df_headers.columns}
- columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
- sql_select_parts = ["SELECT"]
- enum_creation_statements = []
- table_creation_statement = f"CREATE TABLE {materialized_name} ("
- column_definitions = []
- for header, details in data_dict.items():
- if "Values" in details:
- if header in df_headers.columns:
- # enum_values = [f"'{key}'" for key, value in details["Values"].items() if should_include_value(key, details["Description"], value)]
- enum_values = [f"'{key.strip()}'" for key, value in details["Values"].items()]
- if should_include_key(details["Description"]) and len(enum_values) > 0:
- enum_name = f"{header}_enum"
- enum_creation_statements.append(f"CREATE TYPE {enum_name} AS ENUM ({','.join(enum_values)});")
- # column_definitions.append(f" {header} {enum_name}")
- column_definitions.append(f" {header}::ENUM ({','.join(enum_values)}),")
- # sql_select_parts.append(f" CAST({header} AS {enum_name}) AS {header},")
- else:
- column_definitions.append(f" {header}::VARCHAR,")
- # sql_select_parts.append(f" {header},")
- else:
- print(f"Column {header} not found in {csv_name}.csv")
- # Strip the trailing comma from the last generated column definition
- column_definitions[-1] = column_definitions[-1].rstrip(',')
- sql_select_statement = "\n".join(sql_select_parts)
- newline = "\n"
- newline_with_comma = ",\n"
- # sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
- # {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
- # {newline.join(enum_creation_statements)}
- # {sql_select_statement}
- # FROM read_csv_auto('{csv_path}')
- # """
- # Combine ENUM creation, table creation, and COPY command in SQL content
- sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
- {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}_enum_types.parquet') }}}}
- SELECT
- {newline.join(column_definitions)}
- FROM read_csv('{csv_path}',
- parallel=False,
- all_varchar=True,
- auto_detect=True)"""
- sql_file_path = os.path.join(models_dir, f"{materialized_name}_enum.sql")
- with open(sql_file_path, "w") as sql_file:
- sql_file.write(sql_content)
- ```
- Relative file path: generate_sql_with_enum_types_and_mapped_values_renamed.py
- ```
- import pandas as pd
- import sys
- import os
- import json
- def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
- type_char = folder_name.split("_")[1][0].lower()
- folder_code = folder_name.split("_")[1][1:].upper()
- human_readable_name = "individual_people" if type_char == "p" else "housing_units"
- if folder_code == "US":
- csv_code = csv_name.split("_")[1][1:].upper()
- name = national_lookup.get(csv_code, "Unknown national code")
- elif len(folder_code) == 2:
- name = state_lookup.get(folder_code, "Unknown state code")
- else:
- raise ValueError(f"Invalid code: {folder_code}")
- return f"{human_readable_name}_{name.replace(' ', '_')}".lower()
- def clean_enum_value(value):
- value = value.replace("'", "")
- value = value.replace("N/A", "Not applicable")
- value = value.replace("/", " or ")
- value = value.replace("(", "- ")
- value = value.replace(")", "")
- return value
- if len(sys.argv) < 3:
- print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
- sys.exit(1)
- parquet_database_path, data_dictionary_path = sys.argv[1:3]
- year = data_dictionary_path.split("/")[-1].split(".")[0].split("_")[-1]
- print(f"Year: {year}")
- with open(data_dictionary_path, "r") as json_file:
- data_dict = json.load(json_file)
- state_lookup = {
- code: name
- for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]
- }
- national_lookup = {
- "USA": "United States first tranche",
- "USB": "United States second tranche",
- }
- df_csv_paths = pd.read_parquet(parquet_database_path)
- models_dir = (
- f"models/public_use_microdata_sample/generated/{year}/enum_types_mapped_renamed"
- )
- os.makedirs(models_dir, exist_ok=True)
- def should_include_key(description):
- exclude_criteria = [
- "weight",
- "identifier",
- "number",
- "age",
- "income",
- "time",
- "hours",
- "weeks",
- "puma",
- "total",
- "fee",
- "cost",
- "amount",
- "rent",
- "value",
- "taxes",
- ]
- # Exclude the column if any exclusion keyword appears in its description, unless the description marks a flag column.
- if any(
- criterion in description.lower() and "flag" not in description.lower()
- for criterion in exclude_criteria
- ):
- return False
- return True
- for csv_path in df_csv_paths["csv_path"]:
- folder_name = os.path.basename(os.path.dirname(csv_path))
- csv_name = os.path.basename(csv_path).split(".")[0]
- materialized_name = generate_materialized_name(
- folder_name, csv_name, state_lookup, national_lookup
- )
- df_headers = pd.read_csv(csv_path, nrows=0)
- column_types = {column: "VARCHAR" for column in df_headers.columns}
- columns = ", ".join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
- sql_select_parts = ["SELECT"]
- enum_creation_statements = []
- table_creation_statement = f"CREATE TABLE {materialized_name} ("
- column_definitions = []
- newline = "\n"
- for header, details in data_dict.items():
- if "Values" in details:
- if header in df_headers.columns:
- enum_values = [
- f"'{key.strip()}'" for key, value in details["Values"].items()
- ]
- col_info = data_dict.get(header, {"Description": header})
- description = col_info["Description"]
- if should_include_key(details["Description"]) and len(enum_values) > 0:
- enum_name = f"{header}_enum"
- value_mapping = "\n\t\t".join(
- [
- f"WHEN '{clean_enum_value(code)}' THEN '{clean_enum_value(label)}'"
- for code, label in data_dict[header]["Values"].items()
- ]
- )
- enum_labels = [
- f"'{clean_enum_value(label)}'"
- for code, label in data_dict[header]["Values"].items()
- ]
- mapped_column = f"""CASE {header}\n\t\t{value_mapping}\n\tEND::ENUM ({','.join(enum_labels)}) AS "{description}","""
- column_definitions.append(mapped_column)
- else:
- column_definitions.append(
- f' {header}::VARCHAR AS "{description}",'
- )
- else:
- # print(f"Column {header} not found in {csv_name}.csv")
- pass
- # Strip the trailing comma from the last generated column definition
- column_definitions[-1] = column_definitions[-1].rstrip(",")
- sql_select_statement = "\n".join(sql_select_parts)
- newline = "\n"
- newline_with_comma = ",\n"
- username = os.environ.get("USER")
- path_without_user = "~/" + csv_path.split(username + '/')[1]
- # Combine ENUM creation, table creation, and COPY command in SQL content
- sql_content = f"""-- SQL transformation for {csv_name} generated by models/public_use_microdata_sample/scripts/{os.path.basename(__file__)}
- {{{{ config(materialized='external', location=var('output_path') + '/acs_pums_{materialized_name}_{year}.parquet') }}}}
- SELECT
- {newline.join(column_definitions)}
- FROM read_csv('{path_without_user}',
- parallel=False,
- all_varchar=True,
- auto_detect=True)
- """
- sql_file_path = os.path.join(
- models_dir, f"{materialized_name}_enum_mapped_renamed_{year}.sql"
- )
- with open(sql_file_path, "w") as sql_file:
- sql_file.write(sql_content)
- ```
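The `state_lookup` comprehension used in these scripts assumes each `ST` value in the data dictionary is a `"Name/CODE"` string. An illustration with a hypothetical two-entry dictionary fragment (real dictionaries list all states):

```python
# Illustration of the state_lookup construction: split "Name/CODE" values and
# invert them into a CODE -> Name mapping.
data_dict = {"ST": {"Values": {"01": "Alabama/AL", "02": "Alaska/AK"}}}

state_lookup = {
    code: name
    for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]
}
print(state_lookup)  # {'AL': 'Alabama', 'AK': 'Alaska'}
```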
- Relative file path: generate_sql_with_types.py
- ```
- import pandas as pd
- import duckdb
- import sys
- import os
- import json
- def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
- # Extract the state code and type (P or H) from the folder name
- type_char = folder_name.split("_")[1][0].lower()  # Assuming folder format is "csv_PXX" or "csv_HXX"
- folder_code = folder_name.split("_")[1][1:].upper()
- # Determine the human-readable name based on the type character
- human_readable_name = "individual_people" if type_char == "p" else "housing_units"
- if folder_code == "US":
- # Get the national-level name from the lookup table
- csv_code = csv_name.split("_")[1][1:].upper()
- name = national_lookup.get(csv_code, "Unknown national code")
- print(csv_code, name)
- elif len(folder_code) == 2:
- # Get the state name from the lookup table
- name = state_lookup.get(folder_code, "Unknown state code")
- print(folder_code, name)
- else:
- raise ValueError(f"Invalid code: {folder_code}")
- # Return the formatted name
- return f"{human_readable_name}_{name.replace(' ', '_')}".lower()
- if len(sys.argv) < 3:
- print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
- sys.exit(1)
- parquet_database_path, data_dictionary_path = sys.argv[1:3]
- # Load the data dictionary from the JSON file
- with open(data_dictionary_path, "r") as json_file:
- data_dict = json.load(json_file)
- # Generate lookup table for state codes
- state_lookup = {
- code: name
- for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]
- }
- # define short codes for first and second tranches of national-level data
- national_lookup = {
- "USA": "United States first tranche",
- "USB": "United States second tranche",
- }
- # Connect to DuckDB
- conn = duckdb.connect(database=":memory:", read_only=False)
- # Assuming the Parquet file contains paths to CSV files
- df_csv_paths = pd.read_parquet(parquet_database_path)
- models_dir = "models/public_use_microdata_sample/generated/with_types"
- os.makedirs(models_dir, exist_ok=True)
- for csv_path in df_csv_paths["csv_path"]:
- folder_name = os.path.basename(os.path.dirname(csv_path))
- csv_name = os.path.basename(csv_path)
- csv_name = csv_name.split(".")[0]
- materialized_name = generate_materialized_name(
- folder_name, csv_name, state_lookup, national_lookup
- )
- df_headers = pd.read_csv(csv_path, nrows=0)
- column_types = {column: 'VARCHAR' for column in df_headers.columns}
- columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
- sql_select_parts = ["SELECT"]
- for header in df_headers.columns:
- col_info = data_dict.get(header, {"Description": header})
- description = col_info["Description"]
- sql_select_parts.append(f' {header} AS "{description}",')
- sql_select_parts[-1] = sql_select_parts[-1].rstrip(",")
- sql_select_statement = "\n".join(sql_select_parts)
- sql_content = f"""-- SQL transformation for {os.path.basename(csv_path)} generated by {os.path.basename(__file__)}
- {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
- {sql_select_statement}
- FROM read_csv('{csv_path}', columns={{{columns}}})"""
- sql_file_path = os.path.join(models_dir, f"{materialized_name}.sql")
- with open(sql_file_path, "w") as sql_file:
- sql_file.write(sql_content)
- ```
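All four scripts share `generate_materialized_name`, which maps a PUMS folder/file pair to a human-readable model name. A walkthrough with hypothetical inputs (logic copied from the script above; `csv_pal` / `psam_husa` are example names following the `csv_PXX` convention, not real paths from this repo):

```python
# Walkthrough of generate_materialized_name for a state-level person file and
# a national-level housing file (logic copied from the script above).
def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
    type_char = folder_name.split("_")[1][0].lower()     # "p" or "h"
    folder_code = folder_name.split("_")[1][1:].upper()  # state code or "US"
    human_readable_name = "individual_people" if type_char == "p" else "housing_units"
    if folder_code == "US":
        csv_code = csv_name.split("_")[1][1:].upper()
        name = national_lookup.get(csv_code, "Unknown national code")
    elif len(folder_code) == 2:
        name = state_lookup.get(folder_code, "Unknown state code")
    else:
        raise ValueError(f"Invalid code: {folder_code}")
    return f"{human_readable_name}_{name.replace(' ', '_')}".lower()

print(generate_materialized_name("csv_pal", "psam_pal", {"AL": "Alabama"}, {}))
# → individual_people_alabama
```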
- Relative file path: download_industry_codes.ipynb
- ```
- {
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [],
- "source": [
- "df = pd.read_excel(\"https://www.census.gov/naics/2017NAICS/2-6%20digit_2017_Codes.xlsx\", dtype=str, skiprows=0)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>Seq. No.</th>\n",
- " <th>2017 NAICS US Code</th>\n",
- " <th>2017 NAICS US Title</th>\n",
- " <th>Unnamed: 3</th>\n",
- " <th>Unnamed: 4</th>\n",
- " <th>Unnamed: 5</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1</td>\n",
- " <td>11</td>\n",
- " <td>Agriculture, Forestry, Fishing and Hunting</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>2</td>\n",
- " <td>111</td>\n",
- " <td>Crop Production</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>3</td>\n",
- " <td>1111</td>\n",
- " <td>Oilseed and Grain Farming</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>4</td>\n",
- " <td>11111</td>\n",
- " <td>Soybean Farming</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>...</th>\n",
- " <td>...</td>\n",
- " <td>...</td>\n",
- " <td>...</td>\n",
- " <td>...</td>\n",
- " <td>...</td>\n",
- " <td>...</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2192</th>\n",
- " <td>2192</td>\n",
- " <td>9281</td>\n",
- " <td>National Security and International Affairs</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2193</th>\n",
- " <td>2193</td>\n",
- " <td>92811</td>\n",
- " <td>National Security</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2194</th>\n",
- " <td>2194</td>\n",
- " <td>928110</td>\n",
- " <td>National Security</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2195</th>\n",
- " <td>2195</td>\n",
- " <td>92812</td>\n",
- " <td>International Affairs</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2196</th>\n",
- " <td>2196</td>\n",
- " <td>928120</td>\n",
- " <td>International Affairs</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "<p>2197 rows × 6 columns</p>\n",
- "</div>"
- ],
- "text/plain": [
- " Seq. No. 2017 NAICS US Code \\\n",
- "0 NaN NaN \n",
- "1 1 11 \n",
- "2 2 111 \n",
- "3 3 1111 \n",
- "4 4 11111 \n",
- "... ... ... \n",
- "2192 2192 9281 \n",
- "2193 2193 92811 \n",
- "2194 2194 928110 \n",
- "2195 2195 92812 \n",
- "2196 2196 928120 \n",
- "\n",
- " 2017 NAICS US Title Unnamed: 3 Unnamed: 4 \\\n",
- "0 NaN NaN NaN \n",
- "1 Agriculture, Forestry, Fishing and Hunting NaN NaN \n",
- "2 Crop Production NaN NaN \n",
- "3 Oilseed and Grain Farming NaN NaN \n",
- "4 Soybean Farming NaN NaN \n",
- "... ... ... ... \n",
- "2192 National Security and International Affairs NaN NaN \n",
- "2193 National Security NaN NaN \n",
- "2194 National Security NaN NaN \n",
- "2195 International Affairs NaN NaN \n",
- "2196 International Affairs NaN NaN \n",
- "\n",
- " Unnamed: 5 \n",
- "0 NaN \n",
- "1 NaN \n",
- "2 NaN \n",
- "3 NaN \n",
- "4 NaN \n",
- "... ... \n",
- "2192 NaN \n",
- "2193 NaN \n",
- "2194 NaN \n",
- "2195 NaN \n",
- "2196 NaN \n",
- "\n",
- "[2197 rows x 6 columns]"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 43,
- "metadata": {},
- "outputs": [],
- "source": [
- "names = df[df['2017 NAICS US Code'].str.len() == 2][['2017 NAICS US Title']].values.tolist()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[['Agriculture, Forestry, Fishing and Hunting'],\n",
- " ['Mining, Quarrying, and Oil and Gas Extraction'],\n",
- " ['Utilities'],\n",
- " ['Construction'],\n",
- " ['Wholesale Trade'],\n",
- " ['Information'],\n",
- " ['Finance and Insurance'],\n",
- " ['Real Estate and Rental and Leasing'],\n",
- " ['Professional, Scientific, and Technical Services'],\n",
- " ['Management of Companies and Enterprises'],\n",
- " ['Administrative and Support and Waste Management and Remediation Services'],\n",
- " ['Educational Services'],\n",
- " ['Health Care and Social Assistance'],\n",
- " ['Arts, Entertainment, and Recreation'],\n",
- " ['Accommodation and Food Services'],\n",
- " ['Other Services (except Public Administration)'],\n",
- " ['Public Administration']]"
- ]
- },
- "execution_count": 44,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "names"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 53,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 53,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df[df['2017 NAICS US Code'].str.len() == 1][['2017 NAICS US Title']].values.tolist()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": 50,
- "metadata": {},
- "outputs": [],
- "source": [
- "names = df[df['2017 NAICS US Code'].str.len() == 3][['2017 NAICS US Title']].values.tolist()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 51,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[['Crop Production'],\n",
- " ['Animal Production and Aquaculture'],\n",
- " ['Forestry and Logging'],\n",
- " ['Fishing, Hunting and Trapping'],\n",
- " ['Support Activities for Agriculture and Forestry'],\n",
- " ['Oil and Gas Extraction'],\n",
- " ['Mining (except Oil and Gas)'],\n",
- " ['Support Activities for Mining'],\n",
- " ['Utilities '],\n",
- " ['Construction of Buildings'],\n",
- " ['Heavy and Civil Engineering Construction'],\n",
- " ['Specialty Trade Contractors'],\n",
- " ['Food Manufacturing'],\n",
- " ['Beverage and Tobacco Product Manufacturing'],\n",
- " ['Textile Mills'],\n",
- " ['Textile Product Mills'],\n",
- " ['Apparel Manufacturing'],\n",
- " ['Leather and Allied Product Manufacturing'],\n",
- " ['Wood Product Manufacturing'],\n",
- " ['Paper Manufacturing'],\n",
- " ['Printing and Related Support Activities'],\n",
- " ['Petroleum and Coal Products Manufacturing'],\n",
- " ['Chemical Manufacturing'],\n",
- " ['Plastics and Rubber Products Manufacturing'],\n",
- " ['Nonmetallic Mineral Product Manufacturing'],\n",
- " ['Primary Metal Manufacturing'],\n",
- " ['Fabricated Metal Product Manufacturing'],\n",
- " ['Machinery Manufacturing'],\n",
- " ['Computer and Electronic Product Manufacturing'],\n",
- " ['Electrical Equipment, Appliance, and Component Manufacturing'],\n",
- " ['Transportation Equipment Manufacturing'],\n",
- " ['Furniture and Related Product Manufacturing'],\n",
- " ['Miscellaneous Manufacturing'],\n",
- " ['Merchant Wholesalers, Durable Goods '],\n",
- " ['Merchant Wholesalers, Nondurable Goods '],\n",
- " ['Wholesale Electronic Markets and Agents and Brokers '],\n",
- " ['Motor Vehicle and Parts Dealers '],\n",
- " ['Furniture and Home Furnishings Stores '],\n",
- " ['Electronics and Appliance Stores '],\n",
- " ['Building Material and Garden Equipment and Supplies Dealers '],\n",
- " ['Food and Beverage Stores '],\n",
- " ['Health and Personal Care Stores '],\n",
- " ['Gasoline Stations '],\n",
- " ['Clothing and Clothing Accessories Stores '],\n",
- " ['Sporting Goods, Hobby, Musical Instrument, and Book Stores '],\n",
- " ['General Merchandise Stores '],\n",
- " ['Miscellaneous Store Retailers '],\n",
- " ['Nonstore Retailers '],\n",
- " ['Air Transportation'],\n",
- " ['Rail Transportation'],\n",
- " ['Water Transportation'],\n",
- " ['Truck Transportation'],\n",
- " ['Transit and Ground Passenger Transportation'],\n",
- " ['Pipeline Transportation'],\n",
- " ['Scenic and Sightseeing Transportation'],\n",
- " ['Support Activities for Transportation'],\n",
- " ['Postal Service'],\n",
- " ['Couriers and Messengers'],\n",
- " ['Warehousing and Storage'],\n",
- " ['Publishing Industries (except Internet)'],\n",
- " ['Motion Picture and Sound Recording Industries'],\n",
- " ['Broadcasting (except Internet)'],\n",
- " ['Telecommunications'],\n",
- " ['Data Processing, Hosting, and Related Services'],\n",
- " ['Other Information Services'],\n",
- " ['Monetary Authorities-Central Bank'],\n",
- " ['Credit Intermediation and Related Activities'],\n",
- " ['Securities, Commodity Contracts, and Other Financial Investments and Related Activities'],\n",
- " ['Insurance Carriers and Related Activities'],\n",
- " ['Funds, Trusts, and Other Financial Vehicles '],\n",
- " ['Real Estate'],\n",
- " ['Rental and Leasing Services'],\n",
- " ['Lessors of Nonfinancial Intangible Assets (except Copyrighted Works)'],\n",
- " ['Professional, Scientific, and Technical Services'],\n",
- " ['Management of Companies and Enterprises'],\n",
- " ['Administrative and Support Services'],\n",
- " ['Waste Management and Remediation Services'],\n",
- " ['Educational Services'],\n",
- " ['Ambulatory Health Care Services'],\n",
- " ['Hospitals'],\n",
- " ['Nursing and Residential Care Facilities'],\n",
- " ['Social Assistance'],\n",
- " ['Performing Arts, Spectator Sports, and Related Industries'],\n",
- " ['Museums, Historical Sites, and Similar Institutions'],\n",
- " ['Amusement, Gambling, and Recreation Industries'],\n",
- " ['Accommodation'],\n",
- " ['Food Services and Drinking Places'],\n",
- " ['Repair and Maintenance'],\n",
- " ['Personal and Laundry Services'],\n",
- " ['Religious, Grantmaking, Civic, Professional, and Similar Organizations'],\n",
- " ['Private Households'],\n",
- " ['Executive, Legislative, and Other General Government Support '],\n",
- " ['Justice, Public Order, and Safety Activities '],\n",
- " ['Administration of Human Resource Programs '],\n",
- " ['Administration of Environmental Quality Programs '],\n",
- " ['Administration of Housing Programs, Urban Planning, and Community Development '],\n",
- " ['Administration of Economic Programs '],\n",
- " ['Space Research and Technology '],\n",
- " ['National Security and International Affairs ']]"
- ]
- },
- "execution_count": 51,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "names"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {},
- "outputs": [],
- "source": [
- "df_ind = pd.read_parquet(path='/Users/me/data/american_community_survey/upload/2022_acs_pums_individual_people_united_states_first_tranche.parquet')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {},
- "outputs": [],
- "source": [
- "counts = df_ind['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].apply(lambda x: x[:3] if x else None).value_counts()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes\n",
- "PRF 126549\n",
- "RET 112195\n",
- "MED 109888\n",
- "EDU 100248\n",
- "MFG 98385\n",
- "ENT 94067\n",
- "CON 64671\n",
- "FIN 63735\n",
- "SRV 50004\n",
- "ADM 49945\n",
- "TRN 48243\n",
- "SCA 23852\n",
- "WHL 20993\n",
- "INF 19848\n",
- "AGR 16734\n",
- "UTL 8484\n",
- "MIL 7768\n",
- "Une 5492\n",
- "EXT 2678\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 42,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "counts"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 49,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Index(['PRF', 'RET', 'MED', 'EDU', 'MFG', 'ENT', 'CON', 'FIN', 'SRV', 'ADM',\n",
- " 'TRN', 'SCA', 'WHL', 'INF', 'AGR', 'UTL', 'MIL', 'Une', 'EXT'],\n",
- " dtype='object', name='North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes')"
- ]
- },
- "execution_count": 49,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "counts.index"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 47,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes\n",
- "EXT-Support Activities For Mining 1253\n",
- "EXT-Nonmetallic Mineral Mining And Quarrying 549\n",
- "EXT-Oil And Gas Extraction 330\n",
- "EXT-Metal Ore Mining 298\n",
- "EXT-Coal Mining 248\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 47,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_ind[df_ind['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].apply(lambda x: x.startswith('EXT') if x else False)]['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].value_counts()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 48,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes\n",
- "Unemployed, With No Work Experience In The Last 5 Years Or Earlier Or Never Worked 5492\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 48,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_ind[df_ind['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].apply(lambda x: x.startswith('Une') if x else False)]['North American Industry Classification System (NAICS) recode for 2018 and later based on 2017 NAICS codes'].value_counts()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(19, 17)"
- ]
- },
- "execution_count": 45,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "len(counts), len(names)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 52,
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# # https://usa.ipums.org/usa-action/variables/US2001A_1071#description_section\n",
- "industry_codes = {\n",
- " 'AGR': 'Agriculture, Forestry, Fishing and Hunting',\n",
- " 'EXT': 'Mining, Quarrying, and Oil and Gas Extraction',\n",
- " 'UTL': 'Utilities',\n",
- " 'CON': 'Construction',\n",
- " 'WHL': 'Wholesale Trade',\n",
- " 'INF': 'Information',\n",
- " 'FIN': 'Finance and Insurance',\n",
- " 'RET': 'Retail Trade',\n",
- " 'PRF': 'Professional, Scientific, and Technical Services',\n",
- " 'SRV': 'Management of Companies and Enterprises',\n",
- " 'ADM': 'Administrative and Support and Waste Management and Remediation Services',\n",
- " 'EDU': 'Educational Services',\n",
- " 'MED': 'Health Care and Social Assistance',\n",
- " 'ENT': 'Arts, Entertainment, and Recreation',\n",
- " 'MFG': 'Manufacturing', \n",
- " 'TRN': 'Transportation and Warehousing',\n",
- " 'SCA': 'Services',\n",
- " 'MIL': 'Public Administration',\n",
- " 'UNE': 'Unemployed, With No Work Experience In The Last 5 Years Or Earlier Or Never Worked'\n",
- "}\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": ".venv",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.11.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
- }
- ```
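One easy-to-miss detail in the notebook above: the PUMS recode column uses the mixed-case prefix `Une` (see `counts.index`), while the `industry_codes` dict keys it as `UNE`, so a direct dictionary lookup would miss those rows. A minimal sketch of a case-insensitive lookup, using a truncated stand-in for the full dict:

```python
# Truncated stand-in for the notebook's industry_codes mapping; only the
# keys needed for the illustration are included here.
industry_codes = {
    "AGR": "Agriculture, Forestry, Fishing and Hunting",
    "EXT": "Mining, Quarrying, and Oil and Gas Extraction",
    "UNE": "Unemployed, With No Work Experience In The Last 5 Years Or Earlier Or Never Worked",
}

def sector_title(prefix):
    # Upper-case the prefix so the 'Une' value seen in the PUMS column
    # still matches the 'UNE' dictionary key.
    return industry_codes.get(prefix.upper(), "Unknown prefix")

print(sector_title("Une"))
print(sector_title("AGR"))  # Agriculture, Forestry, Fishing and Hunting
```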
- Relative file path: generate_sql_with_mapped_values.py
- ```
- import pandas as pd
- import sys
- import os
- import json
- def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
-     type_char = folder_name.split("_")[1][0].lower()
-     folder_code = folder_name.split("_")[1][1:].upper()
-     human_readable_name = "individual_people" if type_char == "p" else "housing_units"
-     if folder_code == "US":
-         csv_code = csv_name.split("_")[1][1:].upper()
-         name = national_lookup.get(csv_code, "Unknown national code")
-     elif len(folder_code) == 2:
-         name = state_lookup.get(folder_code, "Unknown state code")
-     else:
-         raise ValueError(f"Invalid code: {folder_code}")
-     return f"{human_readable_name}_{name.replace(' ', '_')}".lower()
- def clean_enum_value(value):
-     value = value.replace("'", "")
-     value = value.replace("N/A", "Not applicable")
-     value = value.replace("/", " or ")
-     value = value.replace("(", "- ")
-     value = value.replace(")", "")
-     return value
- if len(sys.argv) < 3:
-     print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
-     sys.exit(1)
- parquet_database_path, data_dictionary_path = sys.argv[1:3]
- with open(data_dictionary_path, "r") as json_file:
-     data_dict = json.load(json_file)
- state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
- national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}
- df_csv_paths = pd.read_parquet(parquet_database_path)
- models_dir = "models/public_use_microdata_sample/generated/mapped_values"
- os.makedirs(models_dir, exist_ok=True)
- for csv_path in df_csv_paths["csv_path"]:
-     folder_name = os.path.basename(os.path.dirname(csv_path))
-     csv_name = os.path.basename(csv_path).split(".")[0]
-     materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)
-     df_headers = pd.read_csv(csv_path, nrows=0)
-     column_types = {column: 'VARCHAR' for column in df_headers.columns}
-     columns = ', '.join([f"'{col}': '{typ}'" for col, typ in column_types.items()])
-     sql_select_parts = ["SELECT"]
-     enum_creation_statements = []
-     for header in df_headers.columns:
-         if header in data_dict and "Values" in data_dict[header]:
-             # Mapping codes to labels using CASE statement
-             if any(['Integer weight' in value for value in data_dict[header]["Values"].values()]):
-                 print(f"Selecting unmapped {header} because it contains an 'Integer weight' value")
-                 sql_select_parts.append(f" {header},")
-             else:
-                 # Improved mapping with pretty printing
-                 col_info = data_dict.get(header, {"Description": header})
-                 description = col_info["Description"]
-                 value_mapping = "\n\t\t".join([
-                     f"WHEN '{clean_enum_value(code)}' THEN '{clean_enum_value(label)}'"
-                     for code, label in data_dict[header]["Values"].items()
-                 ])
-                 mapped_column = f"""CASE {header}\n\t\t{value_mapping}\n\tEND AS {header}"""
-                 sql_select_parts.append(f"\t{mapped_column},")
-         elif header in data_dict:
-             # Direct mapping for columns without "Values"
-             description = data_dict[header]["Description"].replace("'", "''")
-             sql_select_parts.append(f' {header},')
-         else:
-             sql_select_parts.append(f" {header},")
-     sql_select_parts[-1] = sql_select_parts[-1].rstrip(',')
-     sql_select_statement = "\n".join(sql_select_parts)
-     newline = "\n"
-     sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
- {newline.join(enum_creation_statements)}
- {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
- {sql_select_statement}
- FROM read_csv('{csv_path}',
- parallel=False,
- all_varchar=True,
- auto_detect=True)"""
-     sql_file_path = os.path.join(models_dir, f"{materialized_name}_mapped.sql")
-     with open(sql_file_path, "w") as sql_file:
-         sql_file.write(sql_content)
- ```
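`clean_enum_value` exists because the dictionary labels end up inside single-quoted SQL string literals in the generated CASE expressions; quotes, slashes, and parentheses in a label would otherwise break the SQL. A quick check of how it rewrites a typical label (note that replacement order matters: `N/A` is expanded before `/` becomes ` or `):

```python
def clean_enum_value(value):
    # Same replacements, in the same order, as the script above.
    value = value.replace("'", "")
    value = value.replace("N/A", "Not applicable")
    value = value.replace("/", " or ")
    value = value.replace("(", "- ")
    value = value.replace(")", "")
    return value

print(clean_enum_value("N/A (GQ/vacant)"))   # Not applicable - GQ or vacant
print(clean_enum_value("Driver's license"))  # Drivers license
```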
- Relative file path: generate_sql_with_renamed_columns.py
- ```
- import pandas as pd
- import duckdb
- import sys
- import os
- import json
- def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
-     # Extract the state code and type (P or H) from the folder name
-     type_char = folder_name.split("_")[1][0].lower()  # Assuming folder format is "csv_PXX" or "csv_HXX"
-     folder_code = folder_name.split("_")[1][1:].upper()
-     # Determine the human-readable name based on the type character
-     human_readable_name = "individual_people" if type_char == "p" else "housing_units"
-     if folder_code == "US":
-         # Get the national-level name from the lookup table
-         csv_code = csv_name.split("_")[1][1:].upper()
-         name = national_lookup.get(csv_code, "Unknown national code")
-         print(csv_code, name)
-     elif len(folder_code) == 2:
-         # Get the state name from the lookup table
-         name = state_lookup.get(folder_code, "Unknown state code")
-         print(folder_code, name)
-     else:
-         raise ValueError(f"Invalid code: {folder_code}")
-     # Return the formatted name
-     return f"{human_readable_name}_{name.replace(' ', '_')}".lower()
- if len(sys.argv) < 3:
-     print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
-     sys.exit(1)
- parquet_database_path, data_dictionary_path = sys.argv[1:3]
- # Load the data dictionary from the JSON file
- with open(data_dictionary_path, "r") as json_file:
-     data_dict = json.load(json_file)
- # Generate lookup table for state codes
- state_lookup = {
-     code: name
-     for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]
- }
- # Define short codes for first and second tranches of national-level data
- national_lookup = {
-     "USA": "United States first tranche",
-     "USB": "United States second tranche",
- }
- # Connect to DuckDB
- conn = duckdb.connect(database=":memory:", read_only=False)
- # Assuming the Parquet file contains paths to CSV files
- df_csv_paths = pd.read_parquet(parquet_database_path)
- models_dir = "models/public_use_microdata_sample/generated/renamed_columns"
- os.makedirs(models_dir, exist_ok=True)
- for csv_path in df_csv_paths["csv_path"]:
-     folder_name = os.path.basename(os.path.dirname(csv_path))
-     csv_name = os.path.basename(csv_path)
-     csv_name = csv_name.split(".")[0]
-     materialized_name = generate_materialized_name(
-         folder_name, csv_name, state_lookup, national_lookup
-     )
-     df_headers = pd.read_csv(csv_path, nrows=0)
-     sql_select_parts = ["SELECT"]
-     for header in df_headers.columns:
-         col_info = data_dict.get(header, {"Description": header})
-         description = col_info["Description"]
-         sql_select_parts.append(f' {header} AS "{description}",')
-     sql_select_parts[-1] = sql_select_parts[-1].rstrip(",")
-     sql_select_statement = "\n".join(sql_select_parts)
-     sql_content = f"""-- SQL transformation for {os.path.basename(csv_path)} generated by {os.path.basename(__file__)}
- {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
- {sql_select_statement}
- FROM read_csv_auto('{csv_path}')"""
-     sql_file_path = os.path.join(models_dir, f"{materialized_name}_renamed_columns.sql")
-     with open(sql_file_path, "w") as sql_file:
-         sql_file.write(sql_content)
- ```
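The column-renaming loop above reduces to a small pattern: alias each raw PUMS header to its data-dictionary `Description`, fall back to the header itself when it is undocumented, and strip the comma from the final entry. A toy run with a hypothetical two-column header set (`AGEP` is a real PUMS column; `XYZ` stands in for an unknown one):

```python
# Toy stand-ins for the PUMS data dictionary and a CSV header row.
data_dict = {"AGEP": {"Description": "Age"}}
headers = ["AGEP", "XYZ"]

sql_select_parts = ["SELECT"]
for header in headers:
    # Fall back to the raw header when the dictionary has no entry.
    description = data_dict.get(header, {"Description": header})["Description"]
    sql_select_parts.append(f'    {header} AS "{description}",')
sql_select_parts[-1] = sql_select_parts[-1].rstrip(",")
print("\n".join(sql_select_parts))
```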
- Relative file path: generate_sql_schemas_for_extracted_csv_files.py
- ```
- import pandas as pd
- import sys
- import os
- import json
- def generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup):
-     type_char = folder_name.split("_")[1][0].lower()
-     folder_code = folder_name.split("_")[1][1:].upper()
-     human_readable_name = "individual_people" if type_char == "p" else "housing_units"
-     if folder_code == "US":
-         csv_code = csv_name.split("_")[1][1:].upper()
-         name = national_lookup.get(csv_code, "Unknown national code")
-     elif len(folder_code) == 2:
-         name = state_lookup.get(folder_code, "Unknown state code")
-     else:
-         raise ValueError(f"Invalid code: {folder_code}")
-     return f"{human_readable_name}_{name.replace(' ', '_')}".lower()
- if len(sys.argv) < 3:
-     print("Usage: python script.py <parquet_database_path> <PUMS_data_dictionary_path>")
-     sys.exit(1)
- parquet_database_path, data_dictionary_path = sys.argv[1:3]
- with open(data_dictionary_path, "r") as json_file:
-     data_dict = json.load(json_file)
- state_lookup = {code: name for name, code in [x.split("/") for x in data_dict["ST"]["Values"].values()]}
- national_lookup = {"USA": "United States first tranche", "USB": "United States second tranche"}
- df_csv_paths = pd.read_parquet(parquet_database_path)
- models_dir = "models/public_use_microdata_sample/generated"
- os.makedirs(models_dir, exist_ok=True)
- for csv_path in df_csv_paths["csv_path"]:
-     folder_name = os.path.basename(os.path.dirname(csv_path))
-     csv_name = os.path.basename(csv_path).replace(".csv", "")
-     materialized_name = generate_materialized_name(folder_name, csv_name, state_lookup, national_lookup)
-     df_headers = pd.read_csv(csv_path, nrows=0)
-     enum_creation_statements = []
-     sql_select_parts = []
-     for header in df_headers.columns:
-         if header in data_dict and "Values" in data_dict[header]:
-             enum_values = [f"'{x}'" for x in data_dict[header]["Values"].values()]
-             enum_name = f"{header}_enum"
-             enum_creation_statements.append(f"CREATE TYPE {enum_name} AS ENUM ({','.join(enum_values)});")
-             key = 'Description'
-             mapped_column = f'CAST({header} AS {enum_name}) AS "{data_dict[header][key]}"'
-             sql_select_parts.append(f" {mapped_column}")
-         elif header in data_dict:
-             description = data_dict[header]["Description"].replace("'", "''")
-             sql_select_parts.append(f' {header} AS "{description}"')
-         else:
-             sql_select_parts.append(f" {header}")
-     sql_select_statement = ",\n".join(sql_select_parts)
-     newline = "\n"
-     sql_content = f"""-- SQL transformation for {csv_name} generated by {os.path.basename(__file__)}
- {newline.join(enum_creation_statements)}
- {{{{ config(materialized='external', location=var('output_path') + '/{materialized_name}.parquet') }}}}
- SELECT
- {sql_select_statement}
- FROM read_csv_auto('{csv_path}');
- """
-     sql_file_path = os.path.join(models_dir, f"{materialized_name}.sql")
-     with open(sql_file_path, "w") as sql_file:
-         sql_file.write(sql_content)
- ```
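For dictionary-backed columns, the schema generator emits a `CREATE TYPE ... AS ENUM` statement ahead of the model body and CASTs the raw column to that type. The DDL string it builds for one column can be reproduced in isolation (toy `SEX` values for illustration):

```python
# Toy value mapping for a single dictionary column; the script derives
# this from data_dict[header]["Values"].
header = "SEX"
values = {"1": "Male", "2": "Female"}

enum_values = [f"'{x}'" for x in values.values()]
enum_name = f"{header}_enum"
ddl = f"CREATE TYPE {enum_name} AS ENUM ({','.join(enum_values)});"
print(ddl)  # CREATE TYPE SEX_enum AS ENUM ('Male','Female');
```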
- Relative file path: parse_data_dictionary.py
- ```
- import requests
- import csv
- from io import StringIO
- import json
- import sys
- def csv_to_json_dictionary(url):
-     response = requests.get(url)
-     response.raise_for_status()  # Ensure the request was successful
-     # Read the CSV content into a dictionary structure
-     data_dictionary = {}
-     reader = csv.reader(StringIO(response.text))
-     for row in reader:
-         if row[0] == "NAME":
-             # Initialize the variable entry with its details
-             data_dictionary[row[1]] = {
-                 "Type": row[2],
-                 "Length": row[3],
-                 "Description": row[4],
-                 "Values": {},
-             }
-         elif row[0] == "VAL" and row[1] in data_dictionary:
-             # Append value mappings to the variable
-             data_dictionary[row[1]]["Values"][row[4]] = row[6] if len(row) > 6 else ""
-     # Return the constructed dictionary
-     return data_dictionary
- if __name__ == "__main__":
-     if len(sys.argv) < 2:
-         print("Usage: python script.py <data_dictionary_url>")
-         sys.exit(1)
-     url = sys.argv[1]
-     data_dict = csv_to_json_dictionary(url)
-     # Specify the JSON file name
-     json_file_name = "PUMS_Data_Dictionary.json"
-     with open(json_file_name, "w") as json_file:
-         json.dump(data_dict, json_file, indent=4)
-     print(f"Data dictionary processed and saved to {json_file_name}.")
- ```
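`parse_data_dictionary.py` pivots the flat NAME/VAL rows of the PUMS dictionary CSV into nested JSON. The same parsing logic can be exercised on an in-memory sample shaped the way the script's reader expects (variable details on `NAME` rows; code in column 4 and label in column 6 of `VAL` rows), without fetching the real file by URL:

```python
import csv
from io import StringIO

# Hypothetical sample rows matching the column positions the script reads;
# the real script downloads the full PUMS dictionary CSV instead.
sample = (
    "NAME,SEX,C,1,Sex\n"
    "VAL,SEX,C,1,1,1,Male\n"
    "VAL,SEX,C,1,2,2,Female\n"
)

data_dictionary = {}
for row in csv.reader(StringIO(sample)):
    if row[0] == "NAME":
        # NAME rows open a new variable entry.
        data_dictionary[row[1]] = {
            "Type": row[2],
            "Length": row[3],
            "Description": row[4],
            "Values": {},
        }
    elif row[0] == "VAL" and row[1] in data_dictionary:
        # VAL rows attach a code-to-label pair to that variable.
        data_dictionary[row[1]]["Values"][row[4]] = row[6] if len(row) > 6 else ""

print(data_dictionary["SEX"]["Values"])  # {'1': 'Male', '2': 'Female'}
```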
- and take this current file structure:
- ```
- .
- ├── ahrq.gov
- │   ├── generated
- │   │   └── with_types
- │   │       ├── syhdr_commercial_inpatient_2016.sql
- │   │       ├── syhdr_commercial_outpatient_2016.sql
- │   │       ├── syhdr_commercial_person_2016.sql
- │   │       ├── syhdr_commercial_pharmacy_2016.sql
- │   │       ├── syhdr_medicaid_inpatient_2016.sql
- │   │       ├── syhdr_medicaid_outpatient_2016.sql
- │   │       ├── syhdr_medicaid_person_2016.sql
- │   │       ├── syhdr_medicaid_pharmacy_2016.sql
- │   │       ├── syhdr_medicaid_provider_2016.sql
- │   │       ├── syhdr_medicare_inpatient_2016.sql
- │   │       ├── syhdr_medicare_outpatient_2016.sql
- │   │       ├── syhdr_medicare_person_2016.sql
- │   │       ├── syhdr_medicare_pharmacy_2016.sql
- │   │       └── syhdr_medicare_provider_2016.sql
- │   └── sdoh
- │       ├── sdoh_county.sql
- │       ├── sdoh_tract.sql
- │       └── sdoh_zipcode.sql
- ├── bls.gov
- │   ├── consumer_price_index.sql
- │   └── download_consumer_price_index.py
- ├── config.yml
- ├── figures
- │   ├── insurance_plan_payment_histogram.sql
- │   └── insurance_plan_payment_histogram_inflation_adjusted.sql
- ├── mitre.org
- │   └── synthea.sql
- └── sources.yml
- 8 directories, 24 files
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_pharmacy_2016.sql
- ```
- -- SQL model for syhdr_medicaid_pharmacy_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- PHMCY_CLM_NUM::NUMERIC AS PHMCY_CLM_NUM,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- LINE_NBR::VARCHAR,
- FILL_DT::DATE AS FILL_DT,
- SYNTHETIC_DRUG_ID::VARCHAR,
- GENERIC_DRUG_NAME::VARCHAR AS GENERIC_DRUG_NAME,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_medicaid_pharmacy_2016.CSV', header=True, null_padding=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_inpatient_2016.sql
- ```
- -- SQL model for syhdr_medicare_inpatient_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- FACILITY_ID::UBIGINT AS FACILITY_ID,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- AT_SPCLTY::VARCHAR AS AT_SPCLTY,
- SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
- SRVC_END_DATE::DATE AS SRVC_END_DATE,
- LOS::UINTEGER AS LOS,
- ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
- TOB_CD::VARCHAR AS TOB_CD,
- CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
- DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
- PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
- PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
- ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
- ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
- ICD_DX_CD_3::VARCHAR,
- ICD_DX_CD_4::VARCHAR,
- ICD_DX_CD_5::VARCHAR,
- ICD_DX_CD_6::VARCHAR,
- ICD_DX_CD_7::VARCHAR,
- ICD_DX_CD_8::VARCHAR,
- ICD_DX_CD_9::VARCHAR,
- ICD_DX_CD_10::VARCHAR,
- ICD_DX_CD_11::VARCHAR,
- ICD_DX_CD_12::VARCHAR,
- ICD_DX_CD_13::VARCHAR,
- ICD_DX_CD_14::VARCHAR,
- ICD_DX_CD_15::VARCHAR,
- ICD_DX_CD_16::VARCHAR,
- ICD_DX_CD_17::VARCHAR,
- ICD_DX_CD_18::VARCHAR,
- ICD_DX_CD_19::VARCHAR,
- ICD_DX_CD_20::VARCHAR,
- ICD_DX_CD_21::VARCHAR,
- ICD_DX_CD_22::VARCHAR,
- ICD_DX_CD_23::VARCHAR,
- ICD_DX_CD_24::VARCHAR,
- ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
- ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
- ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
- ICD_PRCDR_CD_3::VARCHAR,
- ICD_PRCDR_CD_4::VARCHAR,
- ICD_PRCDR_CD_5::VARCHAR,
- ICD_PRCDR_CD_6::VARCHAR,
- ICD_PRCDR_CD_7::VARCHAR,
- ICD_PRCDR_CD_8::VARCHAR,
- ICD_PRCDR_CD_9::VARCHAR,
- ICD_PRCDR_CD_10::VARCHAR,
- ICD_PRCDR_CD_11::VARCHAR,
- ICD_PRCDR_CD_12::VARCHAR,
- ICD_PRCDR_CD_13::VARCHAR,
- ICD_PRCDR_CD_14::VARCHAR,
- ICD_PRCDR_CD_15::VARCHAR,
- ICD_PRCDR_CD_16::VARCHAR,
- ICD_PRCDR_CD_17::VARCHAR,
- ICD_PRCDR_CD_18::VARCHAR,
- ICD_PRCDR_CD_19::VARCHAR,
- ICD_PRCDR_CD_20::VARCHAR,
- ICD_PRCDR_CD_21::VARCHAR,
- ICD_PRCDR_CD_22::VARCHAR,
- ICD_PRCDR_CD_23::VARCHAR,
- ICD_PRCDR_CD_24::VARCHAR,
- ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
- CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
- CPT_PRCDR_CD_2::VARCHAR,
- CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
- CPT_PRCDR_CD_4::VARCHAR,
- CPT_PRCDR_CD_5::VARCHAR,
- CPT_PRCDR_CD_6::VARCHAR,
- CPT_PRCDR_CD_7::VARCHAR,
- CPT_PRCDR_CD_8::VARCHAR,
- CPT_PRCDR_CD_9::VARCHAR,
- CPT_PRCDR_CD_10::VARCHAR,
- CPT_PRCDR_CD_11::VARCHAR,
- CPT_PRCDR_CD_12::VARCHAR,
- CPT_PRCDR_CD_13::VARCHAR,
- CPT_PRCDR_CD_14::VARCHAR,
- CPT_PRCDR_CD_15::VARCHAR,
- CPT_PRCDR_CD_16::VARCHAR,
- CPT_PRCDR_CD_17::VARCHAR,
- CPT_PRCDR_CD_18::VARCHAR,
- CPT_PRCDR_CD_19::VARCHAR,
- CPT_PRCDR_CD_20::VARCHAR,
- CPT_PRCDR_CD_21::VARCHAR,
- CPT_PRCDR_CD_22::VARCHAR,
- CPT_PRCDR_CD_23::VARCHAR,
- CPT_PRCDR_CD_24::VARCHAR,
- CPT_PRCDR_CD_25::VARCHAR,
- CPT_PRCDR_CD_26::VARCHAR,
- CPT_PRCDR_CD_27::VARCHAR,
- CPT_PRCDR_CD_28::VARCHAR,
- CPT_PRCDR_CD_29::VARCHAR,
- CPT_PRCDR_CD_30::VARCHAR,
- CPT_PRCDR_CD_31::VARCHAR,
- CPT_PRCDR_CD_32::VARCHAR,
- CPT_PRCDR_CD_33::VARCHAR,
- CPT_PRCDR_CD_34::VARCHAR,
- CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_medicare_inpatient_2016.CSV', header=True, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_provider_2016.sql
- ```
- -- SQL model for syhdr_medicare_provider_2016.csv
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- Facility_ID::VARCHAR,
- Prvdr_Ctgry_Cd::VARCHAR,
- Prvdr_Ownrshp_Cd::VARCHAR,
- Prvdr_Prtcptn_Cd::VARCHAR
- FROM read_csv('~/data/syh_dr/syhdr_medicare_provider_2016.csv', header=True, null_padding=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_outpatient_2016.sql
- ```
- -- SQL model for syhdr_commercial_outpatient_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- FACILITY_ID::UBIGINT AS FACILITY_ID,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- AT_SPCLTY::VARCHAR AS AT_SPCLTY,
- SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
- SRVC_END_DATE::DATE AS SRVC_END_DATE,
- LOS::UINTEGER AS LOS,
- ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
- TOB_CD::VARCHAR AS TOB_CD,
- CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
- DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
- PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
- PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
- ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
- ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
- ICD_DX_CD_3::VARCHAR,
- ICD_DX_CD_4::VARCHAR,
- ICD_DX_CD_5::VARCHAR,
- ICD_DX_CD_6::VARCHAR,
- ICD_DX_CD_7::VARCHAR,
- ICD_DX_CD_8::VARCHAR,
- ICD_DX_CD_9::VARCHAR,
- ICD_DX_CD_10::VARCHAR,
- ICD_DX_CD_11::VARCHAR,
- ICD_DX_CD_12::VARCHAR,
- ICD_DX_CD_13::VARCHAR,
- ICD_DX_CD_14::VARCHAR,
- ICD_DX_CD_15::VARCHAR,
- ICD_DX_CD_16::VARCHAR,
- ICD_DX_CD_17::VARCHAR,
- ICD_DX_CD_18::VARCHAR,
- ICD_DX_CD_19::VARCHAR,
- ICD_DX_CD_20::VARCHAR,
- ICD_DX_CD_21::VARCHAR,
- ICD_DX_CD_22::VARCHAR,
- ICD_DX_CD_23::VARCHAR,
- ICD_DX_CD_24::VARCHAR,
- ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
- ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
- ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
- ICD_PRCDR_CD_3::VARCHAR,
- ICD_PRCDR_CD_4::VARCHAR,
- ICD_PRCDR_CD_5::VARCHAR,
- ICD_PRCDR_CD_6::VARCHAR,
- ICD_PRCDR_CD_7::VARCHAR,
- ICD_PRCDR_CD_8::VARCHAR,
- ICD_PRCDR_CD_9::VARCHAR,
- ICD_PRCDR_CD_10::VARCHAR,
- ICD_PRCDR_CD_11::VARCHAR,
- ICD_PRCDR_CD_12::VARCHAR,
- ICD_PRCDR_CD_13::VARCHAR,
- ICD_PRCDR_CD_14::VARCHAR,
- ICD_PRCDR_CD_15::VARCHAR,
- ICD_PRCDR_CD_16::VARCHAR,
- ICD_PRCDR_CD_17::VARCHAR,
- ICD_PRCDR_CD_18::VARCHAR,
- ICD_PRCDR_CD_19::VARCHAR,
- ICD_PRCDR_CD_20::VARCHAR,
- ICD_PRCDR_CD_21::VARCHAR,
- ICD_PRCDR_CD_22::VARCHAR,
- ICD_PRCDR_CD_23::VARCHAR,
- ICD_PRCDR_CD_24::VARCHAR,
- ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
- CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
- CPT_PRCDR_CD_2::VARCHAR,
- CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
- CPT_PRCDR_CD_4::VARCHAR,
- CPT_PRCDR_CD_5::VARCHAR,
- CPT_PRCDR_CD_6::VARCHAR,
- CPT_PRCDR_CD_7::VARCHAR,
- CPT_PRCDR_CD_8::VARCHAR,
- CPT_PRCDR_CD_9::VARCHAR,
- CPT_PRCDR_CD_10::VARCHAR,
- CPT_PRCDR_CD_11::VARCHAR,
- CPT_PRCDR_CD_12::VARCHAR,
- CPT_PRCDR_CD_13::VARCHAR,
- CPT_PRCDR_CD_14::VARCHAR,
- CPT_PRCDR_CD_15::VARCHAR,
- CPT_PRCDR_CD_16::VARCHAR,
- CPT_PRCDR_CD_17::VARCHAR,
- CPT_PRCDR_CD_18::VARCHAR,
- CPT_PRCDR_CD_19::VARCHAR,
- CPT_PRCDR_CD_20::VARCHAR,
- CPT_PRCDR_CD_21::VARCHAR,
- CPT_PRCDR_CD_22::VARCHAR,
- CPT_PRCDR_CD_23::VARCHAR,
- CPT_PRCDR_CD_24::VARCHAR,
- CPT_PRCDR_CD_25::VARCHAR,
- CPT_PRCDR_CD_26::VARCHAR,
- CPT_PRCDR_CD_27::VARCHAR,
- CPT_PRCDR_CD_28::VARCHAR,
- CPT_PRCDR_CD_29::VARCHAR,
- CPT_PRCDR_CD_30::VARCHAR,
- CPT_PRCDR_CD_31::VARCHAR,
- CPT_PRCDR_CD_32::VARCHAR,
- CPT_PRCDR_CD_33::VARCHAR,
- CPT_PRCDR_CD_34::VARCHAR,
- CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_commercial_outpatient_2016.CSV', header=true, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
- ```
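Each claims model scrubs its payment columns the same way: strip the dollar sign and thousands separators with nested `replace` calls, then cast to FLOAT. The same normalization can be mirrored in plain Python (the function name here is illustrative, not part of the dbt project):

```python
def clean_currency(raw):
    """Mirror the SQL step: replace(replace(x, '$', ''), ',', '')::FLOAT."""
    if raw is None:
        return None  # NULLs pass through, as in the SQL cast
    return float(raw.replace("$", "").replace(",", ""))

print(clean_currency("$12,345.67"))  # → 12345.67
```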
- Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_person_2016.sql
- ```
- -- SQL model for syhdr_commercial_person_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- AGE_LOW::NUMERIC AS AGE_LOW,
- AGE_HIGH::NUMERIC AS AGE_HIGH,
- SEX_IDENT_CD::VARCHAR AS SEX_IDENT_CD,
- STATE_CD::VARCHAR AS STATE_CD,
- COUNTY_FIPS_CD::VARCHAR AS COUNTY_FIPS_CD,
- ZIP_CD::VARCHAR AS ZIP_CD,
- PHRMCY_CVRG_1::NUMERIC AS PHRMCY_CVRG_1,
- PHRMCY_CVRG_2::VARCHAR,
- PHRMCY_CVRG_3::VARCHAR,
- PHRMCY_CVRG_4::VARCHAR,
- PHRMCY_CVRG_5::VARCHAR,
- PHRMCY_CVRG_6::VARCHAR,
- PHRMCY_CVRG_7::VARCHAR,
- PHRMCY_CVRG_8::VARCHAR,
- PHRMCY_CVRG_9::VARCHAR,
- PHRMCY_CVRG_10::VARCHAR,
- PHRMCY_CVRG_11::VARCHAR,
- PHRMCY_CVRG_12::NUMERIC AS PHRMCY_CVRG_12,
- CMRCL_INSRC_1::NUMERIC AS CMRCL_INSRC_1,
- CMRCL_INSRC_2::VARCHAR,
- CMRCL_INSRC_3::VARCHAR,
- CMRCL_INSRC_4::VARCHAR,
- CMRCL_INSRC_5::VARCHAR,
- CMRCL_INSRC_6::VARCHAR,
- CMRCL_INSRC_7::VARCHAR,
- CMRCL_INSRC_8::VARCHAR,
- CMRCL_INSRC_9::VARCHAR,
- CMRCL_INSRC_10::VARCHAR,
- CMRCL_INSRC_11::VARCHAR,
- CMRCL_INSRC_12::NUMERIC AS CMRCL_INSRC_12
- FROM read_csv('~/data/syh_dr/syhdr_commercial_person_2016.CSV', header=true, null_padding=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_person_2016.sql
- ```
- -- SQL model for syhdr_medicaid_person_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- MCAID_BENE_ID::UBIGINT AS MCAID_BENE_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- AGE_LOW::NUMERIC AS AGE_LOW,
- AGE_HIGH::NUMERIC AS AGE_HIGH,
- SEX_IDENT_CD::VARCHAR AS SEX_IDENT_CD,
- RACE_CD::VARCHAR AS RACE_CD,
- MCAID_SBMTTG_ST_CD::VARCHAR AS MCAID_SBMTTG_ST_CD,
- STATE_CD::VARCHAR AS STATE_CD,
- COUNTY_FIPS_CD::VARCHAR AS COUNTY_FIPS_CD,
- ZIP_CD::VARCHAR AS ZIP_CD,
- RSN_ENRLMT_CD::VARCHAR AS RSN_ENRLMT_CD,
- MDCD_ENRLMT_1::NUMERIC AS MDCD_ENRLMT_1,
- MDCD_ENRLMT_2::VARCHAR,
- MDCD_ENRLMT_3::VARCHAR,
- MDCD_ENRLMT_4::VARCHAR,
- MDCD_ENRLMT_5::VARCHAR,
- MDCD_ENRLMT_6::VARCHAR,
- MDCD_ENRLMT_7::VARCHAR,
- MDCD_ENRLMT_8::VARCHAR,
- MDCD_ENRLMT_9::VARCHAR,
- MDCD_ENRLMT_10::VARCHAR,
- MDCD_ENRLMT_11::VARCHAR,
- MDCD_ENRLMT_12::NUMERIC AS MDCD_ENRLMT_12,
- MDCD_MCO_ENRLMT_1::NUMERIC AS MDCD_MCO_ENRLMT_1,
- MDCD_MCO_ENRLMT_2::VARCHAR,
- MDCD_MCO_ENRLMT_3::VARCHAR,
- MDCD_MCO_ENRLMT_4::VARCHAR,
- MDCD_MCO_ENRLMT_5::VARCHAR,
- MDCD_MCO_ENRLMT_6::VARCHAR,
- MDCD_MCO_ENRLMT_7::VARCHAR,
- MDCD_MCO_ENRLMT_8::VARCHAR,
- MDCD_MCO_ENRLMT_9::VARCHAR,
- MDCD_MCO_ENRLMT_10::VARCHAR,
- MDCD_MCO_ENRLMT_11::VARCHAR,
- MDCD_MCO_ENRLMT_12::NUMERIC AS MDCD_MCO_ENRLMT_12,
- MDCD_CHIP_ENRLMT::NUMERIC AS MDCD_CHIP_ENRLMT,
- RSTRCTD_BNFTS_IND::VARCHAR,
- DUAL_ELGBL_1::NUMERIC AS DUAL_ELGBL_1,
- DUAL_ELGBL_2::VARCHAR,
- DUAL_ELGBL_3::VARCHAR,
- DUAL_ELGBL_4::VARCHAR,
- DUAL_ELGBL_5::VARCHAR,
- DUAL_ELGBL_6::VARCHAR,
- DUAL_ELGBL_7::VARCHAR,
- DUAL_ELGBL_8::VARCHAR,
- DUAL_ELGBL_9::VARCHAR,
- DUAL_ELGBL_10::VARCHAR,
- DUAL_ELGBL_11::VARCHAR,
- DUAL_ELGBL_12::NUMERIC AS DUAL_ELGBL_12
- FROM read_csv('~/data/syh_dr/syhdr_medicaid_person_2016.CSV', header=true, null_padding=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_inpatient_2016.sql
- ```
- -- SQL model for syhdr_commercial_inpatient_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- FACILITY_ID::UBIGINT AS FACILITY_ID,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- AT_SPCLTY::VARCHAR AS AT_SPCLTY,
- SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
- SRVC_END_DATE::DATE AS SRVC_END_DATE,
- LOS::UINTEGER AS LOS,
- ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
- TOB_CD::VARCHAR AS TOB_CD,
- CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
- DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
- PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
- PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
- ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
- ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
- ICD_DX_CD_3::VARCHAR,
- ICD_DX_CD_4::VARCHAR,
- ICD_DX_CD_5::VARCHAR,
- ICD_DX_CD_6::VARCHAR,
- ICD_DX_CD_7::VARCHAR,
- ICD_DX_CD_8::VARCHAR,
- ICD_DX_CD_9::VARCHAR,
- ICD_DX_CD_10::VARCHAR,
- ICD_DX_CD_11::VARCHAR,
- ICD_DX_CD_12::VARCHAR,
- ICD_DX_CD_13::VARCHAR,
- ICD_DX_CD_14::VARCHAR,
- ICD_DX_CD_15::VARCHAR,
- ICD_DX_CD_16::VARCHAR,
- ICD_DX_CD_17::VARCHAR,
- ICD_DX_CD_18::VARCHAR,
- ICD_DX_CD_19::VARCHAR,
- ICD_DX_CD_20::VARCHAR,
- ICD_DX_CD_21::VARCHAR,
- ICD_DX_CD_22::VARCHAR,
- ICD_DX_CD_23::VARCHAR,
- ICD_DX_CD_24::VARCHAR,
- ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
- ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
- ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
- ICD_PRCDR_CD_3::VARCHAR,
- ICD_PRCDR_CD_4::VARCHAR,
- ICD_PRCDR_CD_5::VARCHAR,
- ICD_PRCDR_CD_6::VARCHAR,
- ICD_PRCDR_CD_7::VARCHAR,
- ICD_PRCDR_CD_8::VARCHAR,
- ICD_PRCDR_CD_9::VARCHAR,
- ICD_PRCDR_CD_10::VARCHAR,
- ICD_PRCDR_CD_11::VARCHAR,
- ICD_PRCDR_CD_12::VARCHAR,
- ICD_PRCDR_CD_13::VARCHAR,
- ICD_PRCDR_CD_14::VARCHAR,
- ICD_PRCDR_CD_15::VARCHAR,
- ICD_PRCDR_CD_16::VARCHAR,
- ICD_PRCDR_CD_17::VARCHAR,
- ICD_PRCDR_CD_18::VARCHAR,
- ICD_PRCDR_CD_19::VARCHAR,
- ICD_PRCDR_CD_20::VARCHAR,
- ICD_PRCDR_CD_21::VARCHAR,
- ICD_PRCDR_CD_22::VARCHAR,
- ICD_PRCDR_CD_23::VARCHAR,
- ICD_PRCDR_CD_24::VARCHAR,
- ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
- CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
- CPT_PRCDR_CD_2::VARCHAR,
- CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
- CPT_PRCDR_CD_4::VARCHAR,
- CPT_PRCDR_CD_5::VARCHAR,
- CPT_PRCDR_CD_6::VARCHAR,
- CPT_PRCDR_CD_7::VARCHAR,
- CPT_PRCDR_CD_8::VARCHAR,
- CPT_PRCDR_CD_9::VARCHAR,
- CPT_PRCDR_CD_10::VARCHAR,
- CPT_PRCDR_CD_11::VARCHAR,
- CPT_PRCDR_CD_12::VARCHAR,
- CPT_PRCDR_CD_13::VARCHAR,
- CPT_PRCDR_CD_14::VARCHAR,
- CPT_PRCDR_CD_15::VARCHAR,
- CPT_PRCDR_CD_16::VARCHAR,
- CPT_PRCDR_CD_17::VARCHAR,
- CPT_PRCDR_CD_18::VARCHAR,
- CPT_PRCDR_CD_19::VARCHAR,
- CPT_PRCDR_CD_20::VARCHAR,
- CPT_PRCDR_CD_21::VARCHAR,
- CPT_PRCDR_CD_22::VARCHAR,
- CPT_PRCDR_CD_23::VARCHAR,
- CPT_PRCDR_CD_24::VARCHAR,
- CPT_PRCDR_CD_25::VARCHAR,
- CPT_PRCDR_CD_26::VARCHAR,
- CPT_PRCDR_CD_27::VARCHAR,
- CPT_PRCDR_CD_28::VARCHAR,
- CPT_PRCDR_CD_29::VARCHAR,
- CPT_PRCDR_CD_30::VARCHAR,
- CPT_PRCDR_CD_31::VARCHAR,
- CPT_PRCDR_CD_32::VARCHAR,
- CPT_PRCDR_CD_33::VARCHAR,
- CPT_PRCDR_CD_34::VARCHAR,
- CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_commercial_inpatient_2016.CSV', header=true, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_inpatient_2016.sql
- ```
- -- SQL model for syhdr_medicaid_inpatient_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- FACILITY_ID::UBIGINT AS FACILITY_ID,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- AT_SPCLTY::VARCHAR AS AT_SPCLTY,
- SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
- SRVC_END_DATE::DATE AS SRVC_END_DATE,
- LOS::UINTEGER AS LOS,
- ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
- TOB_CD::VARCHAR AS TOB_CD,
- CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
- DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
- PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
- PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
- ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
- ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
- ICD_DX_CD_3::VARCHAR,
- ICD_DX_CD_4::VARCHAR,
- ICD_DX_CD_5::VARCHAR,
- ICD_DX_CD_6::VARCHAR,
- ICD_DX_CD_7::VARCHAR,
- ICD_DX_CD_8::VARCHAR,
- ICD_DX_CD_9::VARCHAR,
- ICD_DX_CD_10::VARCHAR,
- ICD_DX_CD_11::VARCHAR,
- ICD_DX_CD_12::VARCHAR,
- ICD_DX_CD_13::VARCHAR,
- ICD_DX_CD_14::VARCHAR,
- ICD_DX_CD_15::VARCHAR,
- ICD_DX_CD_16::VARCHAR,
- ICD_DX_CD_17::VARCHAR,
- ICD_DX_CD_18::VARCHAR,
- ICD_DX_CD_19::VARCHAR,
- ICD_DX_CD_20::VARCHAR,
- ICD_DX_CD_21::VARCHAR,
- ICD_DX_CD_22::VARCHAR,
- ICD_DX_CD_23::VARCHAR,
- ICD_DX_CD_24::VARCHAR,
- ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
- ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
- ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
- ICD_PRCDR_CD_3::VARCHAR,
- ICD_PRCDR_CD_4::VARCHAR,
- ICD_PRCDR_CD_5::VARCHAR,
- ICD_PRCDR_CD_6::VARCHAR,
- ICD_PRCDR_CD_7::VARCHAR,
- ICD_PRCDR_CD_8::VARCHAR,
- ICD_PRCDR_CD_9::VARCHAR,
- ICD_PRCDR_CD_10::VARCHAR,
- ICD_PRCDR_CD_11::VARCHAR,
- ICD_PRCDR_CD_12::VARCHAR,
- ICD_PRCDR_CD_13::VARCHAR,
- ICD_PRCDR_CD_14::VARCHAR,
- ICD_PRCDR_CD_15::VARCHAR,
- ICD_PRCDR_CD_16::VARCHAR,
- ICD_PRCDR_CD_17::VARCHAR,
- ICD_PRCDR_CD_18::VARCHAR,
- ICD_PRCDR_CD_19::VARCHAR,
- ICD_PRCDR_CD_20::VARCHAR,
- ICD_PRCDR_CD_21::VARCHAR,
- ICD_PRCDR_CD_22::VARCHAR,
- ICD_PRCDR_CD_23::VARCHAR,
- ICD_PRCDR_CD_24::VARCHAR,
- ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
- CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
- CPT_PRCDR_CD_2::VARCHAR,
- CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
- CPT_PRCDR_CD_4::VARCHAR,
- CPT_PRCDR_CD_5::VARCHAR,
- CPT_PRCDR_CD_6::VARCHAR,
- CPT_PRCDR_CD_7::VARCHAR,
- CPT_PRCDR_CD_8::VARCHAR,
- CPT_PRCDR_CD_9::VARCHAR,
- CPT_PRCDR_CD_10::VARCHAR,
- CPT_PRCDR_CD_11::VARCHAR,
- CPT_PRCDR_CD_12::VARCHAR,
- CPT_PRCDR_CD_13::VARCHAR,
- CPT_PRCDR_CD_14::VARCHAR,
- CPT_PRCDR_CD_15::VARCHAR,
- CPT_PRCDR_CD_16::VARCHAR,
- CPT_PRCDR_CD_17::VARCHAR,
- CPT_PRCDR_CD_18::VARCHAR,
- CPT_PRCDR_CD_19::VARCHAR,
- CPT_PRCDR_CD_20::VARCHAR,
- CPT_PRCDR_CD_21::VARCHAR,
- CPT_PRCDR_CD_22::VARCHAR,
- CPT_PRCDR_CD_23::VARCHAR,
- CPT_PRCDR_CD_24::VARCHAR,
- CPT_PRCDR_CD_25::VARCHAR,
- CPT_PRCDR_CD_26::VARCHAR,
- CPT_PRCDR_CD_27::VARCHAR,
- CPT_PRCDR_CD_28::VARCHAR,
- CPT_PRCDR_CD_29::VARCHAR,
- CPT_PRCDR_CD_30::VARCHAR,
- CPT_PRCDR_CD_31::VARCHAR,
- CPT_PRCDR_CD_32::VARCHAR,
- CPT_PRCDR_CD_33::VARCHAR,
- CPT_PRCDR_CD_34::VARCHAR,
- CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_medicaid_inpatient_2016.CSV', header=true, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_commercial_pharmacy_2016.sql
- ```
- -- SQL model for syhdr_commercial_pharmacy_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- PHMCY_CLM_NUM::NUMERIC AS PHMCY_CLM_NUM,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- LINE_NBR::VARCHAR,
- FILL_DT::DATE AS FILL_DT,
- SYNTHETIC_DRUG_ID::VARCHAR,
- GENERIC_DRUG_NAME::VARCHAR AS GENERIC_DRUG_NAME,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_commercial_pharmacy_2016.CSV', header=true, null_padding=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_person_2016.sql
- ```
- -- SQL model for syhdr_medicare_person_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- AGE_LOW::NUMERIC AS AGE_LOW,
- AGE_HIGH::NUMERIC AS AGE_HIGH,
- SEX_IDENT_CD::VARCHAR AS SEX_IDENT_CD,
- RACE_CD::VARCHAR AS RACE_CD,
- STATE_CD::VARCHAR AS STATE_CD,
- COUNTY_FIPS_CD::VARCHAR AS COUNTY_FIPS_CD,
- ZIP_CD::VARCHAR AS ZIP_CD,
- RSN_ENRLMT_CD::VARCHAR AS RSN_ENRLMT_CD,
- MDCR_ENTLMT_IND_1::VARCHAR AS MDCR_ENTLMT_IND_1,
- MDCR_ENTLMT_IND_2::VARCHAR,
- MDCR_ENTLMT_IND_3::VARCHAR,
- MDCR_ENTLMT_IND_4::VARCHAR,
- MDCR_ENTLMT_IND_5::VARCHAR,
- MDCR_ENTLMT_IND_6::VARCHAR,
- MDCR_ENTLMT_IND_7::VARCHAR,
- MDCR_ENTLMT_IND_8::VARCHAR,
- MDCR_ENTLMT_IND_9::VARCHAR,
- MDCR_ENTLMT_IND_10::VARCHAR,
- MDCR_ENTLMT_IND_11::VARCHAR,
- MDCR_ENTLMT_IND_12::VARCHAR AS MDCR_ENTLMT_IND_12,
- MDCR_HMO_CVRG_1::NUMERIC AS MDCR_HMO_CVRG_1,
- MDCR_HMO_CVRG_2::VARCHAR,
- MDCR_HMO_CVRG_3::VARCHAR,
- MDCR_HMO_CVRG_4::VARCHAR,
- MDCR_HMO_CVRG_5::VARCHAR,
- MDCR_HMO_CVRG_6::VARCHAR,
- MDCR_HMO_CVRG_7::VARCHAR,
- MDCR_HMO_CVRG_8::VARCHAR,
- MDCR_HMO_CVRG_9::VARCHAR,
- MDCR_HMO_CVRG_10::VARCHAR,
- MDCR_HMO_CVRG_11::VARCHAR,
- MDCR_HMO_CVRG_12::NUMERIC AS MDCR_HMO_CVRG_12,
- PHRMCY_CVRG_1::NUMERIC AS PHRMCY_CVRG_1,
- PHRMCY_CVRG_2::VARCHAR,
- PHRMCY_CVRG_3::VARCHAR,
- PHRMCY_CVRG_4::VARCHAR,
- PHRMCY_CVRG_5::VARCHAR,
- PHRMCY_CVRG_6::VARCHAR,
- PHRMCY_CVRG_7::VARCHAR,
- PHRMCY_CVRG_8::VARCHAR,
- PHRMCY_CVRG_9::VARCHAR,
- PHRMCY_CVRG_10::VARCHAR,
- PHRMCY_CVRG_11::VARCHAR,
- PHRMCY_CVRG_12::NUMERIC AS PHRMCY_CVRG_12,
- DUAL_ELGBL_1::NUMERIC AS DUAL_ELGBL_1,
- DUAL_ELGBL_2::VARCHAR,
- DUAL_ELGBL_3::VARCHAR,
- DUAL_ELGBL_4::VARCHAR,
- DUAL_ELGBL_5::VARCHAR,
- DUAL_ELGBL_6::VARCHAR,
- DUAL_ELGBL_7::VARCHAR,
- DUAL_ELGBL_8::VARCHAR,
- DUAL_ELGBL_9::VARCHAR,
- DUAL_ELGBL_10::VARCHAR,
- DUAL_ELGBL_11::VARCHAR,
- DUAL_ELGBL_12::NUMERIC AS DUAL_ELGBL_12
- FROM read_csv('~/data/syh_dr/syhdr_medicare_person_2016.CSV', header=true, null_padding=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_pharmacy_2016.sql
- ```
- -- SQL model for syhdr_medicare_pharmacy_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- PHMCY_CLM_NUM::NUMERIC AS PHMCY_CLM_NUM,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- LINE_NBR::VARCHAR,
- FILL_DT::DATE AS FILL_DT,
- SYNTHETIC_DRUG_ID::VARCHAR,
- GENERIC_DRUG_NAME::VARCHAR AS GENERIC_DRUG_NAME,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_medicare_pharmacy_2016.CSV', header=true, null_padding=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicare_outpatient_2016.sql
- ```
- -- SQL model for syhdr_medicare_outpatient_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- FACILITY_ID::UBIGINT AS FACILITY_ID,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- AT_SPCLTY::VARCHAR AS AT_SPCLTY,
- SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
- SRVC_END_DATE::DATE AS SRVC_END_DATE,
- LOS::UINTEGER AS LOS,
- ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
- TOB_CD::VARCHAR AS TOB_CD,
- CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
- DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
- PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
- PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
- ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
- ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
- ICD_DX_CD_3::VARCHAR,
- ICD_DX_CD_4::VARCHAR,
- ICD_DX_CD_5::VARCHAR,
- ICD_DX_CD_6::VARCHAR,
- ICD_DX_CD_7::VARCHAR,
- ICD_DX_CD_8::VARCHAR,
- ICD_DX_CD_9::VARCHAR,
- ICD_DX_CD_10::VARCHAR,
- ICD_DX_CD_11::VARCHAR,
- ICD_DX_CD_12::VARCHAR,
- ICD_DX_CD_13::VARCHAR,
- ICD_DX_CD_14::VARCHAR,
- ICD_DX_CD_15::VARCHAR,
- ICD_DX_CD_16::VARCHAR,
- ICD_DX_CD_17::VARCHAR,
- ICD_DX_CD_18::VARCHAR,
- ICD_DX_CD_19::VARCHAR,
- ICD_DX_CD_20::VARCHAR,
- ICD_DX_CD_21::VARCHAR,
- ICD_DX_CD_22::VARCHAR,
- ICD_DX_CD_23::VARCHAR,
- ICD_DX_CD_24::VARCHAR,
- ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
- ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
- ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
- ICD_PRCDR_CD_3::VARCHAR,
- ICD_PRCDR_CD_4::VARCHAR,
- ICD_PRCDR_CD_5::VARCHAR,
- ICD_PRCDR_CD_6::VARCHAR,
- ICD_PRCDR_CD_7::VARCHAR,
- ICD_PRCDR_CD_8::VARCHAR,
- ICD_PRCDR_CD_9::VARCHAR,
- ICD_PRCDR_CD_10::VARCHAR,
- ICD_PRCDR_CD_11::VARCHAR,
- ICD_PRCDR_CD_12::VARCHAR,
- ICD_PRCDR_CD_13::VARCHAR,
- ICD_PRCDR_CD_14::VARCHAR,
- ICD_PRCDR_CD_15::VARCHAR,
- ICD_PRCDR_CD_16::VARCHAR,
- ICD_PRCDR_CD_17::VARCHAR,
- ICD_PRCDR_CD_18::VARCHAR,
- ICD_PRCDR_CD_19::VARCHAR,
- ICD_PRCDR_CD_20::VARCHAR,
- ICD_PRCDR_CD_21::VARCHAR,
- ICD_PRCDR_CD_22::VARCHAR,
- ICD_PRCDR_CD_23::VARCHAR,
- ICD_PRCDR_CD_24::VARCHAR,
- ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
- CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
- CPT_PRCDR_CD_2::VARCHAR,
- CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
- CPT_PRCDR_CD_4::VARCHAR,
- CPT_PRCDR_CD_5::VARCHAR,
- CPT_PRCDR_CD_6::VARCHAR,
- CPT_PRCDR_CD_7::VARCHAR,
- CPT_PRCDR_CD_8::VARCHAR,
- CPT_PRCDR_CD_9::VARCHAR,
- CPT_PRCDR_CD_10::VARCHAR,
- CPT_PRCDR_CD_11::VARCHAR,
- CPT_PRCDR_CD_12::VARCHAR,
- CPT_PRCDR_CD_13::VARCHAR,
- CPT_PRCDR_CD_14::VARCHAR,
- CPT_PRCDR_CD_15::VARCHAR,
- CPT_PRCDR_CD_16::VARCHAR,
- CPT_PRCDR_CD_17::VARCHAR,
- CPT_PRCDR_CD_18::VARCHAR,
- CPT_PRCDR_CD_19::VARCHAR,
- CPT_PRCDR_CD_20::VARCHAR,
- CPT_PRCDR_CD_21::VARCHAR,
- CPT_PRCDR_CD_22::VARCHAR,
- CPT_PRCDR_CD_23::VARCHAR,
- CPT_PRCDR_CD_24::VARCHAR,
- CPT_PRCDR_CD_25::VARCHAR,
- CPT_PRCDR_CD_26::VARCHAR,
- CPT_PRCDR_CD_27::VARCHAR,
- CPT_PRCDR_CD_28::VARCHAR,
- CPT_PRCDR_CD_29::VARCHAR,
- CPT_PRCDR_CD_30::VARCHAR,
- CPT_PRCDR_CD_31::VARCHAR,
- CPT_PRCDR_CD_32::VARCHAR,
- CPT_PRCDR_CD_33::VARCHAR,
- CPT_PRCDR_CD_34::VARCHAR,
- CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_medicare_outpatient_2016.CSV', header=true, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_outpatient_2016.sql
- ```
- -- SQL model for syhdr_medicaid_outpatient_2016.CSV
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- PERSON_ID::UBIGINT AS PERSON_ID,
- PERSON_WGHT::NUMERIC AS PERSON_WGHT,
- FACILITY_ID::UBIGINT AS FACILITY_ID,
- CLM_CNTL_NUM::NUMERIC AS CLM_CNTL_NUM,
- AT_SPCLTY::VARCHAR AS AT_SPCLTY,
- SRVC_BEG_DATE::DATE AS SRVC_BEG_DATE,
- SRVC_END_DATE::DATE AS SRVC_END_DATE,
- LOS::UINTEGER AS LOS,
- ADMSN_TYPE::VARCHAR AS ADMSN_TYPE,
- TOB_CD::VARCHAR AS TOB_CD,
- CLM_TYPE_CD::VARCHAR AS CLM_TYPE_CD,
- DSCHRG_STUS::VARCHAR AS DSCHRG_STUS,
- PRMRY_DX_IMPUTED::VARCHAR AS PRMRY_DX_IMPUTED,
- PRMRY_DX_CD::VARCHAR AS PRMRY_DX_CD,
- ICD_DX_CD_1::VARCHAR AS ICD_DX_CD_1,
- ICD_DX_CD_2::VARCHAR AS ICD_DX_CD_2,
- ICD_DX_CD_3::VARCHAR,
- ICD_DX_CD_4::VARCHAR,
- ICD_DX_CD_5::VARCHAR,
- ICD_DX_CD_6::VARCHAR,
- ICD_DX_CD_7::VARCHAR,
- ICD_DX_CD_8::VARCHAR,
- ICD_DX_CD_9::VARCHAR,
- ICD_DX_CD_10::VARCHAR,
- ICD_DX_CD_11::VARCHAR,
- ICD_DX_CD_12::VARCHAR,
- ICD_DX_CD_13::VARCHAR,
- ICD_DX_CD_14::VARCHAR,
- ICD_DX_CD_15::VARCHAR,
- ICD_DX_CD_16::VARCHAR,
- ICD_DX_CD_17::VARCHAR,
- ICD_DX_CD_18::VARCHAR,
- ICD_DX_CD_19::VARCHAR,
- ICD_DX_CD_20::VARCHAR,
- ICD_DX_CD_21::VARCHAR,
- ICD_DX_CD_22::VARCHAR,
- ICD_DX_CD_23::VARCHAR,
- ICD_DX_CD_24::VARCHAR,
- ICD_DX_CD_25::VARCHAR AS ICD_DX_CD_25,
- ICD_PRCDR_CD_1::VARCHAR AS ICD_PRCDR_CD_1,
- ICD_PRCDR_CD_2::VARCHAR AS ICD_PRCDR_CD_2,
- ICD_PRCDR_CD_3::VARCHAR,
- ICD_PRCDR_CD_4::VARCHAR,
- ICD_PRCDR_CD_5::VARCHAR,
- ICD_PRCDR_CD_6::VARCHAR,
- ICD_PRCDR_CD_7::VARCHAR,
- ICD_PRCDR_CD_8::VARCHAR,
- ICD_PRCDR_CD_9::VARCHAR,
- ICD_PRCDR_CD_10::VARCHAR,
- ICD_PRCDR_CD_11::VARCHAR,
- ICD_PRCDR_CD_12::VARCHAR,
- ICD_PRCDR_CD_13::VARCHAR,
- ICD_PRCDR_CD_14::VARCHAR,
- ICD_PRCDR_CD_15::VARCHAR,
- ICD_PRCDR_CD_16::VARCHAR,
- ICD_PRCDR_CD_17::VARCHAR,
- ICD_PRCDR_CD_18::VARCHAR,
- ICD_PRCDR_CD_19::VARCHAR,
- ICD_PRCDR_CD_20::VARCHAR,
- ICD_PRCDR_CD_21::VARCHAR,
- ICD_PRCDR_CD_22::VARCHAR,
- ICD_PRCDR_CD_23::VARCHAR,
- ICD_PRCDR_CD_24::VARCHAR,
- ICD_PRCDR_CD_25::VARCHAR AS ICD_PRCDR_CD_25,
- CPT_PRCDR_CD_1::VARCHAR AS CPT_PRCDR_CD_1,
- CPT_PRCDR_CD_2::VARCHAR,
- CPT_PRCDR_CD_3::VARCHAR AS CPT_PRCDR_CD_3,
- CPT_PRCDR_CD_4::VARCHAR,
- CPT_PRCDR_CD_5::VARCHAR,
- CPT_PRCDR_CD_6::VARCHAR,
- CPT_PRCDR_CD_7::VARCHAR,
- CPT_PRCDR_CD_8::VARCHAR,
- CPT_PRCDR_CD_9::VARCHAR,
- CPT_PRCDR_CD_10::VARCHAR,
- CPT_PRCDR_CD_11::VARCHAR,
- CPT_PRCDR_CD_12::VARCHAR,
- CPT_PRCDR_CD_13::VARCHAR,
- CPT_PRCDR_CD_14::VARCHAR,
- CPT_PRCDR_CD_15::VARCHAR,
- CPT_PRCDR_CD_16::VARCHAR,
- CPT_PRCDR_CD_17::VARCHAR,
- CPT_PRCDR_CD_18::VARCHAR,
- CPT_PRCDR_CD_19::VARCHAR,
- CPT_PRCDR_CD_20::VARCHAR,
- CPT_PRCDR_CD_21::VARCHAR,
- CPT_PRCDR_CD_22::VARCHAR,
- CPT_PRCDR_CD_23::VARCHAR,
- CPT_PRCDR_CD_24::VARCHAR,
- CPT_PRCDR_CD_25::VARCHAR,
- CPT_PRCDR_CD_26::VARCHAR,
- CPT_PRCDR_CD_27::VARCHAR,
- CPT_PRCDR_CD_28::VARCHAR,
- CPT_PRCDR_CD_29::VARCHAR,
- CPT_PRCDR_CD_30::VARCHAR,
- CPT_PRCDR_CD_31::VARCHAR,
- CPT_PRCDR_CD_32::VARCHAR,
- CPT_PRCDR_CD_33::VARCHAR,
- CPT_PRCDR_CD_34::VARCHAR,
- CPT_PRCDR_CD_35::VARCHAR AS CPT_PRCDR_CD_35,
- replace(replace(PLAN_PMT_AMT, '$', ''), ',', '')::FLOAT AS PLAN_PMT_AMT,
- replace(replace(TOT_CHRG_AMT, '$', ''), ',', '')::FLOAT AS TOT_CHRG_AMT
- FROM read_csv('~/data/syh_dr/syhdr_medicaid_outpatient_2016.CSV', header=true, null_padding=true, types={'CPT_PRCDR_CD_1': 'VARCHAR', 'CPT_PRCDR_CD_2': 'VARCHAR', 'CPT_PRCDR_CD_3': 'VARCHAR', 'CPT_PRCDR_CD_4': 'VARCHAR', 'CPT_PRCDR_CD_5': 'VARCHAR', 'CPT_PRCDR_CD_6': 'VARCHAR', 'CPT_PRCDR_CD_7': 'VARCHAR', 'CPT_PRCDR_CD_8': 'VARCHAR', 'CPT_PRCDR_CD_9': 'VARCHAR', 'CPT_PRCDR_CD_10': 'VARCHAR', 'CPT_PRCDR_CD_11': 'VARCHAR', 'CPT_PRCDR_CD_12': 'VARCHAR', 'CPT_PRCDR_CD_13': 'VARCHAR', 'CPT_PRCDR_CD_14': 'VARCHAR', 'CPT_PRCDR_CD_15': 'VARCHAR', 'CPT_PRCDR_CD_16': 'VARCHAR', 'CPT_PRCDR_CD_17': 'VARCHAR', 'CPT_PRCDR_CD_18': 'VARCHAR', 'CPT_PRCDR_CD_19': 'VARCHAR', 'CPT_PRCDR_CD_20': 'VARCHAR', 'CPT_PRCDR_CD_21': 'VARCHAR', 'CPT_PRCDR_CD_22': 'VARCHAR', 'CPT_PRCDR_CD_23': 'VARCHAR', 'CPT_PRCDR_CD_24': 'VARCHAR', 'CPT_PRCDR_CD_25': 'VARCHAR', 'CPT_PRCDR_CD_26': 'VARCHAR', 'CPT_PRCDR_CD_27': 'VARCHAR', 'CPT_PRCDR_CD_28': 'VARCHAR', 'CPT_PRCDR_CD_29': 'VARCHAR', 'CPT_PRCDR_CD_30': 'VARCHAR', 'CPT_PRCDR_CD_31': 'VARCHAR', 'CPT_PRCDR_CD_32': 'VARCHAR', 'CPT_PRCDR_CD_33': 'VARCHAR', 'CPT_PRCDR_CD_34': 'VARCHAR', 'CPT_PRCDR_CD_35': 'VARCHAR'}, ignore_errors=true)
- ```
- Relative file path: ahrq.gov/generated/with_types/syhdr_medicaid_provider_2016.sql
- ```
- -- SQL model for syhdr_medicaid_provider_2016.csv
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- Facility_ID::VARCHAR,
- Prvdr_Ctgry_Cd::VARCHAR,
- Prvdr_Ownrshp_Cd::VARCHAR,
- Prvdr_Prtcptn_Cd::VARCHAR
- FROM read_csv('~/data/syh_dr/syhdr_medicaid_provider_2016.csv', header=true, null_padding=true)
- ```
- Relative file path: ahrq.gov/sdoh/sdoh_zipcode.sql
- ```
- {{ config(
- materialized = 'external',
- location = '../data/sdoh_zipcode.parquet'
- )
- }}
- {% set sdoh_zipcode_years = [
- '2011', '2012', '2013', '2014', '2015',
- '2016', '2017', '2018', '2019', '2020'] %}
- {% set relations = [] %}
- {% for sdoh_zipcode_year in sdoh_zipcode_years -%}
- {% do relations.append(source('social_determinants_of_health', 'sdoh_zipcode_' ~ sdoh_zipcode_year)) %}
- {% endfor %}
- with union_unpivot as (
- {% for relation in relations %}
- unpivot {{ relation }}
- on columns(* exclude (year, statefips, zipcode, zcta, state, region, territory, point_zip))
- into
- name survey_variable_name
- value survey_score
- {%- if not loop.last %} union all {% endif -%}
- {% endfor %}
- )
- select
- {{ dbt_utils.generate_surrogate_key(
- ['year', 'zcta', 'zipcode']
- ) }} as sdoh_zipcode_key,
- *
- from union_unpivot
- ```
- Relative file path: ahrq.gov/sdoh/sdoh_county.sql
- ```
- {{ config(
- materialized = 'external',
- location = '../data/sdoh_county.parquet'
- )
- }}
- {% set sdoh_county_years = [
- '2009', '2010', '2011', '2012', '2013', '2014',
- '2015', '2016', '2017', '2018', '2019', '2020'] %}
- {% set relations = [] %}
- {% for sdoh_county_year in sdoh_county_years -%}
- {% do relations.append(source('social_determinants_of_health', 'sdoh_county_' ~ sdoh_county_year)) %}
- {% endfor %}
- with union_unpivot as (
- {% for relation in relations %}
- unpivot {{ relation }}
- on columns(* exclude (year, countyfips, statefips, state, county, region, territory))
- into
- name survey_variable_name
- value survey_score
- {% if not loop.last %} union all {% endif -%}
- {% endfor %}
- )
- select
- {{ dbt_utils.generate_surrogate_key(
- ['year', 'countyfips', 'county']
- ) }} as sdoh_county_key,
- *
- from union_unpivot
- ```
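The SDOH models all follow the same shape: loop over the per-year sources, UNPIVOT each wide survey table into (survey_variable_name, survey_score) pairs while keeping the geographic identifier columns, and UNION the years together. A minimal Python sketch of that wide-to-long reshaping (the sample column names are illustrative, not the actual SDOH variables):

```python
def unpivot(rows, id_cols):
    """Turn wide rows into long (name, value) rows, keeping the id columns."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_cols}
        for name, value in row.items():
            if name not in id_cols:  # mirrors: on columns(* exclude (...))
                long_rows.append({**ids,
                                  "survey_variable_name": name,
                                  "survey_score": value})
    return long_rows

wide = [{"year": 2020, "countyfips": "01001",
         "acs_pct_poverty": 14.9, "acs_med_income": 58731}]
long_rows = unpivot(wide, id_cols={"year", "countyfips"})
```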
- Relative file path: ahrq.gov/sdoh/sdoh_tract.sql
- ```
- {{ config(
- materialized = 'external',
- location = '../data/sdoh_tract.parquet') }}
- {% set sdoh_tract_years = [
- '2009', '2010', '2011', '2012', '2013', '2014',
- '2015', '2016', '2017', '2018', '2019', '2020'] %}
- {% set relations = [] %}
- {% for sdoh_tract_year in sdoh_tract_years -%}
- {% do relations.append(source('social_determinants_of_health', 'sdoh_tract_' ~ sdoh_tract_year)) %}
- {% endfor %}
- with union_unpivot as (
- {% for relation in relations %}
- unpivot {{ relation }}
- on columns(* exclude (year, tractfips, countyfips, statefips, state, county, region, territory))
- into
- name survey_variable_name
- value survey_score
- {% if not loop.last %} union all {% endif -%}
- {% endfor %}
- )
- select
- {{ dbt_utils.generate_surrogate_key(
- ['year', 'tractfips', 'countyfips', 'county']
- ) }} as sdoh_tract_key,
- *
- from union_unpivot
- ```
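Each SDOH model keys its rows with `dbt_utils.generate_surrogate_key`, which casts every input column to a string, substitutes a sentinel for NULLs, joins the parts with `-`, and MD5-hashes the result. An approximate stand-alone sketch of that behavior (the sentinel and separator below match dbt-utils 1.x defaults, but verify against your installed version):

```python
import hashlib

def surrogate_key(*values):
    """Roughly what dbt_utils.generate_surrogate_key emits: coalesce, join, md5."""
    parts = ["_dbt_utils_surrogate_key_null_" if v is None else str(v)
             for v in values]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()

# Hypothetical tract-level inputs: year, tractfips, countyfips, county
key = surrogate_key(2020, "01073001400", "01073", "Jefferson")
```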
- Relative file path: bls.gov/download_consumer_price_index.py
- ```
- import os
- import pandas as pd
- import requests
- def model(dbt, session):
-     # Resolve the configured output directory and the local copy of the workbook
-     output_path = dbt.config.get('output_path')
-     base_path = os.path.expanduser(output_path)
-     excel_path = os.path.join(base_path, "r-cpi-u-rs-allitems.xlsx")
-     # Download the CPI research series workbook once; reuse the local copy afterwards
-     cpi_url = "https://www.bls.gov/cpi/research-series/r-cpi-u-rs-allitems.xlsx"
-     if not os.path.exists(excel_path):
-         response = requests.get(cpi_url, timeout=30)
-         response.raise_for_status()
-         with open(excel_path, 'wb') as f:
-             f.write(response.content)
-     # The published sheet carries five header rows before the YEAR/AVG data
-     consumer_price_index_df = pd.read_excel(excel_path, skiprows=5, usecols=['YEAR', 'AVG'])
-     return consumer_price_index_df
- ```
- Relative file path: bls.gov/consumer_price_index.sql
- ```
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- SELECT
- YEAR AS year,
- "AVG" AS consumer_price_index
- FROM {{ ref('download_consumer_price_index') }}
- ```
- Relative file path: mitre.org/synthea.sql
- ```
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- WITH patient_data AS (
- SELECT
- json_extract_string(resource, '$.id') AS patient_id,
- json_transform(json_extract(resource, '$.identifier'), '[{"type":{"text":"VARCHAR"},"value":"VARCHAR"}]') AS identifiers,
- json_transform(json_extract(resource, '$.extension'), '[{"url":"VARCHAR","extension":[{"valueCoding":{"display":"VARCHAR"}}]}]') AS extensions,
- json_transform(json_extract(resource, '$.address'), '[{"city":"VARCHAR","state":"VARCHAR","postalCode":"VARCHAR","country":"VARCHAR","line":["VARCHAR"]}]') AS addresses,
- json_extract_string(resource, '$.name[0].given[0]') AS first_name,
- json_extract_string(resource, '$.name[0].family') AS last_name,
- json_extract_string(resource, '$.name[0].given[1]') AS first_name_alt,
- json_extract_string(resource, '$.name[0].prefix[0]') AS patient_name_prefix,
- json_extract_string(resource, '$.name[0].use') AS patient_name_use,
- json_extract_string(resource, '$.name[1].given[0]') AS first_name2,
- json_extract_string(resource, '$.name[1].given[1]') AS first_name_alt2,
- json_extract_string(resource, '$.name[1].family') AS last_name2,
- json_extract_string(resource, '$.name[1].prefix[0]') AS patient_name_prefix2,
- json_extract_string(resource, '$.gender') AS gender,
- json_extract_string(resource, '$.birthDate') AS birth_date,
- json_extract_string(resource, '$.telecom[0].system') AS telecom_system,
- json_extract_string(resource, '$.telecom[0].value') AS telecom_value,
- json_extract_string(resource, '$.telecom[0].use') AS telecom_use,
- json_extract_string(resource, '$.maritalStatus.text') AS patient_marital_status,
- json_extract_string(resource, '$.deceasedDateTime') AS patient_deceased_date,
- json_extract_string(resource, '$.multipleBirthBoolean') AS patient_multiple_birth_ind
- FROM (
- SELECT json_extract(json(unnested_entry.entry), '$.resource') AS resource
- FROM read_json_auto('./../data/Gudrun69_Shaunna800_Goyette777_d5e33bd1-960e-bcf4-e5f9-9a4afc6d5a5e.json', columns={'entry': 'JSON[]'}) AS entries,
- unnest(entries.entry) AS unnested_entry
- WHERE json_extract_string(json(unnested_entry.entry), '$.resource.resourceType') = 'Patient'
- ) AS patient_resource
- ),
- identifier_unnested AS (
- SELECT
- patient_id,
- unnest(identifiers) AS identifier
- FROM patient_data
- ),
- identifier_aggregated AS (
- SELECT
- patient_id,
- MIN(CASE WHEN identifier.type.text = 'Medical Record Number' THEN identifier.value END) AS patient_mrn,
- MIN(CASE WHEN identifier.type.text = 'Social Security Number' THEN identifier.value END) AS patient_ssn,
- MIN(CASE WHEN identifier.type.text = 'Driver''s license number' THEN identifier.value END) AS patient_drivers_license_num,
- MIN(CASE WHEN identifier.type.text = 'Passport Number' THEN identifier.value END) AS patient_passport_num
- FROM identifier_unnested
- GROUP BY patient_id
- ),
- extension_unnested AS (
- SELECT
- patient_id,
- unnest(extensions) AS extension
- FROM patient_data
- ),
- extension_aggregated AS (
- SELECT
- patient_id,
- MIN(CASE WHEN extension.url = 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-race' THEN extension.extension[1].valueCoding.display END) AS patient_core_race,
- MIN(CASE WHEN extension.url = 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity' THEN extension.extension[1].valueCoding.display END) AS patient_core_ethnicity
- FROM extension_unnested
- GROUP BY patient_id
- ),
- addresses_unnested AS (
- SELECT
- patient_id,
- unnest(addresses) AS address
- FROM patient_data
- ),
- combined_data AS (
- SELECT
- p.patient_id,
- p.first_name,
- p.last_name,
- i.patient_mrn,
- i.patient_ssn,
- e.patient_core_race,
- e.patient_core_ethnicity,
- a.address.line[1] AS address_line_1,
- a.address.city,
- a.address.state,
- a.address.postalCode,
- a.address.country
- FROM patient_data p
- JOIN identifier_aggregated i on i.patient_id = p.patient_id
- JOIN extension_aggregated e on e.patient_id = p.patient_id
- JOIN addresses_unnested a on a.patient_id = p.patient_id
- )
- select * from combined_data
- ```
- Relative file path: figures/insurance_plan_payment_histogram_inflation_adjusted.sql
- ```
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- WITH cpi_adjustment AS (
- SELECT
- year,
- consumer_price_index
- FROM {{ ref('consumer_price_index') }}
- ),
- latest_cpi AS (
- SELECT
- MAX(consumer_price_index) AS cpi_2022
- FROM cpi_adjustment
- WHERE year = 2022
- ),
- inflation_adjustment_factors AS (
- SELECT
- 2016 AS year,
- (lc.cpi_2022 / ca.consumer_price_index) AS adjustment_factor_to_2022
- FROM cpi_adjustment ca
- CROSS JOIN latest_cpi lc
- WHERE ca.year = 2016
- ),
- commercial_data AS (
- SELECT
- PLAN_PMT_AMT * iaf.adjustment_factor_to_2022 AS Payment,
- COUNT(*) AS count,
- 'Commercial' AS Insurance
- FROM read_parquet('/Users/me/data/syh_dr/syhdr_commercial_inpatient_2016.parquet') cd
- JOIN inflation_adjustment_factors iaf ON 1 = 1
- GROUP BY PLAN_PMT_AMT, iaf.adjustment_factor_to_2022
- ),
- medicaid_data AS (
- SELECT
- PLAN_PMT_AMT * iaf.adjustment_factor_to_2022 AS Payment,
- COUNT(*) AS count,
- 'Medicaid' AS Insurance
- FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicaid_inpatient_2016.parquet') md
- JOIN inflation_adjustment_factors iaf ON 1 = 1
- GROUP BY PLAN_PMT_AMT, iaf.adjustment_factor_to_2022
- ),
- medicare_data AS (
- SELECT
- PLAN_PMT_AMT * iaf.adjustment_factor_to_2022 AS Payment,
- COUNT(*) AS count,
- 'Medicare' AS Insurance
- FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicare_inpatient_2016.parquet') mcd
- JOIN inflation_adjustment_factors iaf ON 1 = 1
- GROUP BY PLAN_PMT_AMT, iaf.adjustment_factor_to_2022
- ),
- combined_data AS (
- SELECT * FROM commercial_data
- UNION ALL
- SELECT * FROM medicaid_data
- UNION ALL
- SELECT * FROM medicare_data
- )
- SELECT
- Payment,
- count,
- Insurance
- FROM combined_data
- ORDER BY Insurance, Payment
- ```
- Relative file path: figures/insurance_plan_payment_histogram.sql
- ```
- {{ config(materialized='external', location=var('output_path') + '/' + this.name + '.parquet') }}
- WITH commercial_data AS (
- SELECT
- PLAN_PMT_AMT AS Payment,
- COUNT(*) AS count,
- 'Commercial' AS Insurance
- FROM read_parquet('/Users/me/data/syh_dr/syhdr_commercial_inpatient_2016.parquet')
- GROUP BY PLAN_PMT_AMT
- ),
- medicaid_data AS (
- SELECT
- PLAN_PMT_AMT AS Payment,
- COUNT(*) AS count,
- 'Medicaid' AS Insurance
- FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicaid_inpatient_2016.parquet')
- GROUP BY PLAN_PMT_AMT
- ),
- medicare_data AS (
- SELECT
- PLAN_PMT_AMT AS Payment,
- COUNT(*) AS count,
- 'Medicare' AS Insurance
- FROM read_parquet('/Users/me/data/syh_dr/syhdr_medicare_inpatient_2016.parquet')
- GROUP BY PLAN_PMT_AMT
- ),
- combined_data AS (
- SELECT * FROM commercial_data
- UNION ALL
- SELECT * FROM medicaid_data
- UNION ALL
- SELECT * FROM medicare_data
- )
- SELECT
- Payment,
- count,
- Insurance
- FROM combined_data
- ORDER BY Insurance, Payment
- ```
- Relative file path: config.yml
- ```
- version: 2
- models:
- - name: download_consumer_price_index
- config:
- data_path: "{{ var('data_path') }}"
- output_path: "{{ var('output_path') }}"
- - name: ahrq.gov
- config:
- data_path: "{{ var('data_path') }}"
- output_path: "{{ var('output_path') }}"
- - name: mitre.org
- config:
- data_path: "{{ var('data_path') }}"
- output_path: "{{ var('output_path') }}"
- ```
- Relative file path: sources.yml
- ```
- version: 2
- sources:
- - name: syh_dr
- description: Synthetic Healthcare data from AHRQ
- tables:
- - name: inpatient
- description: Inpatient files from AHRQ.
- meta:
- external_location: >
- read_csv('../data/syh_dr/syhdr_*_inpatient_2016.csv',
- types = {
- PERSON_ID: 'UBIGINT',
- PERSON_WGHT: 'NUMERIC',
- FACILITY_ID: 'UBIGINT',
- CLM_CNTL_NUM: 'NUMERIC',
- AT_SPCLTY: 'VARCHAR',
- SRVC_BEG_DATE: 'DATE',
- SRVC_END_DATE: 'DATE',
- LOS: 'UINTEGER',
- ADMSN_TYPE: 'VARCHAR',
- TOB_CD: 'VARCHAR',
- CLM_TYPE_CD: 'VARCHAR',
- DSCHRG_STUS: 'VARCHAR',
- PRMRY_DX_IMPUTED: 'VARCHAR',
- PRMRY_DX_CD: 'VARCHAR',
- ICD_DX_CD_1: 'VARCHAR',
- ICD_DX_CD_2: 'VARCHAR',
- ICD_DX_CD_3: 'VARCHAR',
- ICD_DX_CD_4: 'VARCHAR',
- ICD_DX_CD_5: 'VARCHAR',
- ICD_DX_CD_6: 'VARCHAR',
- ICD_DX_CD_7: 'VARCHAR',
- ICD_DX_CD_8: 'VARCHAR',
- ICD_DX_CD_9: 'VARCHAR',
- ICD_DX_CD_10: 'VARCHAR',
- ICD_DX_CD_11: 'VARCHAR',
- ICD_DX_CD_12: 'VARCHAR',
- ICD_DX_CD_13: 'VARCHAR',
- ICD_DX_CD_14: 'VARCHAR',
- ICD_DX_CD_15: 'VARCHAR',
- ICD_DX_CD_16: 'VARCHAR',
- ICD_DX_CD_17: 'VARCHAR',
- ICD_DX_CD_18: 'VARCHAR',
- ICD_DX_CD_19: 'VARCHAR',
- ICD_DX_CD_20: 'VARCHAR',
- ICD_DX_CD_21: 'VARCHAR',
- ICD_DX_CD_22: 'VARCHAR',
- ICD_DX_CD_23: 'VARCHAR',
- ICD_DX_CD_24: 'VARCHAR',
- ICD_DX_CD_25: 'VARCHAR',
- ICD_PRCDR_CD_1: 'VARCHAR',
- ICD_PRCDR_CD_2: 'VARCHAR',
- ICD_PRCDR_CD_3: 'VARCHAR',
- ICD_PRCDR_CD_4: 'VARCHAR',
- ICD_PRCDR_CD_5: 'VARCHAR',
- ICD_PRCDR_CD_6: 'VARCHAR',
- ICD_PRCDR_CD_7: 'VARCHAR',
- ICD_PRCDR_CD_8: 'VARCHAR',
- ICD_PRCDR_CD_9: 'VARCHAR',
- ICD_PRCDR_CD_10: 'VARCHAR',
- ICD_PRCDR_CD_11: 'VARCHAR',
- ICD_PRCDR_CD_12: 'VARCHAR',
- ICD_PRCDR_CD_13: 'VARCHAR',
- ICD_PRCDR_CD_14: 'VARCHAR',
- ICD_PRCDR_CD_15: 'VARCHAR',
- ICD_PRCDR_CD_16: 'VARCHAR',
- ICD_PRCDR_CD_17: 'VARCHAR',
- ICD_PRCDR_CD_18: 'VARCHAR',
- ICD_PRCDR_CD_19: 'VARCHAR',
- ICD_PRCDR_CD_20: 'VARCHAR',
- ICD_PRCDR_CD_21: 'VARCHAR',
- ICD_PRCDR_CD_22: 'VARCHAR',
- ICD_PRCDR_CD_23: 'VARCHAR',
- ICD_PRCDR_CD_24: 'VARCHAR',
- ICD_PRCDR_CD_25: 'VARCHAR',
- CPT_PRCDR_CD_1: 'VARCHAR',
- CPT_PRCDR_CD_2: 'VARCHAR',
- CPT_PRCDR_CD_3: 'VARCHAR',
- CPT_PRCDR_CD_4: 'VARCHAR',
- CPT_PRCDR_CD_5: 'VARCHAR',
- CPT_PRCDR_CD_6: 'VARCHAR',
- CPT_PRCDR_CD_7: 'VARCHAR',
- CPT_PRCDR_CD_8: 'VARCHAR',
- CPT_PRCDR_CD_9: 'VARCHAR',
- CPT_PRCDR_CD_10: 'VARCHAR',
- CPT_PRCDR_CD_11: 'VARCHAR',
- CPT_PRCDR_CD_12: 'VARCHAR',
- CPT_PRCDR_CD_13: 'VARCHAR',
- CPT_PRCDR_CD_14: 'VARCHAR',
- CPT_PRCDR_CD_15: 'VARCHAR',
- CPT_PRCDR_CD_16: 'VARCHAR',
- CPT_PRCDR_CD_17: 'VARCHAR',
- CPT_PRCDR_CD_18: 'VARCHAR',
- CPT_PRCDR_CD_19: 'VARCHAR',
- CPT_PRCDR_CD_20: 'VARCHAR',
- CPT_PRCDR_CD_21: 'VARCHAR',
- CPT_PRCDR_CD_22: 'VARCHAR',
- CPT_PRCDR_CD_23: 'VARCHAR',
- CPT_PRCDR_CD_24: 'VARCHAR',
- CPT_PRCDR_CD_25: 'VARCHAR',
- CPT_PRCDR_CD_26: 'VARCHAR',
- CPT_PRCDR_CD_27: 'VARCHAR',
- CPT_PRCDR_CD_28: 'VARCHAR',
- CPT_PRCDR_CD_29: 'VARCHAR',
- CPT_PRCDR_CD_30: 'VARCHAR',
- CPT_PRCDR_CD_31: 'VARCHAR',
- CPT_PRCDR_CD_32: 'VARCHAR',
- CPT_PRCDR_CD_33: 'VARCHAR',
- CPT_PRCDR_CD_34: 'VARCHAR',
- CPT_PRCDR_CD_35: 'VARCHAR',
- PLAN_PMT_AMT: 'FLOAT',
- TOT_CHRG_AMT: 'FLOAT'
- })
- formatter: oldstyle
- - name: outpatient
- description: Outpatient files from AHRQ.
- meta:
- external_location: >
- read_csv('../data/syh_dr/syhdr_*_outpatient_2016.csv',
- types = {
- PERSON_ID: 'UBIGINT',
- PERSON_WGHT: 'NUMERIC',
- FACILITY_ID: 'UBIGINT',
- CLM_CNTL_NUM: 'NUMERIC',
- AT_SPCLTY: 'VARCHAR',
- SRVC_BEG_DATE: 'DATE',
- SRVC_END_DATE: 'DATE',
- LOS: 'UINTEGER',
- ADMSN_TYPE: 'VARCHAR',
- TOB_CD: 'VARCHAR',
- CLM_TYPE_CD: 'VARCHAR',
- DSCHRG_STUS: 'VARCHAR',
- PRMRY_DX_IMPUTED: 'VARCHAR',
- PRMRY_DX_CD: 'VARCHAR',
- ICD_DX_CD_1: 'VARCHAR',
- ICD_DX_CD_2: 'VARCHAR',
- ICD_DX_CD_3: 'VARCHAR',
- ICD_DX_CD_4: 'VARCHAR',
- ICD_DX_CD_5: 'VARCHAR',
- ICD_DX_CD_6: 'VARCHAR',
- ICD_DX_CD_7: 'VARCHAR',
- ICD_DX_CD_8: 'VARCHAR',
- ICD_DX_CD_9: 'VARCHAR',
- ICD_DX_CD_10: 'VARCHAR',
- ICD_DX_CD_11: 'VARCHAR',
- ICD_DX_CD_12: 'VARCHAR',
- ICD_DX_CD_13: 'VARCHAR',
- ICD_DX_CD_14: 'VARCHAR',
- ICD_DX_CD_15: 'VARCHAR',
- ICD_DX_CD_16: 'VARCHAR',
- ICD_DX_CD_17: 'VARCHAR',
- ICD_DX_CD_18: 'VARCHAR',
- ICD_DX_CD_19: 'VARCHAR',
- ICD_DX_CD_20: 'VARCHAR',
- ICD_DX_CD_21: 'VARCHAR',
- ICD_DX_CD_22: 'VARCHAR',
- ICD_DX_CD_23: 'VARCHAR',
- ICD_DX_CD_24: 'VARCHAR',
- ICD_DX_CD_25: 'VARCHAR',
- ICD_PRCDR_CD_1: 'VARCHAR',
- ICD_PRCDR_CD_2: 'VARCHAR',
- ICD_PRCDR_CD_3: 'VARCHAR',
- ICD_PRCDR_CD_4: 'VARCHAR',
- ICD_PRCDR_CD_5: 'VARCHAR',
- ICD_PRCDR_CD_6: 'VARCHAR',
- ICD_PRCDR_CD_7: 'VARCHAR',
- ICD_PRCDR_CD_8: 'VARCHAR',
- ICD_PRCDR_CD_9: 'VARCHAR',
- ICD_PRCDR_CD_10: 'VARCHAR',
- ICD_PRCDR_CD_11: 'VARCHAR',
- ICD_PRCDR_CD_12: 'VARCHAR',
- ICD_PRCDR_CD_13: 'VARCHAR',
- ICD_PRCDR_CD_14: 'VARCHAR',
- ICD_PRCDR_CD_15: 'VARCHAR',
- ICD_PRCDR_CD_16: 'VARCHAR',
- ICD_PRCDR_CD_17: 'VARCHAR',
- ICD_PRCDR_CD_18: 'VARCHAR',
- ICD_PRCDR_CD_19: 'VARCHAR',
- ICD_PRCDR_CD_20: 'VARCHAR',
- ICD_PRCDR_CD_21: 'VARCHAR',
- ICD_PRCDR_CD_22: 'VARCHAR',
- ICD_PRCDR_CD_23: 'VARCHAR',
- ICD_PRCDR_CD_24: 'VARCHAR',
- ICD_PRCDR_CD_25: 'VARCHAR',
- CPT_PRCDR_CD_1: 'VARCHAR',
- CPT_PRCDR_CD_2: 'VARCHAR',
- CPT_PRCDR_CD_3: 'VARCHAR',
- CPT_PRCDR_CD_4: 'VARCHAR',
- CPT_PRCDR_CD_5: 'VARCHAR',
- CPT_PRCDR_CD_6: 'VARCHAR',
- CPT_PRCDR_CD_7: 'VARCHAR',
- CPT_PRCDR_CD_8: 'VARCHAR',
- CPT_PRCDR_CD_9: 'VARCHAR',
- CPT_PRCDR_CD_10: 'VARCHAR',
- CPT_PRCDR_CD_11: 'VARCHAR',
- CPT_PRCDR_CD_12: 'VARCHAR',
- CPT_PRCDR_CD_13: 'VARCHAR',
- CPT_PRCDR_CD_14: 'VARCHAR',
- CPT_PRCDR_CD_15: 'VARCHAR',
- CPT_PRCDR_CD_16: 'VARCHAR',
- CPT_PRCDR_CD_17: 'VARCHAR',
- CPT_PRCDR_CD_18: 'VARCHAR',
- CPT_PRCDR_CD_19: 'VARCHAR',
- CPT_PRCDR_CD_20: 'VARCHAR',
- CPT_PRCDR_CD_21: 'VARCHAR',
- CPT_PRCDR_CD_22: 'VARCHAR',
- CPT_PRCDR_CD_23: 'VARCHAR',
- CPT_PRCDR_CD_24: 'VARCHAR',
- CPT_PRCDR_CD_25: 'VARCHAR',
- CPT_PRCDR_CD_26: 'VARCHAR',
- CPT_PRCDR_CD_27: 'VARCHAR',
- CPT_PRCDR_CD_28: 'VARCHAR',
- CPT_PRCDR_CD_29: 'VARCHAR',
- CPT_PRCDR_CD_30: 'VARCHAR',
- CPT_PRCDR_CD_31: 'VARCHAR',
- CPT_PRCDR_CD_32: 'VARCHAR',
- CPT_PRCDR_CD_33: 'VARCHAR',
- CPT_PRCDR_CD_34: 'VARCHAR',
- CPT_PRCDR_CD_35: 'VARCHAR',
- PLAN_PMT_AMT: 'FLOAT',
- TOT_CHRG_AMT: 'FLOAT'
- })
- formatter: oldstyle
- - name: person
- description: Person-Level files from AHRQ.
- meta:
- external_location: >
- read_csv('../data/syh_dr/syhdr_*_person_2016.csv',
- types = {
- PERSON_ID: 'UBIGINT',
- MCAID_BENE_ID: 'UBIGINT',
- PERSON_WGHT: 'NUMERIC',
- AGE_LOW: 'NUMERIC',
- AGE_HIGH: 'NUMERIC',
- SEX_IDENT_CD: 'VARCHAR',
- RACE_CD: 'VARCHAR',
- MCAID_SBMTTG_ST_CD: 'VARCHAR',
- STATE_CD: 'VARCHAR',
- COUNTY_FIPS_CD: 'VARCHAR',
- ZIP_CD: 'VARCHAR',
- RSN_ENRLMT_CD: 'VARCHAR',
- MDCD_ENRLMT_1: 'NUMERIC',
- MDCD_ENRLMT_2: 'NUMERIC',
- MDCD_ENRLMT_3: 'NUMERIC',
- MDCD_ENRLMT_4: 'NUMERIC',
- MDCD_ENRLMT_5: 'NUMERIC',
- MDCD_ENRLMT_6: 'NUMERIC',
- MDCD_ENRLMT_7: 'NUMERIC',
- MDCD_ENRLMT_8: 'NUMERIC',
- MDCD_ENRLMT_9: 'NUMERIC',
- MDCD_ENRLMT_10: 'NUMERIC',
- MDCD_ENRLMT_11: 'NUMERIC',
- MDCD_ENRLMT_12: 'NUMERIC',
- MDCD_MCO_ENRLMT_1: 'NUMERIC',
- MDCD_MCO_ENRLMT_2: 'NUMERIC',
- MDCD_MCO_ENRLMT_3: 'NUMERIC',
- MDCD_MCO_ENRLMT_4: 'NUMERIC',
- MDCD_MCO_ENRLMT_5: 'NUMERIC',
- MDCD_MCO_ENRLMT_6: 'NUMERIC',
- MDCD_MCO_ENRLMT_7: 'NUMERIC',
- MDCD_MCO_ENRLMT_8: 'NUMERIC',
- MDCD_MCO_ENRLMT_9: 'NUMERIC',
- MDCD_MCO_ENRLMT_10: 'NUMERIC',
- MDCD_MCO_ENRLMT_11: 'NUMERIC',
- MDCD_MCO_ENRLMT_12: 'NUMERIC',
- MDCD_CHIP_ENRLMT: 'NUMERIC',
- RSTCTD_BNFTS_IND: 'NUMERIC',
- DUAL_ELGBL_1: 'NUMERIC',
- DUAL_ELGBL_2: 'NUMERIC',
- DUAL_ELGBL_3: 'NUMERIC',
- DUAL_ELGBL_4: 'NUMERIC',
- DUAL_ELGBL_5: 'NUMERIC',
- DUAL_ELGBL_6: 'NUMERIC',
- DUAL_ELGBL_7: 'NUMERIC',
- DUAL_ELGBL_8: 'NUMERIC',
- DUAL_ELGBL_9: 'NUMERIC',
- DUAL_ELGBL_10: 'NUMERIC',
- DUAL_ELGBL_11: 'NUMERIC',
- DUAL_ELGBL_12: 'NUMERIC'
- })
- formatter: oldstyle
- - name: pharmacy
- description: Pharmacy files from AHRQ.
- meta:
- external_location: >
- read_csv('../data/syh_dr/syhdr_*_pharmacy_2016.csv',
- types = {
- PERSON_ID: 'UBIGINT',
- PERSON_WGHT: 'NUMERIC',
- PHMCY_CLM_NUM: 'NUMERIC',
- CLM_CNTL_NUM: 'NUMERIC',
- LINE_NBR: 'VARCHAR',
- FILL_DT: 'DATE',
- SYNTHETIC_DRUG_ID: 'VARCHAR',
- GENERIC_DRUG_NAME: 'VARCHAR',
- PLAN_PMT_AMT: 'FLOAT',
- TOT_CHRG_AMT: 'FLOAT'
- })
- formatter: oldstyle
- - name: provider
- description: Provider files from AHRQ.
- meta:
- external_location: >
- read_csv('../data/syh_dr/syhdr_*_provider_2016.csv',
- types = {
- Facility_ID: 'VARCHAR',
- Prvdr_Ctgry_Cd: 'VARCHAR',
- Prvdr_Ownrshp_Cd: 'VARCHAR',
- Prvdr_Prtcptn_Cd: 'VARCHAR'
- })
- formatter: oldstyle
- - name: bls_gov
- config:
- plugin: excel
- tables:
- - name: consumer_price_index
- meta:
- external_location: ../data/r-cpi-u-rs-allitems.xlsx
- sheet_name: 'All items'
- - name: mitre_org
- tables:
- - name: synthea
- meta:
- external_location: >
- read_json('../data/Gudrun69_Shaunna800_Goyette777_d5e33bd1-960e-bcf4-e5f9-9a4afc6d5a5e.json',
- columns = {
- entry: 'JSON[]'
- })
- formatter: oldstyle
- - name: social_determinants_of_health
- description: "These files contain [version 1 of AHRQ's database on Social Determinants of Health (SDOH)](https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html), which was created under a project funded by the Patient Centered Outcomes Research (PCOR) Trust Fund. The database is curated from existing Federal datasets and other publicly available data sources. The purpose of these files is to make it easier to find a range of well-documented, readily linkable SDOH variables across domains without having to access multiple source files, facilitating SDOH research and analysis. Variables in the files correspond to five key SDOH domains: social context, economic context, education, physical infrastructure, and healthcare context. The files can be linked to other data by geography. Data are available at the county, ZIP Code, and census tract levels."
- config:
- plugin: excel
- tables:
- - name: sdoh_zipcode_2011
- description: Social Determinants of Health Zipcode data from 2011.
- meta:
- external_location: ../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2012
- description: Social Determinants of Health Zipcode data from 2012.
- meta:
- external_location: ../data/SDOH/SDOH_2012_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2013
- description: Social Determinants of Health Zipcode data from 2013.
- meta:
- external_location: ../data/SDOH/SDOH_2013_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2014
- description: Social Determinants of Health Zipcode data from 2014.
- meta:
- external_location: ../data/SDOH/SDOH_2014_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2015
- description: Social Determinants of Health Zipcode data from 2015.
- meta:
- external_location: ../data/SDOH/SDOH_2015_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2016
- description: Social Determinants of Health Zipcode data from 2016.
- meta:
- external_location: ../data/SDOH/SDOH_2016_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2017
- description: Social Determinants of Health Zipcode data from 2017.
- meta:
- external_location: ../data/SDOH/SDOH_2017_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2018
- description: Social Determinants of Health Zipcode data from 2018.
- meta:
- external_location: ../data/SDOH/SDOH_2018_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2019
- description: Social Determinants of Health Zipcode data from 2019.
- meta:
- external_location: ../data/SDOH/SDOH_2019_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_zipcode_2020
- description: Social Determinants of Health Zipcode data from 2020.
- meta:
- external_location: ../data/SDOH/SDOH_2020_ZIPCODE_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2009
- description: Social Determinants of Health County data from 2009.
- meta:
- external_location: ../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2010
- description: Social Determinants of Health County data from 2010.
- meta:
- external_location: ../data/SDOH/SDOH_2010_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2011
- description: Social Determinants of Health County data from 2011.
- meta:
- external_location: ../data/SDOH/SDOH_2011_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2012
- description: Social Determinants of Health County data from 2012.
- meta:
- external_location: ../data/SDOH/SDOH_2012_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2013
- description: Social Determinants of Health County data from 2013.
- meta:
- external_location: ../data/SDOH/SDOH_2013_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2014
- description: Social Determinants of Health County data from 2014.
- meta:
- external_location: ../data/SDOH/SDOH_2014_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2015
- description: Social Determinants of Health County data from 2015.
- meta:
- external_location: ../data/SDOH/SDOH_2015_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2016
- description: Social Determinants of Health County data from 2016.
- meta:
- external_location: ../data/SDOH/SDOH_2016_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2017
- description: Social Determinants of Health County data from 2017.
- meta:
- external_location: ../data/SDOH/SDOH_2017_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2018
- description: Social Determinants of Health County data from 2018.
- meta:
- external_location: ../data/SDOH/SDOH_2018_COUNTY_1_1.xlsx
- sheet_name: Data
- - name: sdoh_county_2019
- description: Social Determinants of Health County data from 2019.
- meta:
- external_location: ../data/SDOH/SDOH_2019_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_county_2020
- description: Social Determinants of Health County data from 2020.
- meta:
- external_location: ../data/SDOH/SDOH_2020_COUNTY_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2009
- description: Social Determinants of Health tract data from 2009.
- meta:
- external_location: ../data/SDOH/sdoh_2009_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2010
- description: Social Determinants of Health tract data from 2010.
- meta:
- external_location: ../data/SDOH/sdoh_2010_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2011
- description: Social Determinants of Health tract data from 2011.
- meta:
- external_location: ../data/SDOH/sdoh_2011_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2012
- description: Social Determinants of Health tract data from 2012.
- meta:
- external_location: ../data/SDOH/sdoh_2012_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2013
- description: Social Determinants of Health tract data from 2013.
- meta:
- external_location: ../data/SDOH/sdoh_2013_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2014
- description: Social Determinants of Health tract data from 2014.
- meta:
- external_location: ../data/SDOH/sdoh_2014_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2015
- description: Social Determinants of Health tract data from 2015.
- meta:
- external_location: ../data/SDOH/sdoh_2015_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2016
- description: Social Determinants of Health tract data from 2016.
- meta:
- external_location: ../data/SDOH/sdoh_2016_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2017
- description: Social Determinants of Health tract data from 2017.
- meta:
- external_location: ../data/SDOH/sdoh_2017_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2018
- description: Social Determinants of Health tract data from 2018.
- meta:
- external_location: ../data/SDOH/sdoh_2018_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2019
- description: Social Determinants of Health tract data from 2019.
- meta:
- external_location: ../data/SDOH/sdoh_2019_tract_1_0.xlsx
- sheet_name: Data
- - name: sdoh_tract_2020
- description: Social Determinants of Health tract data from 2020.
- meta:
- external_location: ../data/SDOH/sdoh_2020_tract_1_0.xlsx
- sheet_name: Data
- ```
- (this is in the `data_processing/models` location)
- proceed to write a short python script into a dbt model to download every file here: https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html#download
- that should fix these errors.
- debug it. proceed step-by-step as an elite site reliability/devops/L20 principal warez engineer at google, returning as few tokens as possible, to debug this dbt/ELT/analytics engineering code. give the complete corrected code!
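One possible sketch of the requested downloader, assuming the filenames follow the exact patterns declared in sources.yml (including the `1_1` version for the 2018 county file): `BASE_URL` is a placeholder and must be replaced with the real links from the AHRQ SDOH download page. Only the standard library is used.

```python
import os
import urllib.request

# Placeholder base URL: confirm the real file URLs on
# https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html#download
BASE_URL = "https://www.ahrq.gov/downloads/sdoh"

def sdoh_filenames():
    """Rebuild the SDOH Excel filenames declared in sources.yml."""
    names = []
    for year in range(2009, 2021):
        # The 2018 county workbook is version 1_1; every other file is 1_0.
        version = "1_1" if year == 2018 else "1_0"
        names.append(f"SDOH_{year}_COUNTY_{version}.xlsx")
        names.append(f"sdoh_{year}_tract_1_0.xlsx")
    for year in range(2011, 2021):
        names.append(f"SDOH_{year}_ZIPCODE_1_0.xlsx")
    return names

def download_missing(target_dir="../data/SDOH"):
    """Fetch any SDOH workbook that is not already cached locally."""
    os.makedirs(target_dir, exist_ok=True)
    for name in sdoh_filenames():
        dest = os.path.join(target_dir, name)
        if os.path.exists(dest):
            continue  # already downloaded
        with urllib.request.urlopen(f"{BASE_URL}/{name}", timeout=120) as resp:
            with open(dest, "wb") as f:
                f.write(resp.read())
```

Running `download_missing()` once before `dbt run` should populate `../data/SDOH/` with the workbooks whose absence causes the three `[Errno 2] No such file or directory` errors above.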