- Definitions:
- - 'source' is a file or API backend published on the web by Rosstat or another agency
- - 'clean source' is something we can trust for its structure, usually an API
- - 'messy source' is something whose structure changes once in a while, e.g. Word files
- - 'scraper' is a program that downloads the data without transforming it (downloads files, unpacks them from zip/rar)
- - 'parser' is a program that reads raw data and makes 'processed output'
- - 'processed output' is the canonical result of parsing, importable to the production database
- In our pipeline:
- - Scraper loads Source to Raw Database
- - Parser reads Source from Raw Database layer and produces Processed Output
- - Processed Output is imported into Production Database
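The three steps above can be sketched roughly as follows. This is only an illustration: the in-memory dict and list standing in for the Raw Database and Production Database, the function names (`scrape`, `parse`, `import_output`) and the comma-separated sample data are all assumptions, not part of the actual pipeline.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the two storage layers.
RawDatabase = Dict[str, bytes]      # source name -> untouched download
ProductionDatabase = List[dict]     # imported rows


def scrape(name: str, download: Callable[[], bytes], raw_db: RawDatabase) -> None:
    """Download step: store the source as-is, with no transformation."""
    raw_db[name] = download()


def parse(name: str, raw_db: RawDatabase) -> List[dict]:
    """Parser step: read raw data from the raw layer and build processed output."""
    text = raw_db[name].decode("utf-8")
    rows = []
    for line in text.splitlines():
        label, value = line.split(",")
        rows.append({"label": label.strip(), "value": float(value)})
    return rows


def import_output(rows: List[dict], prod_db: ProductionDatabase) -> None:
    """Import step: load processed output into the production layer."""
    prod_db.extend(rows)


raw_db: RawDatabase = {}
prod_db: ProductionDatabase = []

# Made-up sample data in place of a real download.
scrape("demo", lambda: b"GDP,100.5\nCPI,101.2", raw_db)
import_output(parse("demo", raw_db), prod_db)
```

Note that the raw layer keeps the original bytes untouched, so the parser can be re-run later without downloading the source again.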
- Sometimes a parser can handle a source well by itself, especially if the source is an API.
- This way it can bypass querying the Raw Database, right?
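One way such a bypass might look, as a sketch under assumptions: the parser takes any callable that returns raw bytes, so it does not care whether those bytes come from the raw layer or straight from the API. `parse_json`, `fake_api_call` and the JSON shape are hypothetical and only illustrate the idea.

```python
import json
from typing import Callable, Dict, List


def parse_json(read_raw: Callable[[], bytes]) -> List[dict]:
    """Parser that only needs 'something that returns raw bytes'."""
    records = json.loads(read_raw().decode("utf-8"))
    return [{"label": r["label"], "value": float(r["value"])} for r in records]


def fake_api_call() -> bytes:
    """Hypothetical clean source: stands in for a real API request."""
    return b'[{"label": "GDP", "value": "100.5"}]'


# Route 1: the API response is first saved to a raw layer, then parsed from it.
raw_db: Dict[str, bytes] = {"api_dump": fake_api_call()}
via_raw_db = parse_json(lambda: raw_db["api_dump"])

# Route 2: the parser reads the API directly and skips the raw layer.
direct = parse_json(fake_api_call)
```

Both routes give the same processed output here; the difference is that the direct route keeps no copy of what the API returned at the time of parsing.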
- Question:
- - Need clarification about the Raw Database layer: do we always need it?
- Sometimes