a guest Jun 19th, 2017 54 Never
- - 'source' is a file or API backend published on the web by Rosstat or another agency
- - 'clean source' is something we can trust for its structure, usually an API
- - 'messy source' is something whose structure changes once in a while, e.g. Word files
- - 'scraper' is a program that downloads the data without transforming it (download files, unpack from zip/rar)
- - 'parser' is a program that reads raw data and makes 'processed output'
- - 'processed output' is the canonical result of parsing, importable to the production database
- In our pipeline:
- - Scraper loads Source to Raw Database
- - Parser reads Source data from the Raw Database layer and produces Processed Output
- - Processed Output is imported into Production Database
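
The three pipeline steps above could be sketched roughly as follows. This is only an illustration, not our actual code: every name here (`raw_database`, `scraper`, `parser`, `import_to_production`, `fake_download`) is hypothetical, plain Python containers stand in for the real databases, and a small CSV payload stands in for a downloaded file.

```python
# Sketch only: all names are hypothetical; a dict and a list stand in for real databases.
import csv
import io

raw_database = {}         # source name -> raw bytes, stored exactly as downloaded
production_database = []  # rows of processed output


def scraper(source_name, download):
    """Load the source into the raw layer without transforming it."""
    raw_database[source_name] = download()


def parser(source_name):
    """Read raw data from the raw layer and return processed output."""
    text = raw_database[source_name].decode("utf-8")
    rows = csv.DictReader(io.StringIO(text))
    return [{"name": r["name"], "value": float(r["value"])} for r in rows]


def import_to_production(processed_output):
    """Import processed output into the production layer."""
    production_database.extend(processed_output)


def fake_download():
    # Stand-in for fetching a published file from the web.
    return b"name,value\nGDP,100.5\nCPI,101.2\n"


scraper("example_source", fake_download)
import_to_production(parser("example_source"))
print(production_database)
```

Note that the parser never touches the web here: it only sees what the scraper stored, so a messy source can be re-parsed after a parser fix without downloading it again.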
- Sometimes a parser can handle a source well by itself, especially if the source is an API.
- In that case it can bypass querying the Raw Database, right?
- - need clarification about the Raw Database layer: do we always need it?
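
One way to frame the open question is to make the raw layer optional per source. The sketch below is an assumption, not a decision: `run`, `parse`, `fetch_from_api` and the `keep_raw` flag are hypothetical names, and the API call is simulated by a function returning JSON text.

```python
# Sketch only: hypothetical names; the API call is simulated, no network access.
import json

raw_database = {}  # source name -> raw payload, stored as received


def fetch_from_api():
    # Stand-in for an HTTP request to a clean source.
    return '[{"name": "GDP", "value": 100.5}]'


def parse(payload):
    """Turn a raw JSON payload into processed output."""
    return [{"name": d["name"], "value": float(d["value"])} for d in json.loads(payload)]


def run(source_name, fetch, keep_raw):
    """Fetch and parse a source, optionally storing the raw payload first."""
    payload = fetch()
    if keep_raw:
        # A stored raw copy lets us re-parse later without calling the source again.
        raw_database[source_name] = payload
    return parse(payload)


direct = run("api_source", fetch_from_api, keep_raw=False)  # bypasses the raw layer
via_raw = run("api_source", fetch_from_api, keep_raw=True)  # keeps a raw copy
```

Both calls give the same processed output; the difference is only whether a raw copy exists afterwards. That is the trade-off to settle: bypassing is simpler for a clean source, while keeping the raw copy preserves what was actually received on a given day.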