a guest, Jun 19th, 2017
Definitions:
- 'source' is a file or API backend published on the web by Rosstat or another agency
  - 'clean source' is one whose structure we can trust, usually an API
  - 'messy source' is one whose structure changes once in a while, e.g. Word files
- 'scraper' is a program that downloads the data without transforming it (downloads files, unpacks zip/rar archives)
- 'parser' is a program that reads raw data and produces 'processed output'
- 'processed output' is the canonical result of parsing, importable into the production database

In our pipeline:
- Scraper loads Source into Raw Database
- Parser reads Source from the Raw Database layer and produces Processed Output
- Processed Output is imported into Production Database

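The three pipeline steps can be sketched roughly as below. This is a minimal illustration, not the actual implementation: the two dicts stand in for the Raw and Production databases, and all names (`scrape`, `parse`, `import_to_production`, `fake_fetch`, the `"cpi"` id, the CSV layout) are assumptions made up for the example.

```python
import csv
import io

# Hypothetical stand-ins: plain dicts play the role of the two databases.
raw_database = {}
production_database = {}


def scrape(source_id, fetch):
    """Scraper: store the payload exactly as downloaded, with no transformation."""
    raw_database[source_id] = fetch()


def parse(source_id):
    """Parser: read raw data from the raw layer and return processed output."""
    raw_text = raw_database[source_id]
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{"date": row["date"], "value": float(row["value"])} for row in reader]


def import_to_production(source_id, records):
    """Import processed output into the production database."""
    production_database[source_id] = records


def fake_fetch():
    # Stands in for a real download; the CSV layout is invented for this sketch.
    return "date,value\n2017-01,100.5\n2017-02,101.2\n"


scrape("cpi", fake_fetch)
import_to_production("cpi", parse("cpi"))
```

Because the scraper stores the payload untouched, the parser can be re-run on the same raw data without downloading it again.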
Sometimes a parser can handle a source well by itself, especially if the source is an API.
In that case it can bypass querying the Raw Database, right?

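One way to picture the bypass is a parser that accepts an optional direct fetch. This is only a sketch under assumptions: `parse_payload`, `run_parser`, `fake_api` and the JSON layout are hypothetical names and formats chosen for illustration.

```python
import json


def parse_payload(payload):
    """Parser core: turn a raw JSON payload into processed records."""
    return [
        {"date": item["date"], "value": float(item["value"])}
        for item in json.loads(payload)
    ]


def run_parser(source_id, raw_database, fetch=None):
    """Read from the raw store by default; call fetch() directly for a clean source."""
    payload = fetch() if fetch is not None else raw_database[source_id]
    return parse_payload(payload)


def fake_api():
    # Stands in for a real API call; the response shape is invented here.
    return '[{"date": "2017-01", "value": "100.5"}]'


via_raw = run_parser("cpi", {"cpi": fake_api()})   # through the raw layer
direct = run_parser("cpi", {}, fetch=fake_api)     # bypassing the raw layer
```

Both paths give the same processed output here. The difference is that the direct path keeps no stored copy of the raw payload.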
Question:
- Need clarification about the Raw Database layer: do we always need it?

Sometimes