drpanwe

Code Review

Apr 7th, 2019
https://www.youtube.com/watch?v=eIWFnNz8mF4&t=217s

It is a concurrent scraper in @golang

How it works:

You call the binary (iterscraper) and pass it a URL such as "http://foo.com/%d", where '%d' is a pattern that will be replaced by an ID, e.g. 'http://foo.com/1' up to 'http://foo.com/9'. You then choose how many goroutines to run at the same time (-concurrency) and where the output should be written (-output). Finally, '-nameQuery', '-addressQuery' and '-emailQuery' are the CSS selectors used to find whatever we are looking for (name? address? e-mail?) on each page (e.g. 'http://foo.com/1').
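An invocation then looks roughly like this (a sketch: the '-url' flag name and the selector values are assumptions; only -concurrency, -output, -nameQuery, -addressQuery and -emailQuery come from the notes above):

iterscraper -url 'http://foo.com/%d' -concurrency 5 -output out.csv -nameQuery '.name' -addressQuery '.address' -emailQuery '.email'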
  8.  
A basic package for scraping information from websites where the URL contains an incrementing integer. Information is retrieved from HTML5 elements and output as a CSV.
  10.  
1. Fetch the code:
go get github.com/philipithomas/iterscraper

2. Go to the code (with pre-modules 'go get', the source lands under $GOPATH/src):
cd $GOPATH/src/github.com/philipithomas/iterscraper

3. Create a new branch:
git checkout -b work

4. Open VSCode:
code .

main.go
-------
* Defines all the flags
* Parses those flags
* Uses a WaitGroup and channels to communicate between the different parts of the work
* There are three parts:
* emitTasks --> generates every single task that we need to do. Each task is a URL with an ID, and is sent to the 'taskChan' channel.
* scrape --> a worker that receives tasks from taskChan, fetches and parses the URL, finds whatever we need to find, and sends the results to the 'dataChan' channel.
* writeSites --> writes all the output to a CSV file
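The three parts above can be sketched with stdlib Go only. This is a hedged sketch, not the real main.go: the task/site structs, the buildURL helper, the fabricated scrape result, and the fixed ID range 1-9 are assumptions; a real worker would do an HTTP GET and apply the CSS selectors.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"sync"
)

// task mirrors the idea in main.go: one URL to fetch, built from an ID.
type task struct {
	id  int
	url string
}

// site is one scraped result destined for the CSV writer.
type site struct {
	id                   int
	name, address, email string
}

// buildURL substitutes the ID into the %d pattern, e.g. http://foo.com/3.
func buildURL(pattern string, id int) string {
	return fmt.Sprintf(pattern, id)
}

func main() {
	const pattern = "http://foo.com/%d"
	taskChan := make(chan task)
	dataChan := make(chan site)

	// emitTasks: generate one task per ID and send it to taskChan.
	go func() {
		for id := 1; id <= 9; id++ {
			taskChan <- task{id: id, url: buildURL(pattern, id)}
		}
		close(taskChan)
	}()

	// scrape: a pool of workers reads from taskChan and sends results
	// to dataChan. Real code would fetch the URL and apply the CSS
	// selectors; here we fabricate a result (assumption).
	var wg sync.WaitGroup
	for w := 0; w < 3; w++ { // -concurrency = 3
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range taskChan {
				dataChan <- site{id: t.id, name: fmt.Sprintf("name-%d", t.id)}
			}
		}()
	}
	// Close dataChan once every worker has finished, so the
	// writer's range loop below terminates.
	go func() {
		wg.Wait()
		close(dataChan)
	}()

	// writeSites: a single consumer writes every result as a CSV row.
	w := csv.NewWriter(os.Stdout)
	for s := range dataChan {
		w.Write([]string{fmt.Sprint(s.id), s.name, s.address, s.email})
	}
	w.Flush()
}
```

Closing dataChan from a separate goroutine after wg.Wait() is the standard fan-in pattern: only the side that knows all producers are done may close the channel.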