2017-12-17 14:20:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6025
2017-12-17 14:21:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-17 14:22:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-17 14:22:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://fr.example.com/robots.txt> (failed 1 times): TCP connection timed out: 110: Connection timed out.
import scrapy
import itertools


class SomeSpider(scrapy.Spider):
    name = 'some'
    # allowed_domains entries must be bare domain names, not URLs
    allowed_domains = ['fr.example.com']

    def start_requests(self):
        categories = ['thing1', 'thing2', 'thing3']
        base = "https://fr.example.com/things?t={category}&p={index}"
        # One request per (category, page) pair: 3 categories x 10 pages
        for category, index in itertools.product(categories, range(1, 11)):
            yield scrapy.Request(base.format(category=category, index=index))

    def parse(self, response):
        response.selector.remove_namespaces()
        info1 = response.css("span.info1").extract()
        info2 = response.css("span.info2").extract()
        # Pair up the two lists element-wise and emit one item per pair
        for item in zip(info1, info2):
            scraped_info = {
                'info1': item[0],
                'info2': item[1],
            }
            yield scraped_info
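The start_requests generator enumerates every (category, page) combination with itertools.product. Run standalone, the same pattern shows the URL set the spider schedules (the category names and URL template are the ones from the spider above):

```python
import itertools

categories = ['thing1', 'thing2', 'thing3']
base = "https://fr.example.com/things?t={category}&p={index}"

# Cartesian product: each category paired with page indices 1..10
urls = [base.format(category=c, index=i)
        for c, i in itertools.product(categories, range(1, 11))]

print(len(urls))   # 3 categories x 10 pages -> 30
print(urls[0])     # https://fr.example.com/things?t=thing1&p=1
```

Note that none of these pages get crawled in the log above: the very first request, for robots.txt, times out, so the retry middleware is still waiting while logstats reports 0 pages.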