(website_linkage) λ scrapy crawl footer -a suffix=co.uk
2023-08-04 13:10:10 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: web_crawler)
2023-08-04 13:10:10 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.9.17 (main, Jul 5 2023, 20:47:11) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform Windows-10-10.0.14393-SP0
2023-08-04 13:10:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'web_crawler',
 'CONCURRENT_ITEMS': 1000,
 'CONCURRENT_REQUESTS': 100,
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 0.2,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'web_crawler.spiders',
 'REACTOR_THREADPOOL_MAXSIZE': 20,
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'RETRY_ENABLED': False,
 'ROBOTSTXT_OBEY': True,
 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
 'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pqueues.DownloaderAwarePriorityQueue',
 'SPIDER_MODULES': ['web_crawler.spiders']}
2023-08-04 13:10:10 [scrapy.extensions.telnet] INFO: Telnet Password: c56ea58f5b1ef9f0
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
Loading environment variables from .env...
Connecting to database website_linkage on localhost:5432...
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled item pipelines:
['web_crawler.pipelines.FooterPipeline']
2023-08-04 13:10:10 [scrapy.core.engine] INFO: Spider opened
2023-08-04 13:10:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-04 13:10:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
Fetching unprocessed urls for suffix co.uk...
len urls: 7940617
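The line above suggests the spider materializes all 7,940,617 pending URLs in memory before issuing a single request. If that becomes a problem, one option is to feed `start_requests` in fixed-size batches; a minimal sketch of such a batching helper (the helper name and batch size are illustrative, not from the project):

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items from any iterable,
    without materializing the whole sequence in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Paired with a server-side database cursor, this keeps the resident set bounded regardless of how many rows the `co.uk` suffix matches.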
2023-08-04 13:11:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-04 13:11:37 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "avatarspecialeditionmovie.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'avatarspecialeditionmovie.co.uk'))])
2023-08-04 13:11:38 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "diaryofthewimpykid.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'diaryofthewimpykid.co.uk'))])
2023-08-04 13:11:38 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://renatos.co.uk/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\core\downloader\middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
2023-08-04 13:11:39 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "twelverounds.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'twelverounds.co.uk'))])
2023-08-04 13:11:39 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://stephensimpson.co.uk/robots.txt>: DNS lookup failed: no results for hostname lookup: www.stephensimpsonphoto.co.ukrobots.txt.
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\internet\defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\python\failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\core\downloader\middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\internet\endpoints.py", line 1022, in startConnectionAttempts
    raise error.DNSLookupError(
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.stephensimpsonphoto.co.ukrobots.txt.
2023-08-04 13:11:40 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "foxygamesbingo.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'foxygamesbingo.co.uk'))])
2023-08-04 13:11:40 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "ladbrokesplc.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'ladbrokesplc.co.uk'))])
2023-08-04 13:11:40 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\core\engine.py", line 158, in _next_request
    request = next(self.slot.start_requests)
  File "C:\projects\finstat\website_linkage\web_crawler\web_crawler\spiders\footer.py", line 50, in start_requests
    yield scrapy.Request(
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\http\request\__init__.py", line 93, in __init__
    self._set_url(url)
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\http\request\__init__.py", line 143, in _set_url
    raise ValueError(f"Missing scheme in request url: {self._url}")
ValueError: Missing scheme in request url: httpswwwwinchammotservicesltdcouk.co.uk
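Two symptoms point at corrupted URLs in the source data: the DNS failure at 13:11:39 (a "hostname" of `www.stephensimpsonphoto.co.ukrobots.txt`, with the separators squashed out) and this `ValueError` for `httpswwwwinchammotservicesltdcouk.co.uk`, where the scheme has been fused into the host. Note also that an unhandled exception inside `start_requests` stops the engine from pulling any further start requests, which would explain why only 148 of 7.9M URLs were ever scheduled. A hedged sketch of a guard the spider could apply before yielding each `scrapy.Request` (the function and its heuristics are illustrative, not the project's actual code):

```python
from urllib.parse import urlsplit

def normalize_url(raw):
    """Return a crawlable URL, or None if the record is unsalvageable.

    Heuristics assume the failure modes seen in this run:
    bare domains missing a scheme, and records whose scheme and
    dots were stripped out entirely ('httpswww...couk.co.uk'),
    which cannot be repaired mechanically.
    """
    url = raw.strip()
    if "://" in url:
        parts = urlsplit(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            return url
        return None
    if url.startswith("http"):
        # Scheme fused into the host, as in the ValueError above: drop it.
        return None
    if "." in url:
        # Bare domain: assume plain HTTP and let redirects upgrade it.
        return "http://" + url
    return None
```

Yielding only `normalize_url(row)` results that are not `None` (and logging the rest) would keep one bad row from aborting the whole generator.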
2023-08-04 13:11:42 [scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 10945: invalid start byte
2023-08-04 13:11:43 [scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 10945: invalid start byte
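Byte `0xa3` is `£` in Latin-1/Windows-1252, so these robots.txt files are most likely Latin-1-encoded, common for UK-hosted sites. Scrapy deliberately treats such files as empty (which is the permissive outcome here); if you ever need to read these files yourself, a tolerant decode along these lines would recover them (this is a sketch, not how Scrapy's own `decode_robotstxt` behaves):

```python
def decode_robots(body: bytes) -> str:
    """Decode a robots.txt body, tolerating non-UTF-8 files.

    cp1252 covers Latin-1's printable range (0xa3 -> '£'),
    and errors='replace' guarantees the decode never raises.
    """
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode("cp1252", errors="replace")
```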
2023-08-04 13:12:10 [scrapy.extensions.logstats] INFO: Crawled 43 pages (at 43 pages/min), scraped 0 items (at 0 items/min)
2023-08-04 13:12:24 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-04 13:12:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 344107,
 'downloader/request_count': 1478,
 'downloader/request_method_count/GET': 1478,
 'downloader/response_bytes': 1974945,
 'downloader/response_count': 1476,
 'downloader/response_status_count/200': 46,
 'downloader/response_status_count/301': 789,
 'downloader/response_status_count/302': 510,
 'downloader/response_status_count/303': 105,
 'downloader/response_status_count/308': 2,
 'downloader/response_status_count/403': 2,
 'downloader/response_status_count/404': 22,
 'dupefilter/filtered': 64,
 'elapsed_time_seconds': 134.104222,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 8, 4, 13, 12, 24, 788478),
 'httpcompression/response_bytes': 136378,
 'httpcompression/response_count': 46,
 'log_count/ERROR': 3,
 'log_count/INFO': 12,
 'log_count/WARNING': 7,
 'response_received_count': 43,
 "robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
 "robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
 'robotstxt/request_count': 105,
 'robotstxt/response_count': 43,
 'robotstxt/response_status_count/200': 21,
 'robotstxt/response_status_count/403': 2,
 'robotstxt/response_status_count/404': 20,
 'scheduler/dequeued': 148,
 'scheduler/dequeued/memory': 148,
 'scheduler/enqueued': 148,
 'scheduler/enqueued/memory': 148,
 'start_time': datetime.datetime(2023, 8, 4, 13, 10, 10, 684256)}
2023-08-04 13:12:24 [scrapy.core.engine] INFO: Spider closed (finished)
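One detail worth pulling out of the stats dump: redirects dominate the traffic. A quick check over the status counters above (values copied from the dump; only the relevant keys reproduced):

```python
# Status counters as reported in the stats dump above.
stats = {
    "downloader/response_count": 1476,
    "downloader/response_status_count/200": 46,
    "downloader/response_status_count/301": 789,
    "downloader/response_status_count/302": 510,
    "downloader/response_status_count/303": 105,
    "downloader/response_status_count/308": 2,
    "downloader/response_status_count/403": 2,
    "downloader/response_status_count/404": 22,
}
# Sum every 3xx counter: 789 + 510 + 105 + 2 = 1406.
redirects = sum(v for k, v in stats.items()
                if k.startswith("downloader/response_status_count/3"))
share = redirects / stats["downloader/response_count"]
```

Roughly 95% of the 1,476 responses are redirects against only 46 200s, which is consistent with the bare-`http://` start URLs being bounced to `https://` and `www.` variants before any footer content is reached.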