Aug 4th, 2023
(website_linkage) λ scrapy crawl footer -a suffix=co.uk
2023-08-04 13:10:10 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: web_crawler)
2023-08-04 13:10:10 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.9.17 (main, Jul 5 2023, 20:47:11) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform Windows-10-10.0.14393-SP0
2023-08-04 13:10:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'web_crawler',
 'CONCURRENT_ITEMS': 1000,
 'CONCURRENT_REQUESTS': 100,
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 0.2,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'web_crawler.spiders',
 'REACTOR_THREADPOOL_MAXSIZE': 20,
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'RETRY_ENABLED': False,
 'ROBOTSTXT_OBEY': True,
 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
 'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pqueues.DownloaderAwarePriorityQueue',
 'SPIDER_MODULES': ['web_crawler.spiders']}
2023-08-04 13:10:10 [scrapy.extensions.telnet] INFO: Telnet Password: c56ea58f5b1ef9f0
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
Loading environment variables from .env...
Connecting to database website_linkage on localhost:5432...
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-04 13:10:10 [scrapy.middleware] INFO: Enabled item pipelines:
['web_crawler.pipelines.FooterPipeline']
2023-08-04 13:10:10 [scrapy.core.engine] INFO: Spider opened
2023-08-04 13:10:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-04 13:10:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
Fetching unprocessed urls for suffix co.uk...
len urls: 7940617
2023-08-04 13:11:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-04 13:11:37 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "avatarspecialeditionmovie.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'avatarspecialeditionmovie.co.uk'))])
2023-08-04 13:11:38 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "diaryofthewimpykid.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'diaryofthewimpykid.co.uk'))])
2023-08-04 13:11:38 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://renatos.co.uk/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\core\downloader\middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
2023-08-04 13:11:39 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "twelverounds.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'twelverounds.co.uk'))])
2023-08-04 13:11:39 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://stephensimpson.co.uk/robots.txt>: DNS lookup failed: no results for hostname lookup: www.stephensimpsonphoto.co.ukrobots.txt.
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\internet\defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\python\failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\core\downloader\middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback( # type: ignore[misc]
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\twisted\internet\endpoints.py", line 1022, in startConnectionAttempts
    raise error.DNSLookupError(
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.stephensimpsonphoto.co.ukrobots.txt.
2023-08-04 13:11:40 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "foxygamesbingo.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'foxygamesbingo.co.uk'))])
2023-08-04 13:11:40 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "ladbrokesplc.co.uk"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'ladbrokesplc.co.uk'))])
2023-08-04 13:11:40 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\core\engine.py", line 158, in _next_request
    request = next(self.slot.start_requests)
  File "C:\projects\finstat\website_linkage\web_crawler\web_crawler\spiders\footer.py", line 50, in start_requests
    yield scrapy.Request(
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\http\request\__init__.py", line 93, in __init__
    self._set_url(url)
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\http\request\__init__.py", line 143, in _set_url
    raise ValueError(f"Missing scheme in request url: {self._url}")
ValueError: Missing scheme in request url: httpswwwwinchammotservicesltdcouk.co.uk
2023-08-04 13:11:42 [scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 10945: invalid start byte
2023-08-04 13:11:43 [scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
Traceback (most recent call last):
  File "C:\Users\robo\miniconda3\envs\website_linkage\lib\site-packages\scrapy\robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 10945: invalid start byte
2023-08-04 13:12:10 [scrapy.extensions.logstats] INFO: Crawled 43 pages (at 43 pages/min), scraped 0 items (at 0 items/min)
2023-08-04 13:12:24 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-04 13:12:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 344107,
 'downloader/request_count': 1478,
 'downloader/request_method_count/GET': 1478,
 'downloader/response_bytes': 1974945,
 'downloader/response_count': 1476,
 'downloader/response_status_count/200': 46,
 'downloader/response_status_count/301': 789,
 'downloader/response_status_count/302': 510,
 'downloader/response_status_count/303': 105,
 'downloader/response_status_count/308': 2,
 'downloader/response_status_count/403': 2,
 'downloader/response_status_count/404': 22,
 'dupefilter/filtered': 64,
 'elapsed_time_seconds': 134.104222,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 8, 4, 13, 12, 24, 788478),
 'httpcompression/response_bytes': 136378,
 'httpcompression/response_count': 46,
 'log_count/ERROR': 3,
 'log_count/INFO': 12,
 'log_count/WARNING': 7,
 'response_received_count': 43,
 "robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
 "robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
 'robotstxt/request_count': 105,
 'robotstxt/response_count': 43,
 'robotstxt/response_status_count/200': 21,
 'robotstxt/response_status_count/403': 2,
 'robotstxt/response_status_count/404': 20,
 'scheduler/dequeued': 148,
 'scheduler/dequeued/memory': 148,
 'scheduler/enqueued': 148,
 'scheduler/enqueued/memory': 148,
 'start_time': datetime.datetime(2023, 8, 4, 13, 10, 10, 684256)}
2023-08-04 13:12:24 [scrapy.core.engine] INFO: Spider closed (finished)