Untitled

a guest
Jan 21st, 2016
from scrapy.http import FormRequest, Request
from scrapy.spiders import CrawlSpider


class user_scrape(CrawlSpider):
    name = "spyder"
    allowed_domains = ["domain.tld"]
    start_urls = [
        "http://domain.tld/page1",
        "http://domain.tld/page2",
    ]

    login_user = "username"
    login_pass = "secret"
    login_page = "http://domain.tld/login.php"

    def start_requests(self):
        # Fetch the login page first instead of going straight to start_urls.
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        print("----- LOGIN -----")
        # Fill in and submit the login form found on the page.
        return FormRequest.from_response(
            response,
            formname='login_form',  # must match the form's name attribute
            formdata={
                'username': self.login_user,
                'password': self.login_pass,
                'cookietime': 'on',
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        print(response.url)
        print(response.body)

        # Once logged in, request the real start URLs (handled by parse).
        return [Request(url=url) for url in self.start_urls]

    def parse(self, response):
        print(response.url)

2016-01-21 16:34:23 [scrapy] INFO: Scrapy 1.0.4 started (bot: UsersScrape)
2016-01-21 16:34:23 [scrapy] INFO: Optional features available: ssl, http11
2016-01-21 16:34:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'UsersScrape.spiders', 'SPIDER_MODULES': ['UsersScrape.spiders'], 'RETRY_TIMES': 5, 'BOT_NAME': 'UsersScrape', 'RETRY_HTTP_CODES': [400, 408, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530], 'DOWNLOAD_DELAY': 1, 'USER_AGENT': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'}
2016-01-21 16:34:24 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-21 16:34:24 [scrapy] INFO: Enabled downloader middlewares: RetryMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-21 16:34:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-21 16:34:24 [scrapy] INFO: Enabled item pipelines:
2016-01-21 16:34:24 [scrapy] INFO: Spider opened
2016-01-21 16:34:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-21 16:34:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-21 16:34:24 [scrapy] DEBUG: Crawled (200) <GET http://domain.tld/login.php?> (referer: None)
----- LOGIN -----
2016-01-21 16:34:25 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld.com/> from <POST http://domain.tld/takelogin.php>
2016-01-21 16:34:27 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld/> from <GET http://domain.tld/>
2016-01-21 16:34:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://domain.tld/>
2016-01-21 16:34:27 [scrapy] INFO: Closing spider (finished)
2016-01-21 16:34:27 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1261,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 3877,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 2,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 1, 21, 15, 34, 27, 101000),
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 1, 21, 15, 34, 24, 238000)}
2016-01-21 16:34:27 [scrapy] INFO: Spider closed (finished)

<form method="post" name="login_form" action="takelogin.php" onsubmit="return startLoginVerify();">
  <table id="login_form" border="0" cellpadding=5>
    <tr>
      <td colspan="2" align="right">
        <img style="cursor:pointer;" onClick="close_login_box();" src="pic/close.gif" align="right">
      </td>
    </tr>
    <tr>
      <td class=rowhead style="padding-left:25px;">User:</td>
      <td align=left style="padding-right:25px;">
        <input type="text" size=30 name="username" id="navbar_login_menu_input_to_focus_on" />
      </td>
    </tr>
    <tr>
      <td class=rowhead>Password:</td>
      <td align=left><input type="password" size=30 name="password" /></td>
    </tr>
    ....
  </table>
</form>
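
The value passed as formname to FormRequest.from_response in the spider should match the name attribute of a form on the login page. As a quick sanity check, the form names in a page can be listed with the standard library alone. This is only a minimal sketch: FormNameCollector and form_names are hypothetical helper names, not part of Scrapy, and the sample string is just the opening tag of the form above with a closing tag added.

```python
from html.parser import HTMLParser


class FormNameCollector(HTMLParser):
    """Collect the name attribute of every <form> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # attrs is a list of (attribute, value) pairs; a form without
            # a name attribute is recorded as None.
            self.names.append(dict(attrs).get("name"))


def form_names(html):
    """Return the name attributes of all forms found in the given HTML."""
    collector = FormNameCollector()
    collector.feed(html)
    return collector.names


# Opening tag copied from the login form above (assumption: closed right away).
sample = (
    '<form method="post" name="login_form" action="takelogin.php" '
    'onsubmit="return startLoginVerify();"></form>'
)
print(form_names(sample))  # prints ['login_form']
```

If the name used in the spider is not in the printed list, from_response cannot select the form by that name.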