rs6000

crawler_DIA

Apr 11th, 2017
'''
author: smilehsu
requirements: Windows 7, Python 3.5.2
date: 2017-09-25 09:26:29
'''

'''
Search criteria: Philippines, Cebu; female, ages 18-30; sorted by last active
URL format (as of 2017-09-25):
https://www.dateinasia.com/Search.aspx?g=2&af=18&at=30&c=PH&ci=Cebu&s=2

Page 1
https://www.dateinasia.com/Search.aspx?pg=0&g=2&af=18&at=30&c=PH&ci=Cebu&s=2
Page 2
https://www.dateinasia.com/Search.aspx?pg=1&g=2&af=18&at=30&c=PH&ci=Cebu&s=2
Page 3
https://www.dateinasia.com/Search.aspx?pg=2&g=2&af=18&at=30&c=PH&ci=Cebu&s=2

Page control parameter
pg=<number> (0-based: pg=0 is the first page)

Other parameters

'''

import requests
from bs4 import BeautifulSoup

base_url='https://www.dateinasia.com'
searchfor='&g=2&af=18&at=30&c=PH&ci=Cebu&s=2'
page_list=[]
page_link=[]  # currently unused

# Index of the last results page to build a URL for (pages are 0-based)
num=2

# Build the search URL for each page, pg=0 through pg=num
for pg in range(0,num+1):
    get_page='pg='+str(pg)
    page_list.append(base_url+'/Search.aspx?'+get_page+searchfor)

# Test
'''
for p in page_list:
   print('Page')
'''

#print("Page 1:\n"+page_list[0])

# Fetch the user profile links on the page (first results page only)

res=requests.get(page_list[0])
res.raise_for_status()  # stop early on an HTTP error

soup=BeautifulSoup(res.content, 'html5lib')  # requires the html5lib package

get_data=soup.find_all("span",{'class':'responsive-container galleryphoto-responsive'})

# Print the user profile links and photo links found on the page
#print(get_data)


User_Page_Link=[]

for link in get_data:
    img=link.find('img')
    if img is None or not img.get('alt'):
        continue  # skip containers without a usable alt text
    # The script takes the user name from the img alt text;
    # spaces become '+' in the profile URL
    UserName=img['alt'].replace(' ','+')
    UserLink=base_url+'/'+UserName
    User_Page_Link.append(UserLink)
    print(UserName)
    print(UserLink)

print("Number of users on page:",len(User_Page_Link))
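
The script above assembles its query strings by hand. As a minimal sketch, assuming the URL format described in the docstring still holds, the same URLs could be built with the standard library's `urllib.parse`. The helper names `build_search_urls` and `profile_url` are hypothetical and not part of the original script. The parameter order in the output relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on, so this is newer than the Python version the author lists.

```python
from urllib.parse import urlencode, quote_plus

BASE_URL = 'https://www.dateinasia.com'

# Filter values copied from the search URL in the script's docstring.
SEARCH_FILTERS = {'g': 2, 'af': 18, 'at': 30, 'c': 'PH', 'ci': 'Cebu', 's': 2}


def build_search_urls(last_page, filters=SEARCH_FILTERS):
    """Return search URLs for pages 0..last_page (hypothetical helper)."""
    urls = []
    for pg in range(last_page + 1):
        # Put pg first so the query string matches the documented format.
        query = urlencode({'pg': pg, **filters})
        urls.append(BASE_URL + '/Search.aspx?' + query)
    return urls


def profile_url(user_name):
    """Build a profile URL from a user name (hypothetical helper)."""
    # quote_plus turns spaces into '+', like the script's replace(' ', '+'),
    # and also percent-encodes other characters that are unsafe in a URL.
    return BASE_URL + '/' + quote_plus(user_name)


if __name__ == '__main__':
    for url in build_search_urls(2):
        print(url)
    print(profile_url('Some User'))
```

Compared with string concatenation, `urlencode` handles escaping of the filter values, and keeping the filters in a dictionary makes it easier to change a single search criterion.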