furas

Python - scrape babycenter.com - (Stackoverflow)

Mar 18th, 2022 (edited)
# author: Bartlomiej "furas" Burek (https://blog.furas.pl)
# date: 2022.03.18
# [scraping baby names python - Stack Overflow](https://stackoverflow.com/questions/71525007/scraping-baby-names-python)

'''
A few mistakes and problems:

- it has to be `class_=...` instead of `classname=...`
  (funny, because in the next functions you correctly use `class_` or `{"class": ...}`)

- you search for `<td class="row">` but this page doesn't have it - it has `<div class="row">`.
  (funny, because you assign it to the variable `div_results`, so maybe it is only a typo)
  But it would be simpler to use a CSS selector and search with `select("td.nameCell.bodyLinks a")`

- you check `if name not in bodyLinks:` but you should do `if name not in babynames:`

- the page uses relative urls like `/baby-names-josiah-2356.htm` but `requests` needs absolute urls like `https://www.babycenter.com/baby-names-josiah-2356.htm`, so you have to add `https://www.babycenter.com` to the url

- you add the result with `babnames[mean]` but you should use the name: `babnames[name]`

I see another problem: the code may run for a long time, and if you wanted to stop it you would have to use Ctrl+C, which would skip the code that saves the data to `csv` - it is better to put the `while True` inside `with open() as ...:` and write each new row directly when you get a new name
'''

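The two fixes about `class_` and relative urls can be sketched on a small HTML snippet (the `<td>` markup here is illustrative, not copied from the real page):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<td class="nameCell bodyLinks"><a href="/baby-names-josiah-2356.htm">Josiah</a></td>'
soup = BeautifulSoup(html, "html.parser")

# keyword must be `class_` (with underscore), not `classname`
cell = soup.find("td", class_="nameCell")

# or the equivalent CSS selector
link = soup.select_one("td.nameCell.bodyLinks a")

# `requests` needs an absolute url - urljoin() joins the base with the relative href
absolute = urljoin("https://www.babycenter.com", link["href"])
print(absolute)  # https://www.babycenter.com/baby-names-josiah-2356.htm
```
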
from bs4 import BeautifulSoup
import requests
import csv

babnames = {}
start_index = 0

with open('babnames.csv', 'w', newline='', encoding="utf-8") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Name', 'Meaning'])

    while True:
        print('start_index:', start_index)
        req = requests.get(f"https://www.babycenter.com/babyNamerSearch.htm?startIndex={start_index}&excludeLimit=100&gender=MALE&containing=&origin=&includeLimit=100&sort=&meaning=&endsWith=&theme=&batchSize=40&includeExclude=ALL&numberOfSyllables=&startsWith=")
        soup = BeautifulSoup(req.content, "lxml")

        found = False

        results = soup.select('td.nameCell.bodyLinks a')
        print('results:', len(results))

        for result in results:
            name = result.get_text(strip=True)
            print('>>> name:', name)

            if name in babnames:
                print('- skip -')
            else:
                print('name:', name)

                link = 'https://www.babycenter.com' + result['href']
                print('link:', link)

                response_details = requests.get(link)
                soup_details = BeautifulSoup(response_details.content, "lxml")

                a_mean = soup_details.find("p", {"class": "bodyText"})
                if a_mean:
                    mean = a_mean.get_text(strip=True)
                else:
                    mean = '#'

                print('mean:', mean)

                babnames[name] = mean
                found = True
                csv_output.writerow([name, mean])

        # Keep going until no new names found
        if found:
            start_index += 40
        else:
            break