sbmonzur

BeautifulSoup Tag Removal

Mar 9th, 2021
from bs4 import BeautifulSoup

# Read the saved HTML of the news pages.
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    rawData = my_file.read()
print(rawData)

soup = BeautifulSoup(rawData, "html.parser")
titles = [h1_tag.text for h1_tag in soup.select('h1')]  # titles of the news articles
dates = [span_tag.text for span_tag in soup.select('div.storyPageMetaData-m__publish-time__19bdV > span')]  # dates of publication

# Problem: the code below does not return the full body text of each article.
# In the file, an article's body starts at the first
# "div.story-element.story-element-text" and ends before the next h1 tag.
# In between there are a couple more div.story-element.story-element-text tags,
# which possibly denote new paragraphs. As a result, each item in the list
# below is a single paragraph, not a whole article.

bodytext = [div_tag.text for div_tag in soup.select('div.story-element.story-element-text')]

print(bodytext[0])  # only the first paragraph of the first article, not the whole article

print(bodytext[1])  # the second paragraph of the first article, not the second article
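
One possible way to get one body per article, sketched under the assumption described in the comments above: every article starts with an h1 and all of its div.story-element.story-element-text blocks come before the next h1. The helper name group_articles and the sample_html string are made up for illustration; the real file may be structured differently, in which case the selector would need adjusting. The idea is to walk the h1 tags and body divs together in document order and attach each div to the most recent h1.

```python
from bs4 import BeautifulSoup


def group_articles(html):
    """Return a list of (title, body) pairs, one pair per <h1> in the HTML.

    Assumes each article's body divs appear after its <h1> and before the
    next <h1> (hypothetical helper, not part of BeautifulSoup).
    """
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # A selector list returns matches in document order, so headings and
    # body paragraphs come back interleaved exactly as they appear.
    for tag in soup.select("h1, div.story-element.story-element-text"):
        if tag.name == "h1":
            # A heading starts a new article.
            articles.append({"title": tag.get_text(strip=True), "paragraphs": []})
        elif articles:
            # A body div belongs to the most recent heading; divs that
            # appear before the first heading are skipped.
            articles[-1]["paragraphs"].append(tag.get_text(" ", strip=True))
    return [(a["title"], "\n\n".join(a["paragraphs"])) for a in articles]


# Made-up sample standing in for listfile.txt.
sample_html = """
<h1>First article</h1>
<div class="story-element story-element-text"><p>A1</p></div>
<div class="story-element story-element-text"><p>A2</p></div>
<h1>Second article</h1>
<div class="story-element story-element-text"><p>B1</p></div>
"""

articles = group_articles(sample_html)
for title, body in articles:
    print(title)
    print(body)
    print("-" * 20)
```

With the real data, group_articles(rawData) would take the place of the bodytext list comprehension, and articles[1][1] would then be the full body of the second article rather than the second paragraph of the first.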