sbmonzur

Using BeautifulSoup to remove tags

Mar 9th, 2021
from bs4 import BeautifulSoup

# Read the saved HTML that contains the articles.
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    rawData = my_file.read()
print(rawData)

soup = BeautifulSoup(rawData, "html.parser")
titles = [h1_tag.text for h1_tag in soup.select('h1')]
dates = [span_tag.text for span_tag in soup.select('div.storyPageMetaData-m__publish-time__19bdV > span')]

# This section is not working properly: it does not return all of the body
# text of an article. In the file, an article's body text starts at the first
# "div.story-element.story-element-text" and ends before the next h1 tag.
# In between there are several more div.story-element.story-element-text
# tags, which probably mark new paragraphs. The code below returns one list
# item per paragraph instead of one per article.

bodytext = [div_tag.text for div_tag in soup.select('div.story-element.story-element-text')]

print(bodytext[0])  # only the first paragraph of the first article, not the whole article

print(bodytext[1])  # the second paragraph of the first article, not the second article
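One possible way to get one body text per article is to select the h1 tags and the paragraph divs together, then start a new group each time an h1 appears. The sketch below is only an illustration under stated assumptions: sample_html is a made-up stand-in for listfile.txt, group_body_by_article is a hypothetical helper name, and it assumes every article begins with an h1 followed by its div.story-element.story-element-text paragraphs, as described in the comment above. The real markup may need a different selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for listfile.txt; the real markup may differ.
sample_html = """
<h1>First headline</h1>
<div class="story-element story-element-text"><p>A1 paragraph one.</p></div>
<div class="story-element story-element-text"><p>A1 paragraph two.</p></div>
<h1>Second headline</h1>
<div class="story-element story-element-text"><p>A2 paragraph one.</p></div>
"""


def group_body_by_article(html):
    """Return one body-text string per h1, joining that article's paragraph divs."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # select() yields matches in document order, so each h1 is followed
    # by the paragraph divs that come after it and before the next h1.
    for tag in soup.select("h1, div.story-element.story-element-text"):
        if tag.name == "h1":
            articles.append([])  # a new article starts here
        elif articles:  # skip any paragraph that appears before the first h1
            articles[-1].append(tag.get_text(strip=True))
    return ["\n".join(paragraphs) for paragraphs in articles]


bodytext = group_body_by_article(sample_html)
print(bodytext[0])  # both paragraphs of the first article
print(bodytext[1])  # the single paragraph of the second article
```

With this grouping, bodytext has the same length as titles, so bodytext[i] lines up with titles[i] as long as every h1 in the file is an article headline.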