from bs4 import BeautifulSoup

# Read the saved HTML of the news articles.
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    rawData = my_file.read()
print(rawData)

soup = BeautifulSoup(rawData, "html.parser")
titles = [h1_tag.text for h1_tag in soup.select('h1')]  # titles of the news articles
dates = [span_tag.text for span_tag in soup.select('div.storyPageMetaData-m__publish-time__19bdV > span')]  # dates of publication

# The code below is not working properly: it does not return all of the body text.
# In the file, the body text of an article starts at the first
# "div.story-element.story-element-text" and ends before the next h1 tag.
# In between there are several more div.story-element.story-element-text tags,
# which possibly denote new paragraphs. Each list element below therefore holds
# only one paragraph of body text, not a whole article.
bodytext = [div_tag.text for div_tag in soup.select('div.story-element.story-element-text')]
print(bodytext[0])  # only the first paragraph of the first article, not the whole article
print(bodytext[1])  # the second paragraph of the first article, not the second article
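
One possible way to get one body per article is to walk the h1 tags and the story-element divs together in document order and start a new article at each h1. The following is a minimal sketch, not a tested fix for the real file: the function name group_body_by_article and the SAMPLE_HTML string are hypothetical stand-ins, and it assumes the story-element divs are not nested inside one another and that every article begins with an h1.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for listfile.txt; the real markup may differ.
SAMPLE_HTML = """
<h1>First title</h1>
<div class="story-element story-element-text"><p>A1</p></div>
<div class="story-element story-element-text"><p>A2</p></div>
<h1>Second title</h1>
<div class="story-element story-element-text"><p>B1</p></div>
"""


def group_body_by_article(html):
    """Return a list of (title, body) pairs, one pair per h1 tag."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []  # each entry is [title, [paragraph, ...]]
    # A comma-separated selector list returns matches in document order,
    # so headings and paragraphs come back interleaved as they are in the file.
    for tag in soup.select("h1, div.story-element.story-element-text"):
        text = tag.get_text(" ", strip=True)
        if tag.name == "h1":
            articles.append([text, []])  # a heading starts a new article
        elif articles:  # ignore paragraphs that appear before the first heading
            articles[-1][1].append(text)
    # Join each article's paragraphs into a single body string.
    return [(title, "\n\n".join(paragraphs)) for title, paragraphs in articles]


articles = group_body_by_article(SAMPLE_HTML)
for title, body in articles:
    print(title)
    print(body)
```

With the real data, passing rawData instead of SAMPLE_HTML should give one entry per article, so articles[1] would be the second article instead of the second paragraph of the first one.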