simonhalfdan

Python Code to extract data from a Twitter HTML file

Jun 8th, 2013
import csv
from bs4 import BeautifulSoup

# Written for Python 3 with the beautifulsoup4 package installed.
# <li> elements whose class attribute is exactly this string anchor each tweet.
TWEET_CLASS = "js-stream-item stream-item stream-item expanding-stream-item"
TIMESTAMP_CLASS = "tweet-timestamp js-permalink js-nav"

# Input HTML document (a saved Twitter page). It is best to make sure the
# input is English with Latin characters.
with open("raw twitter input.html", encoding="utf-8") as infile:
    soup = BeautifulSoup(infile, "html.parser")

# CSV output file. The delimiter sets how columns are separated; '\t' is a tab.
with open("output.csv", "w", newline="", encoding="utf-8") as outfile:
    f = csv.writer(outfile, delimiter="\t")
    f.writerow(["username", "time and date", "permalink", "tweettext", "URLlinked"])  # CSV column headings

    for li in soup.find_all("li", TWEET_CLASS):
        # The class strings below sometimes need to be changed to account for
        # differences in the HTML files saved by different browsers.
        username = li.find("span", "username js-action-profile-name").get_text()

        timestamp = li.find("a", TIMESTAMP_CLASS)
        timedate = timestamp.attrs["title"]  # time and date of the tweet
        # Permalink of the tweet. Adding https://www.twitter.com before it
        # would link directly to the tweet on twitter.com.
        permalink = timestamp.attrs["href"]

        # Text of the tweet. Unfortunately this sometimes returns some other
        # stray characters as well and may need fine tuning.
        tweettext = li.find("p", "js-tweet-text tweet-text").get_text().replace("\n", "")

        # URL linked from the tweet. It seems safe to assume that users will
        # only link one URL in their tweet, so only the first one is taken.
        link = li.find("a", "twitter-timeline-link")
        if link is not None and link.has_attr("href"):
            URLlinked = link.attrs["href"]
        else:
            print("no URL present")
            URLlinked = ""

        # The order here sets the column order and must match the header row.
        # URLlinked is kept last because not all tweets have a linked URL.
        f.writerow([username, timedate, permalink, tweettext, URLlinked])
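
The comments in the script mention three pieces of fine tuning: prefixing the permalink with https://www.twitter.com, cleaning stray characters out of the tweet text, and handling tweets with no linked URL. A minimal sketch of helpers for these follows. The names absolute_permalink, clean_tweet_text, href_or_empty and TWITTER_BASE are hypothetical and not part of the original paste, and the whitespace cleanup is only one guess at what the "random characters" are.

```python
import re
from urllib.parse import urljoin

# Base URL taken from the permalink comment in the script (assumed prefix).
TWITTER_BASE = "https://www.twitter.com"


def absolute_permalink(href, base=TWITTER_BASE):
    """Turn a relative permalink such as '/user/status/1' into a full URL."""
    return urljoin(base, href)


def clean_tweet_text(text):
    """Collapse runs of whitespace (newlines, tabs, non-breaking spaces)
    into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def href_or_empty(tag):
    """Return the tag's href, or "" when the tag is missing or has no href.

    Works with anything that offers a dict-style .get(), which includes a
    BeautifulSoup Tag.
    """
    if tag is None:
        return ""
    return tag.get("href", "")
```

In the loop these could be used as, for example, URLlinked = href_or_empty(li.find("a", "twitter-timeline-link")), which covers the missing-URL case without a separate except branch. Removing tabs from the tweet text also keeps it from colliding with the tab delimiter used in output.csv.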