LorenKPetrov

Python Reddit Scraper V1.0

Nov 15th, 2013
# Python Reddit Scraper V1.0
# This script scrapes a Reddit listing page and exports the results to an XML file,
# which can then be read by whichever method you decide is best.
# PHP XML render page: http://pastebin.com/nip125AJ
# The scraper grabs three things for each post:
#   1. The title of the post
#   2. The comments section URL
#   3. The image / YouTube video linked from the post

# NOTES
# Coded on Python 2.7.5
# Requires the Requests and lxml modules
# Coded by LKP from CodeCall.net

# IMPORTS #
from lxml import html  # lxml's HTML parser, used to build a tree that XPath can query
import xml.etree.cElementTree as XMLT  # ElementTree (C implementation), used to build and write the XML
import requests  # Requests, used to download the page over HTTP

# VARIABLES #
page = requests.get('http://www.reddit.com/r/minecraft')  # Downloads the page to scrape
tree = html.fromstring(page.text)  # Parses the HTML into a tree that XPath can query
title = tree.xpath('//a[@class="title "]/text()')  # Text of each link whose class is "title ". NOTE: the trailing space is intentional; it is present in Reddit's markup.
link = tree.xpath('//li[@class="first"]/a/@href')  # Similar to the above, but grabs the href of the link inside each li tag with the class "first" (the comments URL).
imgur = tree.xpath('//p[@class="title"]/a/@href')  # Again similar: grabs the href of the link inside each p tag with the class "title" (the image / video URL).
root = XMLT.Element("ENTRY")  # Root XML tag, created here so it is not repeated inside the loop
start = 0  # Loop counter used by the while loop below
total = min(len(title), len(link), len(imgur))  # Number of complete entries. Taking the shortest of the three lists avoids an IndexError if the XPath queries return different numbers of results.

# MAIN CODE #
while start < total:  # Runs once per entry: start begins at 0 and goes up by 1 each pass until it reaches total
    doc = XMLT.SubElement(root, "POST")  # Creates the XML tag POST under the root
    field1 = XMLT.SubElement(doc, "TITLE")  # Creates the XML tag TITLE
    field1.text = title[start]  # Sets the tag content for TITLE
    field2 = XMLT.SubElement(doc, "MEDIA")  # Creates the XML tag MEDIA
    field2.text = imgur[start]  # Sets the tag content for MEDIA
    field3 = XMLT.SubElement(doc, "LINK")  # Creates the XML tag LINK
    field3.text = link[start]  # Sets the tag content for LINK

    start = start + 1  # Moves on to the next entry; the loop ends once start is no longer less than total

tree = XMLT.ElementTree(root)  # Wraps the ENTRY root in an ElementTree so it can be written out
tree.write("MC.xml")  # Finally, writes the XML to the specified file
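
The loop above walks three lists by index. As a rough sketch of the same idea (not part of the original script), the three lists could be paired up with zip, which stops at the shortest list on its own. The function name build_entry and the sample values are hypothetical and only for illustration; the ENTRY, POST, TITLE, MEDIA and LINK tag names are taken from the script. It uses xml.etree.ElementTree rather than cElementTree, on the assumption that it may be run on a newer Python where cElementTree is no longer available.

```python
import xml.etree.ElementTree as ET


def build_entry(titles, media, links):
    """Build an ENTRY element with one POST child per (title, media, link) triple.

    build_entry is a hypothetical helper, not something from the original script.
    """
    root = ET.Element("ENTRY")
    # zip stops at the shortest list, so a missing media or link value
    # cannot cause an IndexError.
    for post_title, post_media, post_link in zip(titles, media, links):
        post = ET.SubElement(root, "POST")
        ET.SubElement(post, "TITLE").text = post_title
        ET.SubElement(post, "MEDIA").text = post_media
        ET.SubElement(post, "LINK").text = post_link
    return root


# Made-up sample data standing in for the XPath results.
sample_root = build_entry(
    ["First post", "Second post"],
    ["http://example.com/a.png", "http://example.com/b.png"],
    ["http://example.com/comments/1", "http://example.com/comments/2"],
)
ET.ElementTree(sample_root).write("sample.xml")
```

Keeping the XML building in a function that takes plain lists means it can be tried without downloading anything from Reddit.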