does_not_work_lighttable

a guest
Sep 27th, 2014
# <markdown>
# This is a lab about interacting with web pages and XML (I think)

# <codecell>

import urllib2
import bs4
import json
import datetime as dt
import pandas as pd
import numpy as np
import unicodedata

# <markdown>
# urllib2 is a useful module for getting information from the web
# (it only fetches the raw HTML, so content that a page builds with
# JavaScript will not be in the result)
# the function urlopen() opens a URL and returns a file-like object
# to read the entire HTML into a single string, use read()
# to read it line by line, use readline()
# close() closes the connection when you are done

# keep reading the docs to get used to this library:
# https://docs.python.org/2/library/urllib2.html

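# <codecell>
x = urllib2.urlopen("http://www.google.com")
htmlSource = x.read()   # the whole page as one string
x.close()
print type(htmlSource)  # a bare type(...) is only displayed when it is the last line of a cell
print htmlSource[:800]  # first 800 characters only
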
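# <markdown>
# A minimal sketch of read() versus readline(), not part of the original lab.
# Assumptions: to avoid depending on the network it opens a temporary local
# file through a file: URL instead of a website, and the helper name
# demo_read_vs_readline is made up for illustration. The try/except import
# is only there so the cell also runs on Python 3, where urlopen lives in
# urllib.request.

# <codecell>
import os
import tempfile

try:
    from urllib2 import urlopen                        # Python 2
    from urllib import pathname2url
except ImportError:
    from urllib.request import urlopen, pathname2url   # Python 3


def demo_read_vs_readline():
    # write a tiny two-line "page" to a temporary file
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "wb") as f:
        f.write(b"<html>\n<body>hello</body></html>\n")
    url = "file:" + pathname2url(path)
    try:
        # read() returns everything that is left as one string
        # (a bytes object on Python 3)
        x = urlopen(url)
        whole = x.read()
        x.close()

        # readline() returns one line per call, newline included
        x = urlopen(url)
        first = x.readline()
        second = x.readline()
        x.close()
    finally:
        os.remove(path)
    return whole, first, second


whole, first, second = demo_read_vs_readline()
print(len(whole))
print(repr(first))
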
# <markdown>
# reading the HTML source is fine, but now you have to parse it (the same goes for XML),
# for which we will use BeautifulSoup
# let's try it with a different URL, as google.com is silly

# <codecell>
x = urllib2.urlopen("http://www.reddit.com")
htmlSource = x.read()
x.close()
print htmlSource[:500]


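# <codecell>
# name the parser explicitly; "html.parser" ships with Python
soup = bs4.BeautifulSoup(htmlSource, "html.parser")
print type(soup)
print soup.prettify()[:100]       # start of the re-indented document
print soup.head.prettify()[:600]  # start of the <head> element only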
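
# <markdown>
# A small sketch of what a soup object lets you do once a page is parsed,
# not part of the original lab.
# Assumptions: it parses a short made-up HTML string (sample_html) rather
# than a live page so the result is predictable, and it picks "html.parser",
# the parser from the standard library. All variable names are illustrative.

# <codecell>
import bs4

sample_html = """
<html>
  <head><title>A tiny page</title></head>
  <body>
    <a href="/first">first link</a>
    <a href="/second">second link</a>
  </body>
</html>
"""

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# tags can be reached as attributes; .string gives the text inside the tag
page_title = sample_soup.title.string

# find_all() returns every matching tag; tag attributes are read like a dict
links = [a["href"] for a in sample_soup.find_all("a")]
link_texts = [a.get_text() for a in sample_soup.find_all("a")]

print(page_title)
print(links)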