johnmahugu

python - get all links in a webpage 2

Jun 14th, 2015
Get all links in a web page (2)

You can also use the HTMLParser module.
import HTMLParser, urllib  # Python 2 module names

class linkParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:  # skip anchors without an href, e.g. <a name="top">
                self.links.append(href)

# Read at most 200000 bytes of the page.
htmlSource = urllib.urlopen("http://learn.pythonanywhere.com").read(200000)
p = linkParser()
p.feed(htmlSource)
for link in p.links:
    print link

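The listing above is Python 2. As a rough sketch, the same idea on Python 3 might look like the following, where the parser class lives in html.parser. The class name LinkParser and the sample_html string are illustrative assumptions, not part of the original paste; the sample string stands in for a downloaded page so the example runs without network access.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # ignore anchors without an href
                self.links.append(href)


# Hypothetical sample page, used here instead of a network request.
sample_html = (
    '<p><a href="http://google.com">Google</a> '
    '<a name="top">no link</a> '
    '<a href="/about">About</a></p>'
)

link_parser = LinkParser()
link_parser.feed(sample_html)
for link in link_parser.links:
    print(link)
```

To parse a live page instead, the text could come from urllib.request.urlopen(url).read().decode("utf-8", "replace"), assuming the page is UTF-8 encoded.
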
For each HTML start tag encountered, the handle_starttag() method is called.
For example, <a href="http://google.com"> triggers the call handle_starttag(self, 'a', [('href', 'http://google.com')]). The tag name is always passed in lowercase, and the attributes arrive as a list of (name, value) pairs.

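To see those arguments directly, here is a small sketch (Python 3 module name assumed) that records every handle_starttag() call. The class name StartTagLogger and the input string are made up for illustration; the input is written in uppercase on purpose to show the lowercasing.

```python
from html.parser import HTMLParser


class StartTagLogger(HTMLParser):
    """Record the arguments of every handle_starttag() call."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append((tag, attrs))


logger = StartTagLogger()
logger.feed('<A HREF="http://google.com" CLASS="ext">Google</A>')
print(logger.calls)
# prints [('a', [('href', 'http://google.com'), ('class', 'ext')])]
```
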
See also the other handle_*() methods in the Python manual.

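As one possible illustration of the other handlers, the sketch below combines handle_starttag(), handle_data() and handle_endtag() to pair each link with its visible text. It assumes Python 3 and non-nested <a> elements; LinkTextParser and the sample markup are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Pair each link's href with the text inside the <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Called for text between tags; keep it only while inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


text_parser = LinkTextParser()
text_parser.feed('<a href="/docs">Read the <b>docs</b></a> and <a href="/faq">FAQ</a>')
text_parser.close()
print(text_parser.links)
# prints [('/docs', 'Read the docs'), ('/faq', 'FAQ')]
```
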
(Note that HTMLParser is not bullet-proof: it will choke on ill-formed HTML. In that case, use the sgmllib module (Python 2 only), fall back to regular expressions, or use BeautifulSoup.)
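
For the regular-expression fallback, a minimal sketch could look like this. The pattern HREF_RE and the helper find_links() are assumptions written for this example, not a standard API. The pattern only catches quoted href values inside <a ...> tags, so it misses unquoted values and can be fooled by markup inside comments or scripts.

```python
import re

# Rough heuristic: href="..." or href='...' inside an <a ...> tag.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def find_links(html):
    """Return the quoted href values of <a> tags found in html."""
    return [match.group(2) for match in HREF_RE.finditer(html)]


# Ill-formed sample: the <a> elements are never closed.
broken_html = '<p><a href="http://google.com">Google<a HREF=\'/about\'>About</p>'
print(find_links(broken_html))
# prints ['http://google.com', '/about']
```
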