- Get all links in a web page (2)
- You can also use the HTMLParser module.
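- # Python 2: HTMLParser and urllib.urlopen() both come from the standard library.
- import HTMLParser, urllib
- class linkParser(HTMLParser.HTMLParser):
-     # Collects the href value of every <a> start tag fed to the parser.
-     def __init__(self):
-         HTMLParser.HTMLParser.__init__(self)
-         self.links = []
-     def handle_starttag(self, tag, attrs):
-         if tag == 'a':
-             # Anchors without an href (e.g. <a name="top">) are skipped
-             # instead of raising KeyError.
-             href = dict(attrs).get('href')
-             if href:
-                 self.links.append(href)
- # Read at most 200000 bytes of the page.
- htmlSource = urllib.urlopen("http://learn.pythonanywhere.com").read(200000)
- p = linkParser()
- p.feed(htmlSource)
- for link in p.links:
-     print link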
- For each HTML start tag encountered, the handle_starttag() method will be called.
- For example, <a href="http://google.com"> will trigger a call to handle_starttag('a', [('href', 'http://google.com')]) on the parser; the tag name is always passed in lowercase.
- See also the other handle_*() methods in the Python manual.
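To make the handler calls concrete, here is a minimal sketch that records every call the parser makes. It assumes Python 3, where the module is named html.parser rather than HTMLParser; the class name EventLogger and the sample markup are made up for illustration.

```python
# Python 3 sketch: the module is html.parser here (an assumption about the
# reader's interpreter; the snippet above targets Python 2's HTMLParser).
from html.parser import HTMLParser


class EventLogger(HTMLParser):  # hypothetical name, for illustration only
    """Records every handler call so the arguments can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # text found between tags
        self.events.append(("data", data))


logger = EventLogger()
logger.feed('<A HREF="http://google.com">Google</A>')
logger.close()
for event in logger.events:
    print(event)
```

The first recorded event is ('start', 'a', [('href', 'http://google.com')]): both the tag name and the attribute name come back in lowercase even though the markup used uppercase.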
- (Note that HTMLParser is not bullet-proof: it will choke on ill-formed HTML. In that case, use the sgmllib module, fall back to regular expressions, or use BeautifulSoup.)
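As a rough illustration of the regular-expression fallback mentioned above, the sketch below pulls quoted href values out of anchor tags with the standard re module. The pattern, the helper name find_links and the sample markup are assumptions made for this example. It is a heuristic, not a parser: unquoted attribute values are ignored, and attributes whose names merely end in "href" (such as data-href) can also match.

```python
import re

# Deliberately loose pattern: an <a ...> tag containing a quoted href value.
# Group 1 is the opening quote character, group 2 the value up to the same quote.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE,
)


def find_links(html):  # hypothetical helper name
    """Return the href values of anchor tags found by the regex."""
    return [match.group(2) for match in HREF_RE.finditer(html)]


# Ill-formed input: unclosed tags, and a missing closing quote on the last link.
broken = '<p><a href="http://google.com">Google<a href=\'/about\'>About<a href="oops>'
print(find_links(broken))
```

The last anchor is skipped because its quote is never closed, so only the first two links are returned.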