Get all links in a web page (2)

You can also use the HTMLParser module from the Python 2 standard library (in Python 3 the same class lives in html.parser).
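import HTMLParser, urllib

class linkParser(HTMLParser.HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors without an href, e.g. <a name="top">
                self.links.append(href)

# Download at most 200000 bytes of the page, then parse it.
htmlSource = urllib.urlopen("http://learn.pythonanywhere.com").read(200000)
p = linkParser()
p.feed(htmlSource)
for link in p.links:
    print link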


For each HTML start tag encountered, the handle_starttag() method is called.
For example, <a href="http://google.com"> triggers the call handle_starttag('a', [('href', 'http://google.com')]) on the parser: the tag name is passed in lowercase and the attributes as a list of (name, value) pairs.
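
To see exactly what handle_starttag() receives, here is a minimal sketch. The TagLogger class is a hypothetical helper written for this illustration, and the import is wrapped so the same code runs on Python 2 and Python 3. It feeds the parser a literal string instead of a downloaded page.

```python
try:
    from HTMLParser import HTMLParser   # Python 2, as used in this snippet
except ImportError:
    from html.parser import HTMLParser  # Python 3 location of the same class

class TagLogger(HTMLParser):
    """Hypothetical helper: records every (tag, attrs) pair it receives."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append((tag, attrs))

logger = TagLogger()
# Upper-case tag and attribute names are lowercased by the parser.
logger.feed('<A HREF="http://google.com">Google</A>')
print(logger.calls)   # [('a', [('href', 'http://google.com')])]
```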

See also the other handle_*() methods in the Python manual.
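
As an illustration of how the other handlers can work together, the sketch below also overrides handle_data() and handle_endtag() to capture the text of each link along with its href. The class name LinkTextParser and its attributes are made up for this example, and nested <a> tags are not handled.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class LinkTextParser(HTMLParser):
    """Hypothetical helper: collects (href, link text) pairs."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')  # None if no href
            self._text = []

    def handle_data(self, data):
        # Called for plain text between tags; keep it only inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

parser = LinkTextParser()
parser.feed('<p>See <a href="/docs">the <b>docs</b></a> or <a name="x">here</a>.</p>')
print(parser.links)   # [('/docs', 'the docs')]
```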

(Note that HTMLParser is not bullet-proof: it will choke on malformed HTML. In that case, use the sgmllib module, fall back to regular expressions, or use BeautifulSoup.)
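
For the regular-expression fallback, a rough sketch could look like the following. The pattern and the find_links() name are assumptions made for this example, not a general HTML grammar. It only catches quoted href values inside <a ...> tags and will miss unquoted ones, but it does not care whether the surrounding markup is well formed.

```python
import re

# Assumed pattern: an <a ...> tag containing a quoted href attribute.
HREF_RE = re.compile(r'''<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']''', re.IGNORECASE)

def find_links(html):
    """Return every quoted href value found inside an <a ...> tag."""
    return HREF_RE.findall(html)

# Deliberately sloppy markup: unclosed tags, mixed case, mixed quotes.
sample = '<a class=x href="http://a.com/">A<a href=\'b.html\'>B</A><p><a name=top>'
print(find_links(sample))   # ['http://a.com/', 'b.html']
```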