By: a guest on May 27th, 2012
Automatically extracting feed links (Atom, RSS, etc.) from web pages

from BeautifulSoup import BeautifulSoup as parser

def detect_feeds_in_HTML(input_stream):
    """Examine an open text stream containing HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags with ``rel="alternate"``.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``
    :rtype: ``list(tuple(str, str))``
    """
    # check that we were really given an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # parse the textual data (the HTML) from the input stream
    html = parser(input_stream.read())
    # find all link tags whose "rel" attribute is "alternate"
    feed_urls = html.findAll("link", rel="alternate")
    # extract URL and type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        feed_type = feed_link.get("type", None)
        # keep the link only if it actually has a URL
        if url:
            result.append((url, feed_type))
    return result

<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">
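
The snippet above relies on the old BeautifulSoup 3 package. For comparison, here is a minimal sketch of the same idea that uses only the Python 3 standard library (html.parser). The names FeedLinkParser, FEED_TYPES and detect_feeds_in_html are hypothetical and not part of the original paste. Restricting matches to the two MIME types shown in the sample tags is also an assumption: rel="alternate" on its own can point at non-feed resources such as translations.

```python
import io
from html.parser import HTMLParser

# Assumption: only these MIME types count as feeds; extend as needed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects (url, feed_type) pairs from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "alternate home"
        rel_tokens = (attributes.get("rel") or "").lower().split()
        feed_type = (attributes.get("type") or "").lower()
        url = attributes.get("href")
        if "alternate" in rel_tokens and feed_type in FEED_TYPES and url:
            self.feeds.append((url, feed_type))


def detect_feeds_in_html(input_stream):
    """Return a list of (url, feed_type) tuples found in the stream's HTML."""
    if not hasattr(input_stream, "read"):
        raise TypeError(
            "An opened input *stream* should be given, was %s instead!"
            % type(input_stream)
        )
    data = input_stream.read()
    # Binary streams (e.g. urllib responses) yield bytes; UTF-8 is assumed here.
    if isinstance(data, bytes):
        data = data.decode("utf-8", errors="replace")
    feed_parser = FeedLinkParser()
    feed_parser.feed(data)
    feed_parser.close()
    return feed_parser.feeds


if __name__ == "__main__":
    sample = io.StringIO(
        '<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">'
        '<link rel="stylesheet" href="style.css">'
    )
    print(detect_feeds_in_html(sample))
    # prints [('http://link.to/feed', 'application/rss+xml')]
```

Note that a type attribute carrying extra parameters (for example a charset suffix) would not match the exact-string comparison in this sketch.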