# Pasted by a guest on May 1st, 2012
# WikiStance. A Wikipedia distance meter.
# Based on the alt text of http://xkcd.org/903/
# Takes a Wikipedia URL and measures the distance from that page to the Philosophy article by repeatedly
# following the first link that is neither in parentheses nor in italics.
#
# Author::    Alejandro Fernández  (mailto:antarticonorte@gmail.com)
# Copyright:: Copyright (c) 2011 Alejandro Fernández
# License::   GPL
#
# =Usage=
#
# require 'wikistance'
#
# url = 'http://en.wikipedia.org/wiki/Scrubs_%28TV_series%29'
# ws = WikiStance.new(url)
# ws.trace  # Go through all the pages until we reach Philosophy
# ws.distance # => 22
# ws.breadcrumbs # => ["List of characters on Scrubs", "NBC", "United States", ..., "Philosophy"]

require 'rubygems'
require 'mechanize'

class WikiStance

  attr_reader :title, :breadcrumbs

  def initialize(url)

    # Accept only English Wikipedia article URLs (http or https) with a non-empty title
    if url =~ /\Ahttps?:\/\/en\.wikipedia\.org\/wiki\/.+/
      @url = url

      # Wikipedia returns 403 with the default user agent
      @agent = Mechanize.new
      @agent.user_agent_alias = 'Mac Safari'

      reset

    else
      raise ArgumentError, "You should use a valid Wikipedia link"
    end

  end

  # Resets the instance: reloads the starting page and restarts the breadcrumb trail from its title
  def reset
    @page        = @agent.get(@url)
    @breadcrumbs = []
    @title       = page_title
    @breadcrumbs << @title
    true
  end

  # Gets the title of the current @page
  def page_title
    @page.at('#firstHeading').text
  end

  # Follows first links from page to page until we reach Philosophy, recording each title in @breadcrumbs
  def trace

    while page_title != 'Philosophy'
      click_first_link
      title = page_title

      # Avoid entering an infinite loop
      if @breadcrumbs.include?(title)
        raise "We are repeating ourselves! We already visited \"#{title}\""
      end
      @breadcrumbs << title
    end

    true

  end

  def distance
    # Breadcrumbs include the starting page, so if we start at Philosophy the distance is 0
    @breadcrumbs.length - 1
  end

  private

  def click_first_link
    first_link = nil

    # div#bodyContent is where Wikipedia shows the article's content.
    # The article text sits in <p> elements that are direct children of div#bodyContent, so selecting only
    # those skips <p> elements inside TOCs and other boxes. It also skips disambiguation notes and similar
    # Wikipedia notices (all of which contain links in italics), because they are in <div> instead of <p>.
    @page.search('#bodyContent > p').each do |p|

      # Links between parens should not be clicked.
      # I tried using a regex with lookbehind to check whether a link has an opening parenthesis before it,
      # but Ruby 1.8 doesn't support lookbehind, so I just remove all text between parens and in italics...
      text = p.to_html.gsub(/\((?:.*?)\)/, '').gsub(/<i>(?:.*?)<\/i>/, '')

      # ...and then get the first link whose href is not a same-page anchor (#...).
      first_link = text.match(/<a(?:.*?)href\=\"[^#](.*?)\"(?:.*?)\/a>/)
      break unless first_link.nil?
    end

    raise "Oops! It seems that \"#{page_title}\" has no links" if first_link.nil?
    # Escape the captured href so characters such as "." or "?" are matched literally
    @page = @page.links_with(:href => /#{Regexp.escape(first_link[1])}/).first.click
  end


end
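
# A possible refinement of the link-picking step, offered only as a sketch and not as part of WikiStance.
# The paren-stripping regex in click_first_link stops at the first ")" it finds, so nested parentheses
# leave leftovers, and it also rewrites parentheses inside href attributes. The helpers below
# (strip_parenthesized and first_link_href, both hypothetical names) track nesting depth by hand and
# ignore parentheses that appear inside tags. They work on plain strings, so no network access is needed.

```ruby
# Hypothetical helper: drops parenthesized text, nested parentheses included,
# while leaving parentheses inside tags alone, so an href such as
# "/wiki/Comedy_(genre)" survives. Tags that sit inside parentheses are dropped
# together with the text, so the result may not be balanced HTML.
def strip_parenthesized(html)
  depth  = 0      # number of "(" currently open in the text
  in_tag = false  # true while scanning between "<" and ">"
  out    = +''    # unfrozen result buffer

  html.each_char do |c|
    if in_tag
      in_tag = false if c == '>'
      out << c if depth.zero?
    elsif c == '<'
      in_tag = true
      out << c if depth.zero?
    elsif c == '('
      depth += 1
    elsif c == ')'
      if depth > 0
        depth -= 1
      else
        out << c  # unmatched ")" is kept as ordinary text
      end
    elsif depth.zero?
      out << c
    end
  end

  out
end

# Hypothetical helper: returns the href of the first link that is outside
# parentheses and italics and is not a same-page anchor, or nil if none exists.
def first_link_href(paragraph_html)
  text  = strip_parenthesized(paragraph_html).gsub(%r{<i>.*?</i>}m, '')
  match = text.match(/<a\b[^>]*\bhref="([^#"][^"]*)"/)
  match && match[1]
end

# Made-up paragraph for illustration only.
sample = '<p><i>See <a href="/wiki/Other">other</a></i> <b>Scrubs</b> ' \
         '(an <a href="/wiki/NBC">NBC</a> show (2001)) is a ' \
         '<a href="#cite_note-1">[1]</a> ' \
         '<a href="/wiki/Comedy_(genre)">comedy</a>.</p>'

puts first_link_href(sample)  # prints /wiki/Comedy_(genre)
```

# Because the returned href is complete, it could be passed through Regexp.escape and handed to
# links_with(:href => ...) in place of the captured fragment, assuming the page markup is well-formed.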