# PS6-1-3-OnlyALittleLucky-Doesn'tWork.py
# By: codeadi on Jan 20th, 2013
# Triple Gold Star

# Only A Little Lucky

# The Feeling Lucky question (from the regular homework) assumed it was enough
# to find the best-ranked page for a given query. For most queries, though, we
# don't just want the best page (according to the page ranking algorithm), we
# want a list of many pages that match the query, ordered from the most likely
# to be useful to the least likely.

# Your goal for this question is to define a procedure, ordered_search(index,
# ranks, keyword), that takes the same inputs as lucky_search from Question 5,
# but returns an ordered list of all the URLs that match the query.

# To order the pages, use the quicksort algorithm, invented by Sir Tony Hoare in
# 1959. Quicksort provides a way to sort any list of data, using an expected
# number of comparisons that scales as n log n, where n is the number of elements
# in the list.

# The idea of quicksort is quite simple:

# If the list has zero or one elements, it is already sorted.

# Otherwise, pick a pivot element, and split the list into two partitions: one
# contains all the elements equal to or lower than the value of the pivot
# element, and the other contains all the elements that are greater than the
# pivot element. Recursively sort each of the sub-lists, and then return the
# result of concatenating the sorted left sub-list, the pivot element, and the
# sorted right sub-list.

# For simplicity, use the first element in the list as your pivot element (this
# is not usually a good choice, since it means that if the input list is already
# nearly sorted, the actual work will be much worse than expected).
def ordered_search(index, ranks, keyword):
    if keyword not in index:
        return None
    # quicksort builds and returns a new sorted list rather than sorting in
    # place, so its result must be captured and returned.
    return quicksort(lookup(index, keyword), ranks)

def quicksort(pages, ranks):
    # Base case: a list with zero or one elements is already sorted.
    # (Without this check, the recursion fails on an empty partition.)
    if len(pages) <= 1:
        return pages
    pivot = ranks[pages[0]]
    worse = []
    better = []
    for page in pages[1:]:
        if ranks[page] <= pivot:
            worse.append(page)
        else:
            better.append(page)
    # Higher-ranked pages come first, so the "better" partition leads.
    return quicksort(better, ranks) + [pages[0]] + quicksort(worse, ranks)
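# As the problem statement notes, taking the first element as the pivot
# degrades badly on nearly sorted input. A common remedy is to pick the pivot
# at random; the sketch below is illustrative only (the name quicksort_random
# is not part of the assignment) and assumes the input list has no duplicate
# pages.

```python
import random

def quicksort_random(pages, ranks):
    # Same descending-by-rank quicksort, but with a randomly chosen pivot,
    # which keeps the expected running time at n log n for any input order.
    if len(pages) <= 1:
        return pages
    pivot_index = random.randrange(len(pages))
    pivot_page = pages[pivot_index]
    pivot = ranks[pivot_page]
    rest = pages[:pivot_index] + pages[pivot_index + 1:]
    worse = [p for p in rest if ranks[p] <= pivot]
    better = [p for p in rest if ranks[p] > pivot]
    return quicksort_random(better, ranks) + [pivot_page] + quicksort_random(worse, ranks)
```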


cache = {
    'http://udacity.com/cs101x/urank/index.html': """<html>
<body>
<h1>Dave's Cooking Algorithms</h1>
<p>
Here are my favorite recipes:
<ul>
<li> <a href="http://udacity.com/cs101x/urank/hummus.html">Hummus Recipe</a>
<li> <a href="http://udacity.com/cs101x/urank/arsenic.html">World's Best Hummus</a>
<li> <a href="http://udacity.com/cs101x/urank/kathleen.html">Kathleen's Hummus Recipe</a>
</ul>

For more expert opinions, check out the
<a href="http://udacity.com/cs101x/urank/nickel.html">Nickel Chef</a>
and <a href="http://udacity.com/cs101x/urank/zinc.html">Zinc Chef</a>.
</body>
</html>


""",
    'http://udacity.com/cs101x/urank/zinc.html': """<html>
<body>
<h1>The Zinc Chef</h1>
<p>
I learned everything I know from
<a href="http://udacity.com/cs101x/urank/nickel.html">the Nickel Chef</a>.
</p>
<p>
For great hummus, try
<a href="http://udacity.com/cs101x/urank/arsenic.html">this recipe</a>.

</body>
</html>


""",
    'http://udacity.com/cs101x/urank/nickel.html': """<html>
<body>
<h1>The Nickel Chef</h1>
<p>
This is the
<a href="http://udacity.com/cs101x/urank/kathleen.html">
best Hummus recipe!
</a>

</body>
</html>


""",
    'http://udacity.com/cs101x/urank/kathleen.html': """<html>
<body>
<h1>
Kathleen's Hummus Recipe
</h1>
<p>

<ol>
<li> Open a can of garbanzo beans.
<li> Crush them in a blender.
<li> Add 3 tablespoons of tahini sauce.
<li> Squeeze in one lemon.
<li> Add salt, pepper, and buttercream frosting to taste.
</ol>

</body>
</html>


""",
    'http://udacity.com/cs101x/urank/arsenic.html': """<html>
<body>
<h1>
The Arsenic Chef's World Famous Hummus Recipe
</h1>
<p>

<ol>
<li> Kidnap the <a href="http://udacity.com/cs101x/urank/nickel.html">Nickel Chef</a>.
<li> Force her to make hummus for you.
</ol>

</body>
</html>


""",
    'http://udacity.com/cs101x/urank/hummus.html': """<html>
<body>
<h1>
Hummus Recipe
</h1>
<p>

<ol>
<li> Go to the store and buy a container of hummus.
<li> Open it.
</ol>

</body>
</html>


""",
    }

def get_page(url):
    if url in cache:
        return cache[url]
    return ""


def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links


def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

def add_page_to_index(index, url, content):
    words = content.split()
    for word in words:
        add_to_index(index, word, url)

def add_to_index(index, keyword, url):
    if keyword in index:
        index[keyword].append(url)
    else:
        index[keyword] = [url]

def lookup(index, keyword):
    if keyword in index:
        return index[keyword]
    else:
        return None

def crawl_web(seed):  # returns index, graph of outlinks
    tocrawl = [seed]
    crawled = []
    graph = {}  # <url>: [list of pages it links to]
    index = {}
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            outlinks = get_all_links(content)
            graph[page] = outlinks
            union(tocrawl, outlinks)
            crawled.append(page)
    return index, graph

def compute_ranks(graph):
    d = 0.8  # damping factor
    numloops = 10

    ranks = {}
    npages = len(graph)
    for page in graph:
        ranks[page] = 1.0 / npages

    for i in range(0, numloops):
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            for node in graph:
                if page in graph[node]:
                    newrank = newrank + d * (ranks[node] / len(graph[node]))
            newranks[page] = newrank
        ranks = newranks
    return ranks
# Here are some examples showing what ordered_search should do:

# Observe that the result list is sorted so the highest-ranking site is at the
# beginning of the list.

# Note: the intent of this question is for students to write their own sorting
# code, not to use the built-in sort procedure.

index, graph = crawl_web('http://udacity.com/cs101x/urank/index.html')
ranks = compute_ranks(graph)

print(ordered_search(index, ranks, 'Hummus'))
#>>> ['http://udacity.com/cs101x/urank/kathleen.html',
#    'http://udacity.com/cs101x/urank/nickel.html',
#    'http://udacity.com/cs101x/urank/arsenic.html',
#    'http://udacity.com/cs101x/urank/hummus.html',
#    'http://udacity.com/cs101x/urank/index.html']

print(ordered_search(index, ranks, 'the'))
#>>> ['http://udacity.com/cs101x/urank/nickel.html',
#    'http://udacity.com/cs101x/urank/arsenic.html',
#    'http://udacity.com/cs101x/urank/hummus.html',
#    'http://udacity.com/cs101x/urank/index.html']


print(ordered_search(index, ranks, 'babaganoush'))
#>>> None
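# Although the exercise asks for a hand-written sort, Python's built-in
# sorted() gives a handy one-line cross-check for the quicksort results. The
# helper name ordered_search_builtin below is illustrative, not part of the
# assignment.

```python
def ordered_search_builtin(index, ranks, keyword):
    # Same contract as ordered_search: None if the keyword is unknown,
    # otherwise matching URLs ordered from highest to lowest rank.
    if keyword not in index:
        return None
    return sorted(index[keyword], key=lambda url: ranks[url], reverse=True)
```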