Advertisement
Guy_Lalonde

CS101 Lesson 6 Starred problem 3

Jul 19th, 2014
195
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 7.39 KB | None | 0 0
  1. # Triple Gold Star
  2.  
  3. # Only A Little Lucky
  4.  
  5. # The Feeling Lucky question (from the regular homework) assumed it was enough
  6. # to find the best-ranked page for a given query. For most queries, though, we
  7. # don't just want the best page (according to the page ranking algorithm), we
  8. # want a list of many pages that match the query, ordered from the most likely
  9. # to be useful to the least likely.
  10.  
  11. # Your goal for this question is to define a procedure, ordered_search(index,
  12. # ranks, keyword), that takes the same inputs as lucky_search from Question 5,
  13. # but returns an ordered list of all the URLs that match the query.
  14.  
  15. # To order the pages, use the quicksort algorithm, invented by Sir Tony Hoare in
  16. # 1959. Quicksort provides a way to sort any list of data, using an expected
  17. # number of comparisons that scales as n log n where n is the number of elements
  18. # in the list.
  19.  
  20. # The idea of quicksort is quite simple:
  21.  
  22. # If the list has zero or one elements, it is already sorted.
  23.  
  24. # Otherwise, pick a pivot element, and split the list into two partitions: one
  25. # contains all the elements equal to or lower than the value of the pivot
  26. # element, and the other contains all the elements that are greater than the
  27. # pivot element. Recursively sort each of the sub-lists, and then return the
  28. # result of concatenating the sorted left sub-list, the pivot element, and the
  29. # sorted right sub-list.
  30.  
  31. # For simplicity, use the first element in the list as your pivot element (this
  32. # is not usually a good choice, since it means if the input list is already
  33. # nearly sorted, the actual work will be much worse than expected).
  34.  
  35.  
  36. def ordered_search(pages, ranks,keyword):
  37. listA, listB, listC = [],[],[]
  38. if len(pages)>1:
  39. pivot= ranks[pages[0]]
  40. for item in pages:
  41. if ranks[item]> pivot:
  42. listA.append(item)
  43. if ranks[item]== pivot:
  44. listB.append(item)
  45. if ranks[item]< pivot:
  46. listC.append(item)
  47. else:
  48. return pages
  49. return ordered_search(listA, ranks,keyword) + listB + ordered_search(listC, ranks,keyword)
  50.  
  51.  
  52.  
  53.  
  54.  
  55.  
  56. cache = {
  57. 'http://udacity.com/cs101x/urank/index.html': """<html>
  58. <body>
  59. <h1>Dave's Cooking Algorithms</h1>
  60. <p>
  61. Here are my favorite recipies:
  62. <ul>
  63. <li> <a href="http://udacity.com/cs101x/urank/hummus.html">Hummus Recipe</a>
  64. <li> <a href="http://udacity.com/cs101x/urank/arsenic.html">World's Best Hummus</a>
  65. <li> <a href="http://udacity.com/cs101x/urank/kathleen.html">Kathleen's Hummus Recipe</a>
  66. </ul>
  67.  
  68. For more expert opinions, check out the
  69. <a href="http://udacity.com/cs101x/urank/nickel.html">Nickel Chef</a>
  70. and <a href="http://udacity.com/cs101x/urank/zinc.html">Zinc Chef</a>.
  71. </body>
  72. </html>
  73.  
  74.  
  75.  
  76.  
  77.  
  78.  
  79. """,
  80. 'http://udacity.com/cs101x/urank/zinc.html': """<html>
  81. <body>
  82. <h1>The Zinc Chef</h1>
  83. <p>
  84. I learned everything I know from
  85. <a href="http://udacity.com/cs101x/urank/nickel.html">the Nickel Chef</a>.
  86. </p>
  87. <p>
  88. For great hummus, try
  89. <a href="http://udacity.com/cs101x/urank/arsenic.html">this recipe</a>.
  90.  
  91. </body>
  92. </html>
  93.  
  94.  
  95.  
  96.  
  97.  
  98.  
  99. """,
  100. 'http://udacity.com/cs101x/urank/nickel.html': """<html>
  101. <body>
  102. <h1>The Nickel Chef</h1>
  103. <p>
  104. This is the
  105. <a href="http://udacity.com/cs101x/urank/kathleen.html">
  106. best Hummus recipe!
  107. </a>
  108.  
  109. </body>
  110. </html>
  111.  
  112.  
  113.  
  114.  
  115.  
  116.  
  117. """,
  118. 'http://udacity.com/cs101x/urank/kathleen.html': """<html>
  119. <body>
  120. <h1>
  121. Kathleen's Hummus Recipe
  122. </h1>
  123. <p>
  124.  
  125. <ol>
  126. <li> Open a can of garbonzo beans.
  127. <li> Crush them in a blender.
  128. <li> Add 3 tablesppons of tahini sauce.
  129. <li> Squeeze in one lemon.
  130. <li> Add salt, pepper, and buttercream frosting to taste.
  131. </ol>
  132.  
  133. </body>
  134. </html>
  135.  
  136. """,
  137. 'http://udacity.com/cs101x/urank/arsenic.html': """<html>
  138. <body>
  139. <h1>
  140. The Arsenic Chef's World Famous Hummus Recipe
  141. </h1>
  142. <p>
  143.  
  144. <ol>
  145. <li> Kidnap the <a href="http://udacity.com/cs101x/urank/nickel.html">Nickel Chef</a>.
  146. <li> Force her to make hummus for you.
  147. </ol>
  148.  
  149. </body>
  150. </html>
  151.  
  152. """,
  153. 'http://udacity.com/cs101x/urank/hummus.html': """<html>
  154. <body>
  155. <h1>
  156. Hummus Recipe
  157. </h1>
  158. <p>
  159.  
  160. <ol>
  161. <li> Go to the store and buy a container of hummus.
  162. <li> Open it.
  163. </ol>
  164.  
  165. </body>
  166. </html>
  167.  
  168.  
  169.  
  170.  
  171. """,
  172. }
  173.  
  174. def get_page(url):
  175. if url in cache:
  176. return cache[url]
  177. return ""
  178.  
  179.  
  180. def get_next_target(page):
  181. start_link = page.find('<a href=')
  182. if start_link == -1:
  183. return None, 0
  184. start_quote = page.find('"', start_link)
  185. end_quote = page.find('"', start_quote + 1)
  186. url = page[start_quote + 1:end_quote]
  187. return url, end_quote
  188.  
  189. def get_all_links(page):
  190. links = []
  191. while True:
  192. url, endpos = get_next_target(page)
  193. if url:
  194. links.append(url)
  195. page = page[endpos:]
  196. else:
  197. break
  198. return links
  199.  
  200.  
  201. def union(a, b):
  202. for e in b:
  203. if e not in a:
  204. a.append(e)
  205.  
  206. def add_page_to_index(index, url, content):
  207. words = content.split()
  208. for word in words:
  209. add_to_index(index, word, url)
  210.  
  211. def add_to_index(index, keyword, url):
  212. if keyword in index:
  213. index[keyword].append(url)
  214. else:
  215. index[keyword] = [url]
  216.  
  217. def lookup(index, keyword):
  218. if keyword in index:
  219. return index[keyword]
  220. else:
  221. return None
  222.  
  223. def crawl_web(seed): # returns index, graph of inlinks
  224. tocrawl = [seed]
  225. crawled = []
  226. graph = {} # <url>, [list of pages it links to]
  227. index = {}
  228. while tocrawl:
  229. page = tocrawl.pop()
  230. if page not in crawled:
  231. content = get_page(page)
  232. add_page_to_index(index, page, content)
  233. outlinks = get_all_links(content)
  234. graph[page] = outlinks
  235. union(tocrawl, outlinks)
  236. crawled.append(page)
  237. return index, graph
  238.  
  239. def compute_ranks(graph):
  240. d = 0.8 # damping factor
  241. numloops = 10
  242.  
  243. ranks = {}
  244. npages = len(graph)
  245. for page in graph:
  246. ranks[page] = 1.0 / npages
  247.  
  248. for i in range(0, numloops):
  249. newranks = {}
  250. for page in graph:
  251. newrank = (1 - d) / npages
  252. for node in graph:
  253. if page in graph[node]:
  254. newrank = newrank + d * (ranks[node] / len(graph[node]))
  255. newranks[page] = newrank
  256. ranks = newranks
  257. return ranks
  258.  
  259.  
  260. # Here are some example showing what ordered_search should do:
  261.  
  262. # Observe that the result list is sorted so the highest-ranking site is at the
  263. # beginning of the list.
  264.  
  265. # Note: the intent of this question is for students to write their own sorting
  266. # code, not to use the built-in sort procedure.
  267.  
  268. index, graph = crawl_web('http://udacity.com/cs101x/urank/index.html')
  269. ranks = compute_ranks(graph)
  270. #keyword = 'Hummus'
  271. pages= lookup(index,keyword)
  272.  
  273.  
  274. print ordered_search(pages, ranks,'Hummus')
  275. #>>> ['http://udacity.com/cs101x/urank/kathleen.html',
  276. # 'http://udacity.com/cs101x/urank/nickel.html',
  277. # 'http://udacity.com/cs101x/urank/arsenic.html',
  278. # 'http://udacity.com/cs101x/urank/hummus.html',
  279. # 'http://udacity.com/cs101x/urank/index.html']
  280.  
  281. #print ordered_search(index, ranks, 'the')
  282. #>>> ['http://udacity.com/cs101x/urank/nickel.html',
  283. # 'http://udacity.com/cs101x/urank/arsenic.html',
  284. # 'http://udacity.com/cs101x/urank/hummus.html',
  285. # 'http://udacity.com/cs101x/urank/index.html']
  286.  
  287.  
  288. #print ordered_search(index, ranks, 'babaganoush')
  289. #>>> None
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement