from bs4 import BeautifulSoup
import html5lib   # not used directly; installed so BeautifulSoup can use the "html5lib" parser
import urllib2
import robotparser

# Written by David Harris, 2012
# Colorado Springs, CO, USA
# e-mail: forbiddenvoid@gmail.com
# Released under a Creative Commons License for use by Udacity.com
# -----------------------------
#
# The following is my 'finished' search engine, built as an
# entry to the contest at the end of the first run of Udacity's
# CS101 course: Build a Search Engine.

# A few preliminary notes:
# I have used two external modules: BeautifulSoup 4, for cleaning
# up web page content, and html5lib, which serves as a parser for BS4.
# urllib2 and robotparser are in the standard library for Python 2.7.
# I was pretty excited to find out that there was already a tool for parsing
# robots.txt, because writing one myself seemed like a pretty daunting task.
# Otherwise, I have adapted the code from the Udacity course and made a number
# of modifications:
#
# 1. Allowing for multiple search terms (exact phrase match only).
# 2. I modified the 'URank' algorithm so that it returns more relevant results.
#     - A more thorough explanation is attached to the multi_ranks function.
# 3. I have made the crawler more polite by following the instructions in robots.txt.
# 4. I incorporate both the max_pages and max_depth arguments in crawl_web.
#     - This allows more control over how much is crawled.
# 5. I use my own split_string function instead of the built-in .split().
# 6. I have set everything to lower case to eliminate duplicate, unconnected index entries.
# 7. I have built a console UI for executing searches, with a number of features:
#     - Search results are limited to a certain number per 'page.' The default is 10, changeable by the user.
#     - Navigate forward and backward through result pages.
#     - Jump to a specific result page.
#     - Constraints prevent going to pages that are out of bounds.
#     - The results page lists the number of matches, number of pages, current page and current result range.

# There is a brief explanation of each function preceding it. Feel free to e-mail me if you have
# questions about a specific function or about any part of the code.
# crawl_web is the main function of the web crawler portion of the code.
# This function builds an index of keywords, a directed graph of links,
# and stores web page content in a cache for later use. (The contents are not
# currently written to file due to time constraints, so the crawler must be
# run prior to every search. This will be fixed at a later time, but for now
# the code is more important than that part of the execution.)

# A few notes:
# Twitter statuses are not blocked by twitter.com's robots.txt file. I have built a filter into
# the function to avoid most of these pages, because they will quickly overrun the crawler graph.
# 199.59.148.11 is twitter.com's public IP address.
#
# RobotFileParser() and can_fetch() are both part of the robotparser module. I enclosed the read()
# and can_fetch() calls in a try block because I got errors on a few pages. For simplicity, an error
# while trying to access a robots.txt file defaults to 'not allowed to crawl.'
#
# BeautifulSoup() is the primary method used to create a BS4 object out of the HTML content of
# a web page. The object has some nice features; I only use three in this whole program, to
# preserve the CS101-level simplicity: prettify() returns the content as neatly indented markup,
# and get_text() extracts all of the non-HTML text from the page. (This currently also includes
# javascript; I'm looking into ways to remove the js as well.)
#
# I changed crawled from a list to a dictionary to speed up lookup times. I found that the time
# increased dramatically starting at around 2500 pages crawled with the list. The hashed dictionary
# structure is much faster.

def crawl_web(seed, max_depth, max_pages):
    tocrawl = [seed]
    crawled = {}
    graph = {}
    index = {}
    cache = {}
    depth = [0]     # parallel to tocrawl: depth[i] is the link depth of tocrawl[i]
    num_pages = 0
    while tocrawl and num_pages < max_pages:
        page = tocrawl.pop(0)
        page = page.lower()
        if page.endswith('/'):
            page = page[:-1]
        # Heuristic Twitter filter: individual status URLs are longer than the
        # bare domain, so skip twitter.com pages whose URL exceeds 28 characters.
        not_twitter = True
        if 'twitter.com' in page or '199.59.148.11' in page:
            if len(page) > 28:
                not_twitter = False
        current_depth = depth.pop(0)
        robot = robotparser.RobotFileParser()
        robot.set_url(roboize(page))
        try:
            robot.read()
            allowed = robot.can_fetch("*", page)
        except Exception:
            allowed = False
        if page not in crawled and current_depth <= max_depth and num_pages < max_pages and allowed and not_twitter:
            num_pages += 1
            try:
                soup = BeautifulSoup(get_page(page), "html5lib")
                soup.prettify()
            except Exception:
                soup = ''
            if soup:
                content = soup.get_text()
                cache[page] = content
                outlinks = get_all_links(soup, page)
                add_page_to_index(index, page, content)
            else:
                outlinks = []
            # Record a depth only for links that union() will actually add, so the
            # depth list stays in step with tocrawl.
            for link in outlinks:
                if link not in tocrawl:
                    depth.append(current_depth + 1)
            graph[page] = outlinks
            union(tocrawl, outlinks)
            crawled[page] = True
    return index, graph, cache
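
# Illustrative shapes of the structures returned above (made-up example data, not real crawl output):
#   index, graph, cache = crawl_web('http://www.udacity.com', 1, 20)
#   graph maps each crawled url to the list of urls it links out to, e.g.
#       {'http://www.udacity.com': ['http://www.udacity.com/overview', ...]}
#   cache maps each crawled url to the plain text extracted from that page.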

# This is a fairly simple function that creates the robots.txt url from the page being crawled.

def roboize(page):
    start_page = page.find('/') + 2
    if page.find('/', start_page) == -1:
        return page + '/robots.txt'
    else:
        return page[:page.find('/', start_page)] + '/robots.txt'
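
# Illustrative examples (the '/overview' path is made up):
#   roboize('http://www.udacity.com')          ->  'http://www.udacity.com/robots.txt'
#   roboize('http://www.udacity.com/overview') ->  'http://www.udacity.com/robots.txt'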

# Pretty straightforward. This function contacts a web page and returns its contents,
# or an empty string if the page cannot be fetched within 10 seconds.

def get_page(url):
    try:
        return urllib2.urlopen(url, None, 10).read()
    except Exception:
        return ""

# This is the third feature I use from BeautifulSoup: soup.find_all('a') collects the anchor
# tags on a page, and link.get('href') pulls out each link (internal and external). I've
# modified get_all_links from what we did in class to convert internal links to their full
# external counterparts, and to skip duplicate links so each outlink appears only once.

def get_all_links(soup, page):
    links = []
    # Trim the page url back to its root (scheme + host) so site-relative links can be resolved.
    last = page.find('/', 8)
    if last > 0:
        page = page[:last]
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:
            if href[0] == '/':
                href = page + href
            if href not in links:
                links.append(href)
    return links
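
# Illustrative example (made-up page and link): if the crawler is on
# 'http://www.udacity.com/cs101' and the page contains <a href="/overview">,
# the link is stored as 'http://www.udacity.com/overview'.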

# The following two functions work in tandem to split all of the content of a web page and
# add each word to the index, along with the url of the page and a list of positions on the page
# where the word appears. The positions are needed for searching multiple terms as a phrase.

def add_page_to_index(index, url, content):
    words = split_string(content, "!@#$%^&*(),./?><[]}\"{':;=-~`|\\ \n")
    counter = 0
    for word in words:
        counter += 1
        add_to_index(index, regex(word), url, counter)

def add_to_index(index, keyword, url, pos):
    if keyword in index:
        if url in index[keyword]:
            index[keyword][url].append(pos)
        else:
            index[keyword][url] = [pos]
    else:
        index[keyword] = {url: [pos]}
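
# Illustrative shape of the index (made-up data):
#   index = {'search': {'http://www.udacity.com': [12, 87]},
#            'engine': {'http://www.udacity.com': [13]}}
# i.e. keyword -> {url -> [positions of that keyword on the page]}.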

# Logical set union, nothing to see here, move along. :)

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

# My split_string function. There are more efficient ways to do this, but
# since this is the one I wrote for class, I used it.

def split_string(source, splitlist):
    end_split = []
    if source == "":
        return end_split
    marker = 0
    # Walk every character except the last; whenever a splitter character is hit,
    # emit the word collected since the last marker.
    for pos in range(0, len(source) - 1):
        if source[pos] in splitlist:
            if len(source[marker:pos]) > 0:
                end_split.append(source[marker:pos])
            marker = pos + 1
    # Handle the final character separately. If it is a splitter, close off the
    # word before it; otherwise whatever remains from the marker onward (even a
    # single character) is the last word.
    pos = len(source) - 1
    if source[pos] in splitlist:
        if len(source[marker:pos]) > 0:
            end_split.append(source[marker:pos])
    elif len(source[marker:]) > 0:
        end_split.append(source[marker:])
    return end_split
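
# Illustrative examples (using the tail handling above):
#   split_string("Build a search engine!", " !")  ->  ['Build', 'a', 'search', 'engine']
#   split_string("a b", " ")                      ->  ['a', 'b']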

# I don't have time to write a real regex parser, but this function eliminates
# most possessives and converts all words to lowercase.

def regex(word):
    result = word.lower()
    if result[-2:] == "'s":
        result = result[:-2]
    return result
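
# Illustrative example:  regex("Engine's")  ->  'engine'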

# This function looks up the given query and returns all of the urls where the query terms
# appear as an exact, consecutive phrase. I'd considered doing a .find on the cache, but quickly
# realized that would take a very, very long time as the cache increased in size. Essentially,
# this function builds a list of possible [url, position] candidates from the first keyword and
# eliminates every candidate whose url does not also contain each subsequent keyword at the
# next position.

def multi_lookup(index, query):
    result = []
    possible = []
    not_possible = []
    for keyword in query:
        if keyword not in index:
            return None
    # Start with every occurrence of the first keyword as a [url, position] candidate.
    for url in index[query[0]]:
        for pos in index[query[0]][url]:
            possible.append([url, pos])
    # Each later keyword must appear on the same url, exactly i + 1 positions after
    # the first keyword, or the candidate is discarded.
    for i in range(0, len(query) - 1):
        for url in possible:
            if url[0] not in index[query[i + 1]]:
                not_possible.append(url)
            else:
                flag = False
                for pos in index[query[i + 1]][url[0]]:
                    if url[1] + i + 1 == pos:
                        flag = True
                if not flag:
                    not_possible.append(url)
    for entry in not_possible:
        if entry in possible:
            possible.remove(entry)
    for entry in possible:
        if entry[0] not in result:
            result.append(entry[0])
    return result
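
# Illustrative example (made-up data): for the query ['search', 'engine'], a url
# qualifies only if 'engine' appears immediately after some occurrence of 'search':
#   index['search']['http://example.com'] = [40]
#   index['engine']['http://example.com'] = [41]   ->  'http://example.com' is returned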

# This is the 'URank' function we defined in class, adapted to the data structure I'm using.

def compute_ranks(graph):
    d = 0.8  # damping factor
    numloops = 10
    ranks = {}
    npages = len(graph)
    for page in graph:
        ranks[page] = 1.0 / npages
    for i in range(0, numloops):
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            for node in graph:
                if page in graph[node]:
                    newrank = newrank + d * (ranks[node] / len(graph[node]))
            newranks[page] = newrank
        ranks = newranks
    return ranks
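
# The update rule implemented above is the usual simplified PageRank recurrence:
#   rank(p) = (1 - d) / N  +  d * sum(rank(q) / outlinks(q))  over pages q that link to p
# where N is the number of pages in the graph and d is the damping factor.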

# This is the magical function that fixes the issues with URank / vanilla PageRank.
# A brief explanation first. PageRank will order all returned pages in a search
# by their absolute ranks. This is useful, but when you're searching for a certain word,
# say, sushi, if that word happens to appear on twitter.com, you'll likely get twitter as
# the first result, which seems inappropriate to the search itself.
#
# The multi_ranks process simply converts the existing graph and returns a graph of only
# those pages that actually contain the search terms. There is a disadvantage, in that
# the graph has to be built at query time, but the results are far more relevant to the
# query than the basic URank ordering we used before.

def multi_ranks(index, graph, query):
    newgraph = {}
    has_query = multi_lookup(index, query)
    if not has_query:
        return newgraph
    # Keep only pages that contain the query, and only their links to other such pages.
    for entry in graph:
        url_with_query = []
        if entry in has_query:
            for url in graph[entry]:
                if url in has_query:
                    url_with_query.append(url)
            newgraph[entry] = url_with_query
    return newgraph
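
# Illustrative example (made-up graph): with
#   graph     = {'A': ['B', 'C'], 'B': ['A'], 'C': []}
#   has_query = ['A', 'B']
# multi_ranks returns {'A': ['B'], 'B': ['A']}; page C and the link to it are dropped,
# so compute_ranks runs only over pages that contain the query.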

# Quicksort. Not really much explanation needed. This sorts [url, rank] pairs by rank,
# highest rank first.

def quicksort(rlist):
    if len(rlist) <= 1:
        return rlist
    left = []
    right = []
    pivot = rlist[0]
    for element in rlist:
        if element != pivot:
            if element[1] >= pivot[1]:
                left.append(element)
            else:
                right.append(element)
    return quicksort(left) + [pivot] + quicksort(right)
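
# Illustrative example:
#   quicksort([['a', 0.1], ['b', 0.5], ['c', 0.3]])  ->  [['b', 0.5], ['c', 0.3], ['a', 0.1]]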

# This takes the ranks returned by compute_ranks and uses quicksort to order them.
# The last step removes the actual ranks from the list, leaving only the ordered URLs.

def ordered_search(index, ranks, query):
    for keyword in query:
        if keyword not in index:
            return None
    result = []
    new_rank = []
    for element in ranks:
        new_rank.append([element, ranks[element]])
    result = quicksort(new_rank)
    final = []
    for element in result:
        final.append(element[0])
    return final

# This is the primary search function. It takes the index, graph and cache built by the crawler
# and lets the user search for the term of their choice. It runs until given the 'quit'
# command. So, no, you can't search for 'quit,' but really, why would you want to?
# It uses the split_string function to split the query into a list of words.
# In order, the function calls multi_ranks to build the graph of pages containing the search
# terms, then computes the ranks of those pages, then orders them. Finally, it calls
# search_result to present the results.


def do_search(index, graph, cache):
    print "*******************"
    print "Welcome to FVSearch"
    print "*******************"
    while True:
        query = raw_input("What can I help you find? (Type 'quit' to exit) >>")
        pass_query = query
        query = split_string(query, "!@#$%^&*(),./?><[]}\"{':;=-~`|\\ \n")
        if not query:
            # Nothing but delimiters (or an empty line) was entered.
            continue
        for i in range(0, len(query)):
            query[i] = query[i].lower()
        if query[0] == 'quit':
            break
        multi_graph = multi_ranks(index, graph, query)
        if multi_graph:
            ranks = compute_ranks(multi_graph)
        else:
            ranks = []
        if ranks:
            results = ordered_search(index, ranks, query)
        else:
            results = []
        if results:
            search_result(results, pass_query, cache)
        else:
            print 'No results found. Try another search term.'

# Here are our prettified search results. This gives the total number of results
# for the query and breaks them up into pages. The action switch uses user input
# to show either the next page, the previous page or a specific page.
# It also returns to the main search function if 's' or 'S' is typed.
# I inlined the option to change the number of results per page.


def search_result(results, query, cache):
    print '**************'
    print 'SEARCH RESULTS'
    print '**************'
    print 'Found ', len(results), ' matches for ', "'", query, "'"
    page = 1
    results_per_page = 10
    while True:
        if len(results) % results_per_page > 0:
            pages = len(results) / results_per_page + 1
        else:
            pages = len(results) / results_per_page
        show_page(results, page, results_per_page, query, cache)
        action = get_next_action()
        if action in ['s', 'S']:
            return
        if action in ['p', 'P']:
            if page > 1:
                page -= 1
            else:
                print
                print "Can't go back any more!"
                print
        if action in ['n', 'N']:
            if page < pages:
                page += 1
            else:
                print
                print 'No more results!'
                print
        if action in ['j', 'J']:
            page = jump_page(len(results), results_per_page)
        if action in ['o', 'O']:
            print
            print '1. Change number of results per page'
            print '2. Would you like to play a game?'
            print
            choice = raw_input('> ')
            if choice == '1':
                print
                get_answer = ''
                # Keep asking until a positive whole number is entered.
                while not (get_answer.isdigit() and int(get_answer) > 0):
                    print 'How many results per page?'
                    get_answer = raw_input('> ')
                results_per_page = int(get_answer)
                page = 1
                print
            else:
                print
                print "I'm sorry, Dave, but I can't do that."
                print

# This function just prints the results on the current page. I had built a function to
# also display an excerpt of the page, but it wasn't pretty (lots of extra whitespace, newlines
# and javascript inside) and turned out not to be very helpful.

def show_page(results, page, results_per_page, query, cache):
    if len(results) % results_per_page > 0:
        pages = len(results) / results_per_page + 1
    else:
        pages = len(results) / results_per_page
    start = ((page - 1) * results_per_page) + 1
    end = min([page * results_per_page, len(results)])
    print 'Showing results ', start, '-', end, ' ( Page ', page, ' of', pages, ')\n'
    j = 1
    for i in range(start - 1, end):
        print j, ' - ', results[i]
        print
        j += 1

# This function just gets user input to find out what they want to do once they receive a page of results.

def get_next_action():
    action_flag = True
    while action_flag:
        print '\n'
        print 'What would you like to do?'
        print '(N)ext Page - (P)revious Page - (J)ump to Page - New (S)earch - (O)ptions'
        print '\n'
        action = raw_input('> ')
        if action not in ['n', 'N', 'p', 'P', 'j', 'J', 's', 'S', 'o', 'O']:
            print "I don't understand what you're trying to do."
        else:
            action_flag = False
    return action

# This turned out to be more complex than I thought it would be. The process has to make sure that
# the page selected is a valid page, and also give an accurate count of the total pages based on the
# number of results and the results per page.

def jump_page(num_results, results_per_page):
    if num_results % results_per_page > 0:
        pages = num_results / results_per_page + 1
    else:
        pages = num_results / results_per_page
    while True:
        print '\n'
        print 'What page? Choose 1 -', pages
        page = ''
        # Keep asking until a whole number is entered.
        while not page.isdigit():
            page = raw_input('> ')
        page = int(page)
        print '\n'
        if page < 1 or page > pages:
            print '\n'
            print 'No such page!'
            print '\n'
        else:
            return page

## WARNING ##
# It takes a LONG time to crawl even 5000 pages, probably several hours.
# I don't recommend using larger numbers. Udacity seems to be a decent
# seed page, but you can experiment with others.

index, graph, cache = crawl_web('http://www.udacity.com', 4, 5000)
do_search(index, graph, cache)