a guest Feb 21st, 2020 80 Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
  1. Stemming (Conflation)
  2.     2 types
  3.         Dictionary-based
  4.             uses lists of related words
  5.             language dependent
  6.             drawback: cannot react automatically to new words
  7.         Algorithmic
  8.             uses program to determine related words
  9.             language dependent
  10.             drawback: lots of errors and difficult to modify, produces stems not words
  11.             e.g. Porter Stemmer
  13.     Krovetz Stemmer
  14.         hybrid algorithmic-dictionary stemmer
  15.         use dictionary if word is present, else use algorithm and check dictionary again
  16.         produces words not stems
  18. Term-document Incidence Matrix
  19.     Boolean matrix indicates if a term (rows) exists in a certain document(columns)
  20.     Size Issue
  21.         Ex) 500k distinct terms
  22.             500k x 1M matrix has half a trillion 0's and 1's
  23.             no more than one billion 1's
  24.                 matrix is extremely sparse
  25.                     save space by only recording the 1's
  27. Inverted Index
  28.     dictionary with terms as keys and a linked list of postings/docIDs as values
  30.     Query Processing
  31.         To process a merge (intersection of two posting list):
  32.             if postings are sorted by docID, can walk through both in linear time
  33.                 if list lengths are x and y, ther merge takes O(x+y) operations
  35.     Inverted Index with word frequencies
  36.         supports better ranking algorithms
  37.     Inverted Index with word positions
  38.         supports proximity matches
  40.     Fields and Extent Lists
  41.         field: a certain region of a document
  42.             e.g. title
  43.         extent: contiguous region of a document
  44.         represented by a range e.g. (9, 15)
  45.         inverted index records all extents for a given field type
  47.     Score-ordered lists as opposed to document-ordered lists
  48.         top part of each list will be the highest-scoring postings
  49.         very efficient for single-word queries
  51.     Construction
  52.         docIDs are URLs; map URLs to ints as docs are processed to save space
  53.             need to keep this mapping in a separate file for later reference
  55.         issue: no scalability for collections too large to fit in memory
  56.             can't use disk because too slow
  57.         Partial Indexes + Merging
  58.             build the inverted list until a certain size
  59.             write the partial index to disk, and start making a new one
  60.             disk will be filled with many partial indexes
  61.             partial indexes must be designed such that they can be merged in small pieces
  62.                 e.g. store in alphabetical order
  63.             Efficient merging
  64.                 open all partial index files simultaneously
  65.                 maintain a read buffer for each one and a writer buffer for the output file
  66.                 each iteration, pick lowest termID yet to be processed
  67.                 merge all postings lists for that termID and write it
  69.     Querying
  70.         Boolean Retrieval Model
  71.             Boolean Queries: use AND, OR, and NOT to join query terms
  72.             Pros
  73.                 predictable results, easy to explain
  74.                 many features can be incorporated
  75.                 efficient processing because many documents can be eliminated from search
  76.             Cons
  77.                 effectiveness depends on user
  78.                 simple queries usually don't work well
  79.                 complex queries are difficult
  81.         Query Optimization
  82.             start with term that has the shortest postings list
  83.             Ex) Brutus AND Calpurnia AND Caesar
  84.                 if Calpurnia is the shortest, then (Calpurnia AND Brutus) AND Caesar is better
  86.         Phrase Queries
  87.             Biword indexes
  88.                 index every consecutive pair of terms as a phrase
  89.                     each pair is now a dictionary term
  90.                     2-word phrase query processing is now immediate
  91.                 issue: false positives, dictionary gets too large
  92.             solution: Positional Indexes
  93.                 in the postings, for each term, store the positions in which it appears
  94.                 for English-like languages:
  95.                     positional index is 2-4 times as large as a non-positional index
  96.                     the size is 35-50% of the volume of the original text
  97.             can also combine both
RAW Paste Data
We use cookies for various purposes including analytics. By continuing to use Pastebin, you agree to our use of cookies as described in the Cookies Policy. OK, I Understand