Oct 16th, 2018
[[TOC(inline)]]

== Introduction to Bi-gram in Xapian ==

=== Bigrams ===
Bi-gram indexing will be implemented in Xapian for use in the bigram language model implementation and for collocations. All data produced during indexing is stored in B-tree tables, so for easy access each bigram will be stored as a bi-gram term.

For a document with content:

{{{
Read a book about the history of read.
}}}

The bi-grams (with the stop words "a", "the" and "of" removed) are listed below together with their bigram ids:
{{{
Bi-gram            bgid (bigramid)
"read book"      - 881
"book about"     - 882
"about history"  - 883
"history read"   - 884
}}}
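The example above can be reproduced with a short sketch. The helper names (normalise, tokenise, make_bigrams) and the whitespace-based splitting are assumptions made for illustration only; Xapian's TermGenerator tokenises text differently, and the bigram ids are not generated here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Lower-case a word and drop non-alphanumeric characters (for example the
// trailing full stop in "read.").
static std::string normalise(const std::string& word) {
    std::string out;
    for (unsigned char ch : word) {
        if (std::isalnum(ch)) out += static_cast<char>(std::tolower(ch));
    }
    return out;
}

// Split text on whitespace, normalise each word and drop stop words.
// Hypothetical helper: this is not how Xapian's TermGenerator tokenises.
std::vector<std::string> tokenise(const std::string& text,
                                  const std::set<std::string>& stopwords) {
    std::vector<std::string> terms;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        std::string term = normalise(word);
        if (!term.empty() && stopwords.count(term) == 0) terms.push_back(term);
    }
    return terms;
}

// Pair each term with its successor to form bigrams of consecutive terms.
std::vector<std::string> make_bigrams(const std::vector<std::string>& terms) {
    std::vector<std::string> bigrams;
    for (std::size_t i = 0; i + 1 < terms.size(); ++i) {
        bigrams.push_back(terms[i] + " " + terms[i + 1]);
    }
    return bigrams;
}
```

Calling tokenise() on the sample sentence with the stop words "a", "the" and "of", and passing the result to make_bigrams(), yields the four bigrams listed above.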

=== Term List Data Storage ===
Term list table entries for the document (say did = 88; the key for unigram terms has uni appended and the key for bi-grams has bi appended):

''The data is actually stored with several optimizations, such as reusing the previous term. This is only meant as an overview of how the data is stored, not the exact format.''

{{{
88uni 5 4 read 2 book 1 about 1 history 1
88bi 4 4 read^book 1 book^about 1 about^history 1 history^read 1
}}}
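As a rough sketch, the two conceptual entries above can be built from the document's terms and bigrams as follows. The function name termlist_entry is hypothetical, and the output is the simplified textual layout used on this page, not the real Brass encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Build "<did><suffix> <length> <unique> <item> <wdf> ..." for one document.
// Items are listed in order of first occurrence to match the overview above;
// a real back-end would store them sorted and compressed.
std::string termlist_entry(unsigned did, const std::string& suffix,
                           const std::vector<std::string>& items) {
    std::vector<std::string> order;       // unique items, first-occurrence order
    std::map<std::string, unsigned> wdf;  // item -> within-document frequency
    for (const std::string& item : items) {
        if (wdf[item]++ == 0) order.push_back(item);
    }
    std::string entry = std::to_string(did) + suffix + " " +
                        std::to_string(items.size()) + " " +
                        std::to_string(order.size());
    for (const std::string& item : order) {
        entry += " " + item + " " + std::to_string(wdf[item]);
    }
    return entry;
}
```

The same function serves both keys: pass "uni" with the term sequence, or "bi" with the bigram sequence.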

=== Posting List Data Storage ===
Posting list table entries for the document (the document ids are made up; the key has uni appended for unigram terms, bi for bigram terms and coll for collocation entries):

{{{
BRASSPostTable entry

term     termfreq collfreq firstdid islast firstdocid lastdocid docid freq docid freq docid freq
readuni  3        8        98       1      98         88        98    8    78    9    88    1

BRASSBigramPostListTable entry

bigram       bigramfreq collfreq firstdid islast firstdocid lastdocid docid freq docid freq
read^bookbi  2          2        67       1      67         74        67    1    74    1

BigramCollocationPostListTable entry (both frequencies count bigrams whose first gram is "read")

key       uniqbigramfreq totalbigramfreq firstbigramid islast firstbigramid lastbigramid bigramid freq bigramid freq
readcoll  2              4               871           1      871           981          871      2    981      2
}}}

I am suggesting a posting-list-style entry that maps a unigram to bigram ids, so that the indexed bigrams can be used for collocation. Otherwise I cannot foresee how the indexed bi-grams would be used for collocation.
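To make the fields of the read^bookbi entry concrete, here is a minimal in-memory sketch. ToyPostList and postlist_key are hypothetical names; this is not the Brass on-disk format, only an illustration of how termfreq (number of documents), collfreq (total occurrences) and the first and last docid relate to the stored (docid, freq) pairs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One (docid, wdf) pair in a posting list.
struct Posting {
    unsigned did;
    unsigned wdf;
};

// Orders postings by document id.
static bool by_did(const Posting& a, const Posting& b) { return a.did < b.did; }

// Hypothetical in-memory posting list for one term or bigram.
class ToyPostList {
    std::vector<Posting> postings;  // kept sorted by docid
  public:
    void add(unsigned did, unsigned wdf) {
        Posting p = {did, wdf};
        postings.insert(
            std::upper_bound(postings.begin(), postings.end(), p, by_did), p);
    }
    // Number of documents containing the term or bigram.
    std::size_t termfreq() const { return postings.size(); }
    // Total number of occurrences across all documents.
    unsigned collfreq() const {
        unsigned total = 0;
        for (std::size_t i = 0; i < postings.size(); ++i) total += postings[i].wdf;
        return total;
    }
    // Both accessors require a non-empty list.
    unsigned first_did() const { return postings.front().did; }
    unsigned last_did() const { return postings.back().did; }
};

// Key scheme from the tables above: term + "uni" or bigram + "bi".
std::string postlist_key(const std::string& term, bool is_bigram) {
    return term + (is_bigram ? "bi" : "uni");
}
```

Adding the postings (67, 1) and (74, 1) under the key read^bookbi gives termfreq 2, collfreq 2, first docid 67 and last docid 74, as in the bigram entry above.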

== TermGenerator Class (Indexing) ==

Currently the index_text function in termgenerator_internal.cc indexes the terms in the text. Now we also want it to produce bigrams, unigrams, or both.

=== In-place Tokenization ===

One option would be to remember the previous token and merge the previous and current terms inside the index_text function.

=== Tokenization Class [Preferred One] ===

Two classes, UnigramTokenization and BigramTokenization, are to be implemented.
UnigramTokenization.next() will return the current term and BigramTokenization.next() will return the current bigram.

'''This will make the framework better if we later want to index higher-order n-grams or change how bigrams are created, e.g. removing the restriction that a bigram consists of consecutive terms. Moreover the code would be more readable and easier to understand.'''
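One possible shape for the two classes is sketched below. The common Tokenization base class, the bool next(std::string&) signature and the use of an already-split term vector as input are all assumptions; the proposal only fixes the class names and the existence of next().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical common interface, so index_text could work with either class.
class Tokenization {
  public:
    virtual ~Tokenization() {}
    // Store the next token in `out`; return false once input is exhausted.
    virtual bool next(std::string& out) = 0;
};

// Yields the terms one at a time.
class UnigramTokenization : public Tokenization {
    std::vector<std::string> terms;
    std::size_t pos;
  public:
    explicit UnigramTokenization(const std::vector<std::string>& terms_)
        : terms(terms_), pos(0) {}
    bool next(std::string& out) override {
        if (pos >= terms.size()) return false;
        out = terms[pos++];
        return true;
    }
};

// Yields bigrams of consecutive terms, joined with '^' as in the
// storage examples on this page.
class BigramTokenization : public Tokenization {
    UnigramTokenization unigrams;
    std::string previous;
    bool have_previous;
  public:
    explicit BigramTokenization(const std::vector<std::string>& terms_)
        : unigrams(terms_), have_previous(false) {}
    bool next(std::string& out) override {
        if (!have_previous) {
            if (!unigrams.next(previous)) return false;
            have_previous = true;
        }
        std::string current;
        if (!unigrams.next(current)) return false;
        out = previous + "^" + current;
        previous = current;
        return true;
    }
};
```

Because BigramTokenization is built on top of UnigramTokenization, relaxing the consecutive-terms restriction or moving to longer n-grams would only touch the class that combines terms, not the indexing loop.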

== DocumentBigramTerm ==

Since the final storage of bi-grams is similar to that of normal terms in Xapian, I am planning a DocumentBigramTerm class similar to the OmDocumentTerm class.

I am planning DocumentBigramTerm with the following members:

string bigram;
int wdf;
string term1;
string term2;

Functions:

inc_wdf(termcount)

get_wdf()

get_description()
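A minimal sketch of such a class, assuming the members and functions listed above. The constructor, the get_bigram() accessor, the '^' separator and the exact description string are illustrative assumptions, and termcount is declared locally as a stand-in for Xapian::termcount.

```cpp
#include <cassert>
#include <string>

typedef unsigned termcount;  // stand-in for Xapian::termcount

class DocumentBigramTerm {
    std::string bigram;  // combined form, e.g. "read^book"
    termcount wdf;       // within-document frequency of the bigram
    std::string term1;   // first gram
    std::string term2;   // second gram
  public:
    // Hypothetical constructor; the '^' separator follows the storage
    // examples on this page.
    DocumentBigramTerm(const std::string& term1_, const std::string& term2_,
                       termcount wdf_ = 0)
        : bigram(term1_ + "^" + term2_), wdf(wdf_),
          term1(term1_), term2(term2_) {}

    void inc_wdf(termcount inc) { wdf += inc; }
    termcount get_wdf() const { return wdf; }
    const std::string& get_bigram() const { return bigram; }

    std::string get_description() const {
        return "DocumentBigramTerm(" + bigram +
               ", wdf=" + std::to_string(wdf) + ")";
    }
};
```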

'''Doubts:'''
* '''Since this class is similar to OmDocumentTerm, should it inherit from OmDocumentTerm or not?'''
* '''Should we store bigrams in the database as "term1 term2" or as "term1-term2"? (We do a lot of optimization there, such as reusing the previous term, so I wonder whether having a space between the terms would make a difference to that optimization.)'''

== Document class changes ==

This class will have two more storage maps to store bigrams and collocation terms:

map<string bigramterm, DocumentBigramTerm>

map<string unigramterm, map<bigramid bgid, bigramterm>>

Methods such as bigramlist_begin() will support reading these storage maps, and the maps will be filled back from the database on open_document.
The open_term_list() implementation will change to support reading bigrams from the database, using open_term_list(bool isbigram) in the document class.
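The two maps could look roughly like this. BigramDocument, BigramEntry, add_bigram(), bigramlist_end() and collocation_count() are hypothetical names introduced for the sketch; only bigramlist_begin() is named in the proposal. Bigram ids are passed in by the caller, since how they are generated is still open.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

typedef unsigned bigramid;  // hypothetical id type for bigrams

// Simplified stand-in for DocumentBigramTerm.
struct BigramEntry {
    std::string term1;
    std::string term2;
    unsigned wdf;
};

class BigramDocument {
    // bigram term -> its entry
    std::map<std::string, BigramEntry> bigrams;
    // unigram term -> (bigramid -> bigram whose first gram is that unigram)
    std::map<std::string, std::map<bigramid, std::string> > collocations;
  public:
    typedef std::map<std::string, BigramEntry>::const_iterator bigram_iterator;

    // Record one occurrence of the bigram term1^term2 with the given id.
    void add_bigram(const std::string& term1, const std::string& term2,
                    bigramid bgid) {
        const std::string key = term1 + "^" + term2;
        std::map<std::string, BigramEntry>::iterator it = bigrams.find(key);
        if (it == bigrams.end()) {
            BigramEntry entry = {term1, term2, 1};
            bigrams.insert(std::make_pair(key, entry));
        } else {
            ++it->second.wdf;
        }
        collocations[term1][bgid] = key;
    }

    bigram_iterator bigramlist_begin() const { return bigrams.begin(); }
    bigram_iterator bigramlist_end() const { return bigrams.end(); }

    // Number of distinct bigrams that start with the given unigram.
    std::size_t collocation_count(const std::string& term) const {
        std::map<std::string, std::map<bigramid, std::string> >::const_iterator
            it = collocations.find(term);
        return it == collocations.end() ? 0 : it->second.size();
    }
};
```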

'''Doubts:'''

* '''I want to generate a bigramid for every bigram, mainly so that bigrams can be stored in the back-end for easy access for collocation. How can we generate such a unique id? The docid is generated automatically; how is it generated?'''
* '''What is the use of opening a document from the database with the open_document function in brass_backend, and in which cases does that happen?'''
* '''Generating the bigramid is fine, but I will also need to look up the bigram corresponding to a bigramid. Where can I easily store and access those (bigramid, bigram) pairs in the back-end?'''
* '''Do you think it is a good idea to implement and integrate collocation in this way? If you can think of a better way to access collocations, please suggest it. (This is really a low-priority task now; I think it is more sane to first implement BigramLMWeight and make that functional than to include all the bigram functionality.)'''

== Database changes ==

=== Changes in function add_document_() ===

The add_document_() function transfers the term list to the termlist table and stores the posting list changes (i.e. the did needs to be added to the postlist of each term in the document).

In addition it will now pass the bigrams to the inverter class to store the postlist changes for these bigrams, and pass the bigram list to the termlist table to store the bigrams.

This function will also add the collocation data to the postlist changes,
and finally call the flush function on all of the posting list data: doclength, unigram postlist, bigram postlist and collocation postlist.

=== Changes in open_term_list(), open_post_list() ===

The open_term_list function will now support bigrams and will open a BrassTermList or a BrassBigramList depending on whether unigrams or bigrams are requested.

open_term_list(did,isbigram)

The open_post_list function will now support bigrams and will open a BrassPostList or a BrassBigramPostList depending on whether the term is a unigram or a bigram.

open_post_list(term,isbigram)
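The intended flow can be illustrated with a toy in-memory stand-in. ToyDatabase is hypothetical and returns plain containers instead of BrassTermList or BrassPostList objects; it has no inverter or flush step and omits collocation data. It only shows how the isbigram flag selects the key suffix on both the write path and the read path.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef unsigned docid;  // stand-in for Xapian::docid

class ToyDatabase {
    // termlist table: "<did>uni" or "<did>bi" -> items of that document
    std::map<std::string, std::vector<std::string> > termlists;
    // postlist table: "<term>uni" or "<bigram>bi" -> (docid -> wdf)
    std::map<std::string, std::map<docid, unsigned> > postlists;

    static const char* suffix(bool isbigram) { return isbigram ? "bi" : "uni"; }

    // Store one list of items in both tables under the matching suffix.
    void store(docid did, const std::vector<std::string>& items, bool isbigram) {
        if (items.empty()) return;
        termlists[std::to_string(did) + suffix(isbigram)] = items;
        for (std::size_t i = 0; i < items.size(); ++i) {
            ++postlists[items[i] + suffix(isbigram)][did];
        }
    }

  public:
    void add_document(docid did, const std::vector<std::string>& terms,
                      const std::vector<std::string>& bigrams) {
        store(did, terms, false);
        store(did, bigrams, true);
    }

    // Returns an empty list if the document has no entry of that kind.
    std::vector<std::string> open_term_list(docid did, bool isbigram) const {
        std::map<std::string, std::vector<std::string> >::const_iterator it =
            termlists.find(std::to_string(did) + suffix(isbigram));
        if (it == termlists.end()) return std::vector<std::string>();
        return it->second;
    }

    // Returns an empty list if the term or bigram is not indexed.
    std::map<docid, unsigned> open_post_list(const std::string& term,
                                             bool isbigram) const {
        std::map<std::string, std::map<docid, unsigned> >::const_iterator it =
            postlists.find(term + suffix(isbigram));
        if (it == postlists.end()) return std::map<docid, unsigned>();
        return it->second;
    }
};
```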

== Bigram Termlist ==

The bi-gram termlist will be added in the same way as the term list, with a different key.

'''Doubts:'''

* '''I have noticed that the term list is used to open the document, i.e. we read the term list and add its terms to the document. Is there any other use of maintaining the term list?'''

== Brass BiGram List Changes ==

Opens the bigram list in the termlist table in a way similar to BrassTermList to support reading.

== Brass Bigram PostList Changes ==

Opens the posting list of a bigram stored in the postlist table and provides access to the bigram's postlist. open_post_list(bigram, 1 (isbigram)) will return an object of this class.

== Brass Collocation PostList Changes ==

This class will not be implemented for now, as it is only required for collocation and BigramLMWeight currently makes no use of collocation.

== Query Parser Changes ==

Need to integrate bigram term creation into the query, enabled by a flag. Need to make changes to the calling infrastructure: QueryParser and the query internal object call the open_post_list function to get the postlist of a term, so this needs to be tweaked to open a BrassBigramPostList when the term is a bigram.

Everything else will be similar, as BrassBigramPostList will be similar to BrassPostList.