Untitled

a guest
May 7th, 2017
  1. Liber Primus 5-grams Found in Google Data
  2. 6/5/2017
  3.  
  4. INTRODUCTION:
  5. Lists of tagged 5-grams from the Google Books Ngram data whose word lengths match those of
  6. 5-grams in the Liber Primus. They are intended as raw data for crib-finding.
  7. http://cicada3301.boards.net/thread/48/words-liber-primus-ii-grams
  8.  
  9. DATASET:
  10. http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
  11. files: googlebooks-eng-all-5gram-20120701-XX_.gz
  12.  
  13. CUTS:
  14. Selected 5-grams that match LP 5-gram word lengths, with and without punctuation.
  15. For 5-grams with quotation marks, versions with additional commas have been added,
  16. e.g. ...,"..." even though LP has ..."...".
  17. For hints and tips on how the cutting was done, see "example_python_n-gram_parser.py".
  18. , ; : are the assumed punctuation for "3-dots"
  19. . ? ! are the assumed punctuation for "4-dots"
  20.  
  21. FILENAME:
  22. all words are in master list: "XX_master.txt"
  23. NOT all words are in master list: "XX_non_master.txt"
  24. XX is the first two letters of the 5-gram (as in the Google file names)
  25. master_word_list is available at https://pastebin.com/2tpcEwTs
  26.  
  27. FORMAT:
  28. Space delimited data
  29. word1 word2 word3 word4 word5 POS1 POS2 POS3 POS4 POS5 LENGTH1 LENGTH2 LENGTH3 LENGTH4 LENGTH5 MASTER_TAG COUNTS
  30. wordX is a word or punctuation mark. Letter case (upper/lower) has been preserved. Capitalised first letters
  31. (probably) indicate the start of a sentence. Punctuation . ! ? (probably) indicates the end of a sentence.
  32. POSX = Part Of Speech (noun, verb, etc.), coded as in TAG CODE
  33. LENGTHX is the length of the word *in runes*
  34. MASTER_TAG is M if all five words are in our master_list, S otherwise
  35. COUNTS is the number of entries for the phrase in the Google data
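
A minimal sketch of reading one record in the format above. The names Record and parse_record are hypothetical and not taken from "example_python_n-gram_parser.py"; the only thing assumed is the 17-field space-delimited layout described here.

```python
from typing import NamedTuple, Tuple


class Record(NamedTuple):
    """One line of an XX_master.txt / XX_non_master.txt file (hypothetical container)."""
    words: Tuple[str, ...]    # word1..word5 (words or punctuation)
    pos: Tuple[str, ...]      # POS1..POS5 single-character codes
    lengths: Tuple[int, ...]  # LENGTH1..LENGTH5, in runes
    master_tag: str           # "M" or "S"
    count: int                # COUNTS


def parse_record(line: str) -> Record:
    """Split a space-delimited record into its 17 fields."""
    fields = line.split()
    if len(fields) != 17:
        raise ValueError(f"expected 17 fields, got {len(fields)}: {line!r}")
    return Record(
        words=tuple(fields[0:5]),
        pos=tuple(fields[5:10]),
        lengths=tuple(int(x) for x in fields[10:15]),
        master_tag=fields[15],
        count=int(fields[16]),
    )
```

Lines that do not have exactly 17 fields raise ValueError, so a caller can decide whether to skip or report them.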
  36.  
  37. TAG CODE:
  38. _NOUN = N
  39. _VERB = V
  40. _ADJ = J
  41. _ADV = D
  42. _PRON = P
  43. _DET = E
  44. _ADP = A
  45. _NUM = U
  46. _CONJ = C
  47. _PRT = R
  48. _X = X
  49. _. = . (punctuation)
  50. NO_POS_TAG = _
  51. ALL_IN_MASTER_LIST = M
  52. NOT_ALL_IN_MASTER_LIST = S
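
A sketch of how the tag codes above could be applied to a Google token such as "asylum_NOUN". The names POS_CODES and split_token are hypothetical, and the handling of untagged tokens (code "_") is an assumption based on the NO_POS_TAG entry, not the author's actual code.

```python
# Google POS suffix (without the leading underscore) -> single-character code.
POS_CODES = {
    "NOUN": "N", "VERB": "V", "ADJ": "J", "ADV": "D",
    "PRON": "P", "DET": "E", "ADP": "A", "NUM": "U",
    "CONJ": "C", "PRT": "R", "X": "X", ".": ".",
}


def split_token(token):
    """Return (word, code) for a token like 'asylum_NOUN'.

    Tokens without a recognised tag suffix are returned unchanged
    with the NO_POS_TAG code '_'.
    """
    word, sep, tag = token.rpartition("_")
    if sep and word and tag in POS_CODES:
        return word, POS_CODES[tag]
    return token, "_"
```

Splitting on the last underscore keeps any underscores inside the word itself intact.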
  53.  
  54. EXAMPLE:
  55. back to the asylum with D R E N A 4 2 2 6 3 S 46
  56. resolves to:
  57. back_ADV to_PRT the_DET asylum_NOUN with_ADP
  58. word-length-in-runes 4 2 2 6 3
  59. NOT all words in master list
  60. 46 counts
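
A sketch of one way to compute word-length-in-runes. It assumes the Gematria Primus multi-letter runes (TH, EO, NG/ING, OE, AE, IA/IO, EA) and a greedy left-to-right match with the longest group tried first. The name rune_length is hypothetical, non-letters are simply ignored, and the author's parser may resolve ambiguous spellings differently.

```python
# Letter groups written as a single rune, longest first (assumed from Gematria Primus).
MULTI_LETTER_RUNES = ("ing", "th", "eo", "ng", "oe", "ae", "ia", "io", "ea")


def rune_length(word):
    """Count runes in a word using greedy longest-match on letter groups."""
    letters = "".join(ch for ch in word.lower() if ch.isalpha())
    i = 0
    runes = 0
    while i < len(letters):
        for group in MULTI_LETTER_RUNES:
            if letters.startswith(group, i):
                i += len(group)  # the whole group is one rune
                break
        else:
            i += 1               # ordinary single-letter rune
        runes += 1
    return runes


print([rune_length(w) for w in "back to the asylum with".split()])
# prints [4, 2, 2, 6, 3]
```

This reproduces the lengths in the example above ("the" is TH-E, "with" is W-I-TH).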
  61.  
  62. SAMPLE:
  63. bid her unsay all again _ _ _ _ _ 3 3 5 3 5 S 73
  64. bidding him use it for _ _ _ _ _ 5 3 3 2 3 S 95
  65. bier ye can not fashion _ _ _ _ _ 4 2 3 3 6 S 47
  66. big and bright and crazy _ _ _ _ _ 3 3 6 3 5 S 66
  67. big basket that had hints _ _ _ _ _ 3 6 3 3 5 S 46
  68. big dining room table with J N N N A 3 4 4 5 3 S 48
  69. big steamer sink beneath the J N N A E 3 6 4 5 2 S 74
  70. bigger man might find ample J N V V J 6 3 5 4 5 S 129
  71. biggest and jovialest man of _ _ _ _ _ 7 3 8 3 2 S 49
  72. ...
  73. parents when they need to N D P V R 7 4 3 4 2 M 46
  74. part almost as well as _ _ _ _ _ 4 6 2 4 2 M 55
  75. part both mad and wicked _ _ _ _ _ 4 3 3 3 6 M 47
  76. part had been played out _ _ _ _ _ 4 3 4 6 3 M 59
  77. part not to take the _ _ _ _ _ 4 3 2 4 2 M 40
  78. part of all the operations _ _ _ _ _ 4 2 3 2 9 M 62
  79. part of all the participants _ _ _ _ _ 4 2 3 2 12 M 434
  80. part of an agency permit _ _ _ _ _ 4 2 2 6 6 M 174
  81. part of an ongoing and _ _ _ _ _ 4 2 2 4 3 M 599
  82. part of her body has _ _ _ _ _ 4 2 3 4 3 M 139
  83. part of her had believed _ _ _ _ _ 4 2 3 3 8 M 44
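
A sketch of using these records for crib-finding: keep only the phrases whose rune lengths match a target 5-gram length pattern from the LP. The function name matching_phrases is hypothetical; lines that do not have 17 fields (such as the "..." above) are skipped.

```python
def matching_phrases(lines, target_lengths):
    """Yield (phrase, count) for records whose LENGTH1..LENGTH5 equal target_lengths."""
    target = [str(n) for n in target_lengths]
    for line in lines:
        fields = line.split()
        if len(fields) != 17:
            continue  # not a data record
        if fields[10:15] == target:
            yield " ".join(fields[:5]), int(fields[16])


sample = [
    "part of all the operations _ _ _ _ _ 4 2 3 2 9 M 62",
    "part of all the participants _ _ _ _ _ 4 2 3 2 12 M 434",
    "part of an agency permit _ _ _ _ _ 4 2 2 6 6 M 174",
]
print(list(matching_phrases(sample, (4, 2, 3, 2, 9))))
# prints [('part of all the operations', 62)]
```

Sorting the result by count (highest first) is one simple way to rank candidate cribs.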
  84.  
  85.  
  86. MOAR:
  87. 2,3,4 grams next ... ? :)
  88.  
  89. HELP:
  90. http://webchat.freenode.net/?channels=cicadasolvers