Untitled

a guest
Jan 23rd, 2018
#!/usr/bin/env ruby

### Build an inverted index from a string and a document ID.
# str: the string to index
# id:  the document ID
# opt:
#   n:     the N in N-gram (default: 2)
#   index: an existing index to append to (default: a new Hash)
def indexize(str, id, opt = {})
  n = opt[:n] || 2
  index = opt[:index] || {}

  chars = str.strip.split(//)
  tokens = tokenize(chars, n)
  tokens.each.with_index do |token, i|
    # tokenize yields nil for windows it skipped; i stays the window offset.
    next if token.nil?
    index[token] ||= []
    index[token] << "#{id}:#{i}"
  end
  index
end

### Build the array of N-gram tokens from an array of characters.
# A window that contains a line-break character yields nil instead of a token.
def tokenize(chars, n)
  (0..chars.size - n).map do |i|
    grams = (0..n - 1).map { |j| chars[i + j].chomp }
    next if grams.any?(&:empty?)
    grams.join
  end
end

id = 1 # XXX: the ID should be passed in from outside
indexize(STDIN.read, id).each do |k, v|
  puts "#{k}\t#{v.join(",")}"
end
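
The script above only ever indexes one document read from STDIN. As a rough illustration of how the `index` option might be used to accumulate several documents into one index, here is a minimal, self-contained sketch. The two input strings and their IDs are made up for the example, and the functions are compact copies of the ones above, with skipped windows ignored rather than stored under a `nil` key.

```ruby
# Compact copy of tokenize: one token per window of n characters,
# nil for any window that contains a line-break character.
def tokenize(chars, n)
  (0..chars.size - n).map do |i|
    grams = chars[i, n].map(&:chomp)
    grams.any?(&:empty?) ? nil : grams.join
  end
end

# Compact copy of indexize: appends "id:offset" postings per token.
def indexize(str, id, opt = {})
  n = opt[:n] || 2
  index = opt[:index] || {}
  tokenize(str.strip.split(//), n).each.with_index do |token, i|
    next if token.nil?
    (index[token] ||= []) << "#{id}:#{i}"
  end
  index
end

# Hypothetical documents: index the first, then pass the result back in
# through opt[:index] so the second document is added to the same Hash.
index = indexize("abab", 1)
index = indexize("bab", 2, index: index)

index.each { |k, v| puts "#{k}\t#{v.join(",")}" }
# Prints:
# ab	1:0,1:2,2:1
# ba	1:1,2:0
```

Each posting has the form `id:offset`, so a token that occurs in several documents ends up with entries from all of them in insertion order.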