Untitled

a guest
Oct 18th, 2018
# Very (very) naive word segmentation algorithm for Chinese
# (or any language with similar characteristics; works at the
# character level).
class Partitioner
  attr_reader :ngrams

  # +ngrams+ Enumerable list of known n-grams
  # +lookahead+ maximum n-gram length (in characters) to try
  def initialize(ngrams, lookahead = 6)
    @lookahead = lookahead
    # Hash used as a set for fast membership tests
    @ngrams = {}
    ngrams.each { |ng| @ngrams[ng] = true }
  end

  # Goes from beginning to end, each time taking the longest initial
  # run of characters that is in the list of known n-grams. A single
  # character that matches nothing becomes its own segment.
  def partition(text)
    chars = text.chars
    result = []
    until chars.empty?
      # Never look past the end of the remaining text
      lookahead = [@lookahead, chars.length].min
      while lookahead > 0
        test = chars[0...lookahead].join
        if lookahead == 1 || ngrams[test]
          result << test
          chars = chars[lookahead..-1]
          break
        end
        lookahead -= 1
      end
    end
    result
  end
end
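
A small worked example may help show what the greedy longest-match strategy does, and where it goes wrong. The sketch below is a standalone restatement of the same idea, not the class above: the method name greedy_partition, the constants, and the four-word vocabulary are all hypothetical and chosen only for illustration.

```ruby
require 'set'

# Hypothetical toy vocabulary, for illustration only.
VOCAB = Set.new(%w[中国 人民 中国人 银行])

# Greedy longest-match: at each position, take the longest prefix (up to
# +lookahead+ characters) found in +vocab+; otherwise fall back to a
# single character.
def greedy_partition(text, vocab, lookahead = 6)
  chars = text.chars
  result = []
  pos = 0
  while pos < chars.length
    # Longest candidate length that still fits in the remaining text
    max = [lookahead, chars.length - pos].min
    # Try lengths max, max-1, ..., 2; default to 1 if none is known
    len = max.downto(2).find { |n| vocab.include?(chars[pos, n].join) } || 1
    result << chars[pos, len].join
    pos += len
  end
  result
end

SEGMENTS = greedy_partition('中国人民银行', VOCAB)
# SEGMENTS is ["中国人", "民", "银行"]
```

With this vocabulary the greedy choice of 中国人 at the start leaves 民 stranded as a single character, even though 中国 / 人民 / 银行 would use only known words. That is the sense in which the approach is naive: it never backtracks once a longest match is taken.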