Untitled

a guest
Jul 17th, 2018
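import re
from string import ascii_lowercase, punctuation

import pymorphy2
from nltk.stem.snowball import SnowballStemmer

# ignore_stopwords=True needs the NLTK stopwords corpus: nltk.download("stopwords")
stemmer = SnowballStemmer("russian", ignore_stopwords=True)
morph = pymorphy2.MorphAnalyzer()
# A token is a run of word characters, apostrophes, and hyphens.
retoken = re.compile(r'[\'\w\-]+')


def stemmatize(text):
    """Stem each whitespace-separated word with the Russian Snowball stemmer."""
    return ' '.join(stemmer.stem(word) for word in text.split())


def tokenize_normalize(text):
    """Lowercase the text, tokenize it, and lemmatize each token with pymorphy2."""
    tokens = retoken.findall(text.lower())
    # parse() returns candidate analyses sorted by score; [0] is the most likely one.
    return ' '.join(morph.parse(token)[0].normal_form for token in tokens)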
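
The tokenization step can be tried on its own, without pymorphy2 or NLTK installed. The sketch below reuses the paste's `retoken` pattern; the `tokenize` helper and the sample sentence are illustrative assumptions, not part of the original paste.

```python
import re

# Same pattern as `retoken` in the paste: runs of word characters,
# apostrophes, and hyphens. In Python 3, \w matches Cyrillic letters too.
retoken = re.compile(r'[\'\w\-]+')


def tokenize(text):
    """Hypothetical helper: lowercase the text and extract tokens."""
    return retoken.findall(text.lower())


tokens = tokenize("Кто-то сказал: «Привет, мир!»")
print(tokens)  # ['кто-то', 'сказал', 'привет', 'мир']
```

Hyphenated words such as "кто-то" stay as a single token, while quotation marks and other punctuation are dropped before lemmatization.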