Untitled

a guest
Jul 18th, 2018
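# Assumption: the paste never defines `stopwords`. This small placeholder
# set only exists so the function runs on its own; substitute the real list.
stopwords = {'the', 'a', 'an', 'and', 'of', 'to'}

def tokenize(f):
    """Return lowercase, stopword-filtered tokens from the body of a message read from f."""
    # Skip headers: the body starts after the first blank line.
    # If no blank line exists, find() returns -1 and the whole text is kept.
    data = f.read()
    eoh = data.find('\n\n')
    data = data[eoh+1:]
    data = data.lower()

    # More opportunities here as well if this turns out to be a good idea
    for char in ['\n', '.', '!', "'", ':', '?', '@', '=', ',']:
        data = data.replace(char, ' ')

    # Split adjacent tags such as "<b><i>" so they do not fuse into one token
    data = data.replace('><', '> <')
    tokens = data.split()
    # split() never yields empty strings, so only the stopword check is needed
    tokens = [t for t in tokens if t not in stopwords]

    return tokens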
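
A minimal usage sketch, assuming the input is an email-style message whose headers end at the first blank line. The name `tokenize_sketch`, the `STOPWORDS` set, and the sample message are hypothetical and not part of the paste. The sketch also swaps the per-character `replace` loop for a single `str.translate` pass, which should give the same result because every listed character maps to a space.

```python
import io

# Assumption: placeholder stopword set; the paste does not show the real one.
STOPWORDS = {"the", "a", "is", "to"}

# The same characters the original loop replaces, mapped to a space in one pass.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in "\n.!':?@=,"})


def tokenize_sketch(f, stopwords=STOPWORDS):
    """Hypothetical variant of tokenize() that takes the stopwords as a parameter."""
    data = f.read()
    # Skip headers; if no blank line is found, find() returns -1 and all text is kept.
    eoh = data.find("\n\n")
    data = data[eoh + 1:].lower()
    data = data.translate(_PUNCT_TO_SPACE)
    # Split adjacent tags so "<b><i>" does not become a single token.
    data = data.replace("><", "> <")
    return [t for t in data.split() if t not in stopwords]


# Made-up sample message: two header lines, a blank line, then the body.
message = io.StringIO(
    "From: alice@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Hello, World! This is a <b><i>test</i></b>.\n"
)
tokens = tokenize_sketch(message)
print(tokens)
```

Passing the stopwords as a parameter instead of reading a module-level name is one possible design choice; it makes the function easier to test with different word lists.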