Untitled

a guest
Mar 18th, 2018
## POS tagging labels each word in a sentence as a noun, adjective, verb, etc.
## This script needs the NLTK state_union corpus plus the tokenizer and
## tagger models, which can be installed with nltk.download().
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

## PunktSentenceTokenizer is a trainable sentence tokenizer.
## It learns sentence boundaries through unsupervised machine learning,
## so you can train it on any body of text that you use.

## Creating training and testing data
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

## Train the Punkt tokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

## Split the sample text into a list of sentences
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        ## Chunking groups tagged words into meaningful phrases.
        ## This grammar matches zero or more adverbs (RB, RBR, RBS),
        ## then zero or more verbs (VB, VBD, ...), then one or more
        ## proper nouns (NNP), then an optional singular noun (NN).
        chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
        ## The grammar never changes, so build the parser once
        chunkParser = nltk.RegexpParser(chunkGram)

        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            ## print(tagged)

            chunked = chunkParser.parse(tagged)
            ## print(chunked)
            ## chunked.draw()
            ## The "chunked" variable is an NLTK tree.
            ## Each chunk is a subtree of that tree; subtrees() also
            ## yields the root tree for the whole sentence:
            ## for subtree in chunked.subtrees():
            ##     print(subtree)
            ## Print only the subtrees with the label Chunk assigned above
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

    except Exception as e:
        print(e)

process_content()

## Chinking is a lot like chunking: it is a way to remove part of a chunk.
## The part that you remove from your chunk is your chink.
## In the grammar, chunk patterns are written as {...} and chink patterns as }...{
## chunkGram = r"""Chunk: {<.*>+}
##                        }<VB.?|IN|DT|TO>+{"""
## This chunks everything first, then removes from each chunk any run of
## one or more verbs, prepositions, determiners, or the word 'to'.
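
## A minimal sketch of the chinking grammar above, run on one hand-tagged
## sentence instead of the State of the Union text, so it needs no corpus or
## tagger download. The sentence, its tags, and the names tagged_sentence,
## chink_grammar, chink_parser and chunks are assumptions made up for this
## illustration; they are not part of the original script.

```python
import nltk

# Hypothetical, hand-tagged input (Penn Treebank tags), for illustration only.
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# First chunk every token, then chink (remove) runs of verbs,
# prepositions, determiners, and 'to'.
chink_grammar = r"""
Chunk:
    {<.*>+}
    }<VB.?|IN|DT|TO>+{
"""

chink_parser = nltk.RegexpParser(chink_grammar)
tree = chink_parser.parse(tagged_sentence)

# Collect the words of each remaining Chunk subtree.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)
```

## The single all-inclusive chunk is split wherever a chinked tag occurs, so
## only the words between the removed runs remain grouped as chunks.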