Untitled

a guest, Sep 21st, 2019
import re


class BanglaTokenizer:
    """Regex-based sentence and word tokenizer for Bangla text."""

    @staticmethod
    def sentenceTokenize(text):
        # Split on '?', '!' and the danda (।); drop empty or whitespace-only pieces.
        pieces = re.split(r'[?!।]', text)
        return [p.strip() for p in pieces if p.strip()]

    @staticmethod
    def wordTokenize(sentence):
        # \w does not match Bangla combining signs (vowel signs, hasanta, ...),
        # so the Bengali Unicode block U+0980-U+09FF is added to the class.
        # Inside [...] alternatives need no '|' separator, and conjuncts are
        # already covered by their letters plus the hasanta.
        return re.findall(r'[\w\u0980-\u09FF]+', sentence)


# Sample input (assumed): the paste never defines `text`.
text = 'আমি বাংলায় গান গাই। তুমি কেমন আছো?'

senten = BanglaTokenizer.sentenceTokenize(text)
print(senten[0])

print(BanglaTokenizer.wordTokenize(senten[0]))
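
wordTokenize extends `\w` with Bangla-specific characters because, in Python's `re` module, `\w` matches letters, digits, and the underscore but not combining marks such as dependent vowel signs. The sketch below illustrates that effect on a single word. The `\u0980-\u09FF` range (the Unicode Bengali block) is used here as a compact stand-in for listing the signs one by one, and the names `plain`, `extended`, and `marks` are hypothetical, chosen only for this example.

```python
import re
import unicodedata

# The word "বাংলা", written with escapes so the code points are unambiguous:
# BA, AA-sign, ANUSVARA, LA, AA-sign
word = "\u09ac\u09be\u0982\u09b2\u09be"

# Plain \w stops at every combining mark, so the word falls apart.
plain = re.findall(r"\w+", word)

# Adding the Bengali block keeps the marks attached to their letters.
extended = re.findall(r"[\w\u0980-\u09FF]+", word)

# The characters that \w skipped all have a Unicode "Mark" category (M*).
marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]

print(plain)
print(extended)
print(len(marks))
```

With plain `\w+` the word is split at each sign, while the extended class returns it in one piece, which is the behaviour a word tokenizer for Bangla needs.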