Untitled

a guest
Dec 7th, 2016
import re
import string

def processData(data):
    data = data.lower()  # lowercase (case-fold) the text
    data = re.sub(r'<[^>]*>', ' ', data)  # replace any HTML tag with a space

    data = re.sub(r'#([^\s]+)', r'\1', data)  # replace #word with word
    remove = string.punctuation
    remove = remove.replace("'", "")  # don't remove '
    p = r"[{}]".format(re.escape(remove))  # character class of punctuation to strip
    data = re.sub(p, "", data)

    data = re.sub(r'\s+', ' ', data)  # collapse runs of whitespace into one space

    pp = re.compile(r"(.)\1{1,}", re.DOTALL)  # a character repeated two or more times
    data = pp.sub(r"\1\1", data)  # shorten each such run to two characters

    return data
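
The repetition step can be tried on its own. This is a minimal sketch, assuming the pattern is meant as a backreference (`(.)` followed by `\1`), so that any run of the same character is shortened to two. The name `squeeze_repeats` is hypothetical and not part of the paste.

```python
import re

# (.) captures one character; \1+ matches one or more further copies of it.
# re.DOTALL lets "." match newlines too, so repeated newlines are also squeezed.
squeeze = re.compile(r"(.)\1+", re.DOTALL)

def squeeze_repeats(text):
    """Hypothetical helper: shorten any run of a repeated character to two."""
    return squeeze.sub(r"\1\1", text)

print(squeeze_repeats("sooooo goooood"))  # soo good
```

Runs are cut to two characters rather than one, so ordinary doubled letters such as the "ll" in "hello" are left alone.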