Untitled

a guest
Nov 13th, 2019
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover


def clean_post(post):
    """Take the posts DataFrame as input and return a new DataFrame with two added
    columns: words (tokenization of body) and words_filt (words with stopwords removed)."""
    # Tokenize body by splitting on non-word characters
    regTok = RegexTokenizer(inputCol="body", outputCol="words", pattern="\\W")
    post_tok = regTok.transform(post)

    # Remove English stopwords. loadDefaultStopWords only returns the list,
    # so it is passed to the remover explicitly.
    remover = StopWordsRemover(
        inputCol="words",
        outputCol="words_filt",
        stopWords=StopWordsRemover.loadDefaultStopWords("english"),
    )
    post_words = remover.transform(post_tok)

    return post_words
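
To see what the two steps in clean_post do to a single body string, here is a rough pure-Python sketch that runs without Spark. It assumes the default RegexTokenizer behaviour (lowercase the text, split on the pattern, drop empty tokens) and uses a tiny hypothetical STOPWORDS set; Spark's default English list is much longer. The helper names tokenize and remove_stopwords are illustrative and not part of the original paste.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# Spark's default English list contains many more entries.
STOPWORDS = {"the", "is", "a", "and", "of"}


def tokenize(body):
    # Lowercase, split on non-word characters, and drop empty tokens
    return [tok for tok in re.split(r"\W", body.lower()) if tok]


def remove_stopwords(words):
    # Keep only the tokens that are not in the stopword set
    return [w for w in words if w not in STOPWORDS]


words = tokenize("The cat is on the mat, and happy!")
words_filt = remove_stopwords(words)
print(words)       # analogous to the "words" column
print(words_filt)  # analogous to the "words_filt" column
```

With this small stopword set, "on" survives the filter; with Spark's full English list the filtered column would likely be shorter.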