Untitled

baobabkoodaa Feb 8th, 2017 62 Never
import math
import pandas as pd

# Create "p", which holds the probabilities P(word|newsgroup) = P(a randomly selected document from the given newsgroup contains the given word)
train_word_counts = {}
p = {}
for newsgroup in train:
    ...
    # Number of documents of this newsgroup in which each word was seen (assumes one row per wordID-docID combination)
    train_word_counts[newsgroup] = train[newsgroup]['wordID'].value_counts()
    # Count documents, plus 2 as the add-one pseudocount for the two outcomes (word present, word absent).
    # Adding 53975 (one per word) here would not match the per-document definition of P(word|newsgroup) above.
    count_docs = train[newsgroup]['docID'].nunique() + 2
    # P(word|newsgroup) as explained at the top, with +1 added to every observed count
    fraction = train_word_counts[newsgroup].add(1) / count_docs
    # Order by wordID 1..53975; words never seen in this newsgroup get the pseudocount value
    p[newsgroup] = pd.DataFrame({'fraction': fraction}).reindex(range(1, 53976))
    p[newsgroup] = p[newsgroup].fillna(1 / count_docs)


# Classify rows, which are wordID-docID combinations, each representing "this word is present in this document".
# We write our results to DataFrame "b", where rows represent documents and columns represent newsgroups.
# At first we fill the columns with the log prior of each newsgroup.
# I'm omitting the prior calculations, because my results don't change materially even when all priors are set to 1/20 (and log-transformed).
for row in test.itertuples():  # "test" is an assumed name for the data we want to classify
    docID = row[1]
    wordID = row[2]
    for newsgroup in p:
        # val represents how likely it is that this document belongs to this newsgroup, before dealing with the current row
        val = b.at[docID, newsgroup]
        # P(word|newsgroup)
        wordFraction = p[newsgroup].iat[wordID - 1, 0]
        # Summing log probabilities gives the same ranking as multiplying the raw probabilities, without floating-point underflow
        val += math.log(wordFraction)
        b.at[docID, newsgroup] = val

# For each doc, pick the newsgroup with max val.
predictions = b.idxmax(axis=1)
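
The two steps above (estimate P(word|newsgroup), then sum log probabilities per document) can be tried end to end on toy data. The following is only a sketch under assumptions, not the original code: the newsgroup names, the five-word vocabulary, the `test` frame and the helper `estimate_word_probs` are all made up for illustration. It uses add-one smoothing with +2 in the denominator (word present or absent), uniform priors, and scores only the words that are present in a document, as the paste does. The per-row loop is replaced by a vectorized lookup and a `groupby` sum.

```python
import math

import numpy as np
import pandas as pd

VOCAB_SIZE = 5  # toy vocabulary; the paste uses 53975 word IDs

# Hypothetical training data: one row per (docID, wordID) pair.
train = {
    "sci.space": pd.DataFrame({"docID": [1, 1, 2, 2], "wordID": [1, 2, 1, 3]}),
    "rec.autos": pd.DataFrame({"docID": [3, 3, 4, 4], "wordID": [4, 5, 4, 3]}),
}


def estimate_word_probs(docs, vocab_size):
    """Smoothed fraction of documents in `docs` that contain each word."""
    n_docs = docs["docID"].nunique()
    # Number of distinct documents containing each word.
    doc_counts = docs.drop_duplicates(["docID", "wordID"])["wordID"].value_counts()
    # Words never seen in this newsgroup get a count of 0.
    doc_counts = doc_counts.reindex(range(1, vocab_size + 1), fill_value=0)
    # Add-one smoothing over two outcomes: present / absent.
    return (doc_counts + 1) / (n_docs + 2)


# Rows are word IDs, columns are newsgroups.
p = pd.DataFrame({ng: estimate_word_probs(df, VOCAB_SIZE) for ng, df in train.items()})
log_p = np.log(p)

# Hypothetical data to classify, same layout as the training data.
test = pd.DataFrame({"docID": [10, 10, 11, 11], "wordID": [1, 2, 4, 5]})

# Uniform log prior (the paste uses 1/20 for 20 newsgroups).
log_prior = math.log(1 / len(train))

# One row of log P(word|newsgroup) per test row, relabelled by document.
word_scores = log_p.loc[test["wordID"].to_numpy()]
word_scores.index = test["docID"].to_numpy()

# "b": rows are documents, columns are newsgroups.
b = word_scores.groupby(level=0).sum() + log_prior

# For each doc, pick the newsgroup with the highest score.
predictions = b.idxmax(axis=1)
print(predictions)
```

With uniform priors the `log_prior` term shifts every column by the same amount, so it cannot change which newsgroup wins; it only matters once the priors differ.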