  1. #A Collection of NLP notes
  2.  
  3. ##N-grams
  4.  
  5. ###Calculating unigram probabilities:
  6.  
  7. P( w<sub>i</sub> ) = count ( w<sub>i</sub> ) / count ( total number of words )
  8.  
  9. In English..
  10.  
  11. Probability of word<sub>i</sub> =
  12. Frequency of word (i) in our corpus / total number of words in our corpus
  13.  
  14. ###Calculating bigram probabilities:
  15.  
  16. P( w<sub>i</sub> | w<sub>i-1</sub> ) = count ( w<sub>i-1</sub>, w<sub>i</sub> ) / count ( w<sub>i-1</sub> )
  17.  
  18. In English..
  19.  
  20. Probability that word<sub>i-1</sub> is followed by word<sub>i</sub> =
  21. [Num times we saw word<sub>i-1</sub> followed by word<sub>i</sub>] / [Num times we saw word<sub>i-1</sub>]
  22.  
  23. Example
  24. -------
  25. - **s** = beginning of sentence
  26. - **/s** = end of sentence
  27.  
  28. ####Given the following corpus:
  29.  
  30. **s** I am Sam **/s**
  31.  
  32. **s** Sam I am **/s**
  33.  
  34. **s** I do not like green eggs and ham **/s**
  35.  
  36. We can calculate bigram probabilities as such:
  37.  
  38. - P( I | **s** ) = 2/3
  39.  
  40. => Probability that an **s** is followed by an `I`
  41.  
  42. = [Num times we saw `I` follow **s** ] / [Num times we saw an **s** ]
  43. = 2 / 3
  44.  
  45. - P( Sam | am ) = 1/2
  46.  
  47. => Probability that `am` is followed by `Sam`
  48.  
  49. = [Num times we saw `Sam` follow `am` ] / [Num times we saw `am`]
  50. = 1 / 2
  51.  
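These counts are easy to reproduce in code. Below is a minimal Python sketch of the maximum-likelihood unigram and bigram estimates for the toy corpus above (the `<s>` / `</s>` tokens stand in for the **s** / **/s** markers):

```python
from collections import Counter

# Toy corpus from the example above, with sentence-boundary markers.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

tokens = [w for s in sentences for w in s.split()]
unigram_counts = Counter(tokens)

bigram_counts = Counter()
for s in sentences:
    words = s.split()
    bigram_counts.update(zip(words, words[1:]))

def unigram_prob(w):
    # count(w) / total number of words
    return unigram_counts[w] / len(tokens)

def bigram_prob(prev, w):
    # count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))   # 2/3
print(bigram_prob("am", "Sam"))  # 1/2
```
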
  52. ###Calculating trigram probabilities:
  53.  
  54. Building off the logic in bigram probabilities,
  55.  
  56. P( w<sub>i</sub> | w<sub>i-2</sub> w<sub>i-1</sub> ) = count ( w<sub>i-2</sub>, w<sub>i-1</sub>, w<sub>i</sub> ) / count ( w<sub>i-2</sub>, w<sub>i-1</sub> )
  57.  
  58. In English...
  59.  
  60. Probability that word<sub>i-2</sub> followed by word<sub>i-1</sub> is then followed by word<sub>i</sub> = [Num times we saw the three words in order] / [Num times we saw word<sub>i-2</sub> followed by word<sub>i-1</sub>]
  61.  
  62. Example
  63. -------
  64.  
  65. - P( Sam | I am )
  66. = count( I am Sam ) / count( I am )
  67. = 1 / 2
  68.  
  69. ###Interpolation using N-grams
  70.  
  71. We can combine knowledge from each of our n-grams by using interpolation.
  72.  
  73. E.g. assuming we have calculated unigram, bigram, and trigram probabilities, we can do:
  74.  
  75. P ( Sam | I am ) = &lambda;<sub>1</sub> x P( Sam ) + &lambda;<sub>2</sub> x P( Sam | am ) + &lambda;<sub>3</sub> x P( Sam | I am )
  76.  
  77. Using our corpus and assuming all lambdas = 1/3,
  78.  
  79. P ( Sam | I am ) = (1/3)x(2/20) + (1/3)x(1/2) + (1/3)x(1/2)
  80.  
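As a quick sanity check, here is the same interpolation written out in Python; the three component probabilities are the ones worked out above, and the equal lambda weights are just the assumption made in the example:

```python
# Linear interpolation of unigram, bigram, and trigram estimates.
lambdas = (1/3, 1/3, 1/3)        # assumed equal weights, as in the example
p_unigram = 2 / 20               # P( Sam )
p_bigram  = 1 / 2                # P( Sam | am )
p_trigram = 1 / 2                # P( Sam | I am )

p_interpolated = sum(l * p for l, p in zip(lambdas, (p_unigram, p_bigram, p_trigram)))
print(p_interpolated)            # ~0.367
```
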
  81. In web-scale applications, there is too much data to compute properly normalized, interpolated probabilities efficiently, so we use **Stupid Backoff** instead.
  82.  
  83. In **Stupid Backoff**, we use the trigram if we have enough data points to make it seem credible; otherwise, if we don't have enough of a trigram count, we back off and use the bigram, and if there still isn't enough of a bigram count, we use the unigram probability.
  84.  
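A minimal sketch of Stupid Backoff, assuming the n-gram counts live in plain dicts keyed by word tuples; it returns relative scores rather than normalized probabilities, and the 0.4 back-off factor is just a commonly used default:

```python
def stupid_backoff(w, prev1, prev2, unigrams, bigrams, trigrams, total_words, alpha=0.4):
    """Stupid Backoff score S( w | prev2 prev1 ).

    Use the trigram relative frequency if the trigram was seen; otherwise back
    off (scaled by alpha) to the bigram, and finally to the unigram.
    """
    tri = trigrams.get((prev2, prev1, w), 0)
    if tri > 0:
        return tri / bigrams[(prev2, prev1)]
    bi = bigrams.get((prev1, w), 0)
    if bi > 0:
        return alpha * bi / unigrams[prev1]
    return alpha * alpha * unigrams.get(w, 0) / total_words
```
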
  85. ###Smoothing Algorithms
  86.  
  87. Let's say we've calculated some n-gram probabilities, and now we're analyzing some text. What happens when we encounter a word we haven't seen before? How do we know what probability to assign to it?
  88.  
  89. We use `smoothing` to give it a probability.
  90.  
  91. => Use the count of things we've only seen *once* in our corpus to estimate the count of things we've *never seen*.
  92.  
  93. This is the intuition used by many smoothing algorithms.
  94.  
  95. ###Good-Turing Smoothing
  96.  
  97. Notation:
  98.  
  99. N<sub>c</sub> = the count of things with frequency `c` - how many *things* occur with frequency `c` in our corpus.
  100.  
  101. Good Turing modifies our:
  102. - n-gram probability function for things we've never seen (things that have count 0)
  103. - count for things we *have* seen (since all probabilities add to 1, we have to modify this count if we are introducing new probabilities for things we've never seen)
  104.  
  105. Modified Good-Turing probability function:
  106.  
  107. P<sup>*</sup> ( things with 0 count ) = N<sub>1</sub> / N
  108.  
  109. => [Num things with frequency 1] / [Num things]
  110.  
  111. Modified Good-Turing count:
  112.  
  113. count<sup>*</sup> = [ (count + 1) x N<sub>c+1</sub> ] / [ N<sub>c</sub> ]
  114.  
  115. Example
  116. -------
  117.  
  118. Assuming our corpus has the following frequency count:
  119.  
  120. carp: 10
  121. perch: 3
  122. whitefish: 2
  123. trout: 1
  124. salmon: 1
  125. eel: 1
  126.  
  127. Calculating the probability of something we've never seen before:
  128.  
  129. P ( catfish ) = N<sub>1</sub> / N = 3 / 18
  130.  
  131. Calculating the modified count of something we've seen:
  132.  
  133. count<sup>*</sup> ( trout )
  134.  
  135. = [ (1 + 1) x N<sub>2</sub> ] / [ N<sub>1</sub> ]
  136. = [ 2 x 1 ] / [ 3 ]
  137. = 2 / 3
  138.  
  139. Calculating the probability of something we've seen:
  140.  
  141. P<sup>*</sup> ( trout ) = count ( trout ) / count ( all things ) = (2/3) / 18 = 1/27
  142.  
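The whole example can be checked with a few lines of Python; `counts` below is just the fish-frequency table above:

```python
from collections import Counter

counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}

N = sum(counts.values())        # 18 observations in total
Nc = Counter(counts.values())   # Nc[c] = how many species were seen exactly c times

p_unseen = Nc[1] / N            # P( catfish ) = 3/18

def adjusted_count(c):
    # Good-Turing count: c* = (c + 1) x N_{c+1} / N_c
    return (c + 1) * Nc[c + 1] / Nc[c]

print(p_unseen)                 # 0.1666...
print(adjusted_count(1))        # 2/3, the modified count of trout
print(adjusted_count(1) / N)    # ~1/27, the modified probability of trout
# Note: adjusted_count(3) would be 0 here, because nothing occurs 4 times;
# that is exactly the gap discussed next.
```
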
  143. What happens if no word occurred exactly `c+1` times, so that N<sub>c+1</sub> = 0?
  144.  
  145. => Once we have a sufficient amount of training data, we generate a best-fit curve to make sure we can calculate an estimate of N<sub>c+1</sub> for any `c`.
  146.  
  147. ###Kneser-Ney Smoothing
  148.  
  149. A problem with Good-Turing smoothing is apparent in analyzing the following sentence, to determine what word comes next:
  150.  
  151. ```
  152. I can't see without my reading ___________
  153. ```
  154.  
  155. The unigram `Francisco` is more common than `glasses`, so a model that backs off to raw unigram counts may end up choosing `Francisco` here instead of the correct choice, `glasses`, even though `Francisco` almost always appears right after `San`.
  156.  
  157. The Kneser-Ney smoothing algorithm has a notion of `continuation probability` which helps with these sorts of cases. It also saves you from having to recalculate all your counts using `Good-Turing` smoothing.
  158.  
  159. Here's how you calculate the K-N probability with bigrams:
  160.  
  161. P<sub>kn</sub>( w<sub>i</sub> | w<sub>i-1</sub> ) = [ max( count( w<sub>i-1</sub>, w<sub>i</sub> ) - `d`, 0) ] / [ count( w<sub>i-1</sub> ) ] + &Theta;( w<sub>i-1</sub> ) x P<sub>continuation</sub>( w<sub>i</sub> )
  162.  
  163. Where:
  164.  
  165. **P<sub>continuation</sub>( w<sub>i</sub> )**
  166.  
  167. represents the continuation probability of w<sub>i</sub>. This is the number of distinct bigram types that end in w<sub>i</sub> (i.e. the number of different words w<sub>i</sub> has been seen to follow), divided by the total number of distinct bigram types. It gives an indication of how likely w<sub>i</sub> is to appear as the second word of an unseen bigram (such as `reading ________`)
  168.  
  169. **&Theta;( )**
  170. This is a `normalizing constant`; since we are subtracting a `discount weight` **d** from each bigram count, we need to redistribute that discounted probability mass. It gives us the weighting for our P<sub>continuation</sub>.
  171.  
  172. We calculate this as follows:
  173.  
  174. &Theta;( w<sub>i-1</sub> ) = { d x [ Num distinct words that follow w<sub>i-1</sub> ] } / [ count( w<sub>i-1</sub> ) ]
  175.  
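Here is a minimal sketch of an interpolated Kneser-Ney bigram model in Python, assuming the input is a flat list of tokens; `d = 0.75` is just a commonly used discount value, and the `lam` term corresponds to the &Theta;( w<sub>i-1</sub> ) normalizing constant above:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram_model(tokens, d=0.75):
    """Return p(w, prev) implementing the bigram Kneser-Ney formula above."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    preceders = defaultdict(set)   # distinct words seen before w
    followers = defaultdict(set)   # distinct words seen after a context
    for a, b in bigrams:
        preceders[b].add(a)
        followers[a].add(b)
    total_bigram_types = len(bigrams)

    def p(w, prev):
        p_continuation = len(preceders[w]) / total_bigram_types
        lam = d * len(followers[prev]) / unigrams[prev]            # Theta( prev )
        discounted = max(bigrams[(prev, w)] - d, 0) / unigrams[prev]
        return discounted + lam * p_continuation

    return p
```
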
  176. ###Kneser-Ney Smoothing for N-grams
  177.  
  178. The Kneser-Ney probability we discussed above showed only the bigram case.
  179.  
  180. For N-grams, the probability can be generalized as follows:
  181.  
  182. P<sub>kn</sub>( w<sub>i</sub> | w<sub>i-n+1</sub><sup>i-1</sup>) = [ max( count<sub>kn</sub>( w<sub>i-n+1</sub><sup>i</sup> ) - `d`, 0) ] / [ count<sub>kn</sub>( w<sub>i-n+1</sub><sup>i-1</sup> ) ] + &Theta;( w<sub>i-n+1</sub><sup>i-1</sup> ) x P<sub>kn</sub>( w<sub>i</sub> | w<sub>i-n+2</sub><sup>i-1</sup> )
  183.  
  184. Where:
  185.  
  186. c<sub>kn</sub>(&bull;) =
  187.  
  188. - the actual count(&bull;) for the highest order n-gram
  189.  
  190. or
  191.  
  192. - continuation_count(&bull;) for lower order n-gram
  193.  
  194. => continuation_count = Number of unique single word contexts for &bull;
  195.  
  196. ##Spelling Correction
  197.  
  198. We can imagine a noisy channel model for this (representing the keyboard).
  199.  
  200. `original word` ~~~~~~~~~Noisy Channel~~~~~~~~> `noisy word`
  201.  
  202. Our decoder receives a `noisy word`, and must try to guess what the `original` (intended) word was.
  203.  
  204. So what we can do is generate **N** possible `original words`, and run them through our **noisy channel** and see which one looks most like the `noisy word` we received.
  205.  
  206. The corrected word, w<sup>*</sup>, is the word in our vocabulary (`V`) that has the maximum probability of being the correct word (`w`), given the input `x` (the misspelled word).
  207.  
  208. w<sup>*</sup> = argmax<sub>w&isin;V</sub> P( w | x )
  209.  
  210. Using Bayes' Rule, we can rewrite this as:
  211.  
  212. w<sup>*</sup> = argmax<sub>w&isin;V</sub> P( x | w ) x P( w )
  213.  
  214. **P( x | w )** is determined by our **channel model**.
  215. **P( w )** is determined by our **language model** (using N-grams).
  216.  
  217. The first thing we have to do is generate candidate words to compare to the misspelled word.
  218.  
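One common way to generate candidates is to enumerate every string within one Damerau-Levenshtein edit of the misspelling and keep only the ones that appear in our vocabulary (the approach popularized by Peter Norvig's spelling corrector). A minimal sketch, where `vocabulary` is assumed to be a set of known words:

```python
import string

def edits1(word):
    """All strings one deletion, transposition, substitution, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutions = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

def candidates(word, vocabulary):
    """Known words within one edit of `word`, plus the word itself if it is known."""
    return {w for w in edits1(word) | {word} if w in vocabulary}

print(candidates("acress", {"actress", "cress", "caress", "access", "acres", "across"}))
```
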
  219. ###Confusion Matrix
  220.  
  221. This is how we model our noisy channel. A confusion matrix gives us the probability that a given spelling mistake (or word edit) happened at a given location in the word. We use the **Damerau-Levenshtein** edit types (deletion, insertion, substitution, transposition). These account for 80% of human spelling errors.
  222.  
  223. - del[a,b]: count( ab typed as a )
  224. - ins[a,b]: count( a typed as ab )
  225. - sub[a,b]: count( a typed as b )
  226. - trans[a,b]: count( ab typed as ba )
  227.  
  228. Our confusion matrix keeps counts of the frequencies of each of these operations for each letter in our alphabet, and from this matrix we can generate probabilities.
  229.  
  230. We would need to train our confusion matrix, for example using Wikipedia's list of common English word misspellings.
  231.  
  232. After we've generated our confusion matrix, we can generate probabilities.
  233.  
  234. Let w<sub>i</sub> denote the i<sup>th</sup> character in the word **w**.
  235.  
  236. **p( x | w ) =**
  237.  
  238. - if deletion, [ del( w<sub>i-1</sub>, w<sub>i</sub> ) ] / [ count( w<sub>i-1</sub> w<sub>i</sub> ) ]
  239. - if insertion, [ ins( w<sub>i-1</sub>, x<sub>i</sub> ) ] / [ count( w<sub>i-1</sub> ) ]
  240. - if substitution, [ sub( x<sub>i</sub>, w<sub>i</sub> ) ] / [ count( w<sub>i</sub> ) ]
  241. - if transposition, [ trans( w<sub>i</sub>, w<sub>i+1</sub> ) ] / [ count( w<sub>i</sub> w<sub>i+1</sub> ) ]
  242.  
  243. Suppose we have the misspelled word **x** = **acress**
  244.  
  245. We can generate our channel model for **acress** as follows:
  246.  
  247. **actress**
  248.  
  249. => Correct letter : `t`
  250.  
  251. => Error letter : `-`
  252.  
  253. => **x | w** : `c` | `ct` (probability of deleting a `t` given the correct spelling has a `ct`)
  254.  
  255. => P( x | w ) : 0.000117
  256.  
  257. **cress**
  258.  
  259. => Correct letter : `-`
  260.  
  261. => Error letter : `a`
  262.  
  263. => **x | w** : `a` | `-`
  264.  
  265. => P( x | w ) : 0.00000144
  266.  
  267. **caress**
  268.  
  269. => Correct letter : `ca`
  270.  
  271. => Error letter : `ac`
  272.  
  273. => **x | w** : `ac` | `ca`
  274.  
  275. => P( x | w ) : 0.00000164
  276.  
  277. ... and so on
  278.  
  279. We would combine the information from our channel model by multiplying it by our n-gram probability.
  280.  
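A small sketch of that combination step, assuming we already have the channel probabilities (like the `acress` values above) and a bigram language-model lookup; both are passed in rather than computed here:

```python
def best_correction(channel_probs, bigram_prob, prev_word):
    """Pick the candidate w maximizing P( x | w ) x P( w | previous word ).

    channel_probs: dict mapping each candidate word w to P( x | w ).
    bigram_prob(prev, w): an assumed language-model lookup, e.g. the
    bigram_prob function sketched in the N-grams section.
    """
    scores = {w: p_x_given_w * bigram_prob(prev_word, w)
              for w, p_x_given_w in channel_probs.items()}
    return max(scores, key=scores.get), scores
```
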
  281. ###Real-Word Spelling Correction
  282.  
  283. What happens when a user misspells a word as another, **valid** english word?
  284.  
  285. E.g. I have fifteen **minuets** to leave the house.
  286.  
  287. We find valid English words that have an edit distance of 1 from the input word.
  288.  
  289. Given a sentence w<sub>1</sub>, w<sub>2</sub>, w<sub>3</sub>, ..., w<sub>n</sub>
  290.  
  291. Generate a set of candidate words for each w<sub>i</sub>
  292.  
  293. - Candidate( w<sub>1</sub> ) = { w<sub>1</sub>, w<sub>1</sub><sup>'</sup>, w<sub>1</sub><sup>''</sup>, ... }
  294. - Candidate( w<sub>2</sub> ) = { w<sub>2</sub>, w<sub>2</sub><sup>'</sup>, w<sub>2</sub><sup>''</sup>, ... }
  295. - Candidate( w<sub>n</sub> ) = { w<sub>n</sub>, w<sub>n</sub><sup>'</sup>, w<sub>n</sub><sup>''</sup>, ... }
  296.  
  297. Note that the candidate sets include the original word itself (since it may actually be correct!)
  298.  
  299. Then we choose the sequence of candidates **W** that has the maximal probability.
  300.  
  301. Example
  302. -------
  303.  
  304. Given the sentence `two of thew`, our sequences of candidates may look like:
  305.  
  306. - two of thew
  307. - two of the
  308. - to off threw
  309. - to off the
  310. - to on threw
  311. - to on the
  312. - to of threw
  313. - to of the
  314. - too of threw
  315. - too of the
  316.  
  317. Then we ask ourselves, of all possible sentences, which has the highest probability?
  318.  
  319. In practice, we simplify by considering only the cases where exactly one word of the sentence was mistyped (note that above we were considering every combination where each word could have been mistyped). So we look at all possibilities with one word replaced at a time. This reduces the number of candidate sentences from exponential in the sentence length to linear in the sentence length.
  320.  
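A sketch of that simplification: generate every variant of the sentence with at most one word swapped for a real-word candidate (the `candidates` helper is assumed to return the set of valid words within edit distance 1, including the word itself):

```python
def candidate_sentences(words, candidates):
    """Yield the sentence as typed, plus every variant with exactly one word replaced."""
    yield list(words)
    for i, w in enumerate(words):
        for alt in candidates(w):
            if alt != w:
                yield words[:i] + [alt] + words[i + 1:]

# Each candidate sentence would then be scored with the language model
# (and the channel model for the word that was changed).
```
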
  321. Where do we get these probabilities?
  322.  
  323. - Our language model (unigrams, bigrams, ..., n-grams)
  324. - Our Channel model (same as for non-word spelling correction)
  325.  
  326. Our Noisy Channel model can be further improved by looking at factors like:
  327.  
  328. - The nearby keys in the keyboard
  329. - Letters or word-parts that are pronounced similarly (such as `ant`->`ent`)
  330.  
  331. ##Text Classification
  332.  
  333. Text Classification allows us to do things like:
  334.  
  335. - determining if an email is spam
  336. - determining who is the author of some piece of text
  337. - determining the likelihood that a piece of text was written by a man or a woman
  338. - performing sentiment analysis on some text
  339.  
  340. Let's define the Task of Text Classification
  341.  
  342. Given:
  343.  
  344. - a document **d**
  345. - a fixed set of classes **C = { c<sub>1</sub>, c<sub>2</sub>, ... , c<sub>n</sub> }**
  346.  
  347. Determine:
  348.  
  349. - the predicted class **c &isin; C**
  350.  
  351. Put simply, we want to take a piece of text, and assign a class to it.
  352.  
  353. ###Classification Methods
  354.  
  355. We can use **Supervised Machine Learning**:
  356.  
  357. Given:
  358.  
  359. - a document **d**
  360. - a fixed set of classes **C = { c<sub>1</sub>, c<sub>2</sub>, ... , c<sub>n</sub> }**
  361. - a training set of **m** documents that we have pre-determined to belong to a specific class
  362.  
  363. We train our classifier on the training set, which gives us a learned classifier.
  364.  
  365. We can then use this learned classifier to classify new documents.
  366.  
  367. Notation: we use &Upsilon;(d) = c to represent our classifier, where **&Upsilon;()** is the classifier, **d** is the document, and **c** is the class we assigned to the document.
  368.  
  369. (Google's `mark as spam` button probably works this way).
  370.  
  371. ####Naive Bayes Classifier
  372.  
  373. This is a simple (naive) classification method based on Bayes' rule. It relies on a very simple representation of the document (called the **bag of words** representation).
  374.  
  375. Imagine we have 2 classes ( **positive** and **negative** ), and our input is a text representing a review of a movie. We want to know whether the review was **positive** or **negative**. So we may have a bag of positive words (e.g. `love`, `amazing`, `hilarious`, `great`), and a bag of negative words (e.g. `hate`, `terrible`).
  376.  
  377. We may then count the number of times each of those words appears in the document, in order to classify the document as **positive** or **negative**.
  378.  
  379. This technique works well for **topic** classification; say we have a set of academic papers, and we want to classify them into different topics (computer science, biology, mathematics).
  380.  
  381. ####Bayes' Rule applied to Documents and Classes
  382.  
  383. For a document **d** and a class **c**, and using Bayes' rule,
  384.  
  385. P( c | d ) = [ P( d | c ) x P( c ) ] / [ P( d ) ]
  386.  
  387. The class mapping for a given document is the class which has the maximum value of the above probability.
  388.  
  389. Since all probabilities have P( d ) as their denominator, we can eliminate the denominator, and simply compare the different values of the numerator:
  390.  
  391. P( c | d ) &prop; P( d | c ) x P( c )
  392.  
  393. Now, what do we mean by the term **P( d | c )** ?
  394.  
  395. Let's represent the document as a set of features (words or tokens) **x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ...**
  396.  
  397. We can then re-write **P( d | c )** as:
  398.  
  399. P( x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ... , x<sub>n</sub> | c )
  400.  
  401. What about P( c ) ? How do we calculate it?
  402.  
  403. => P( c ) is the **total probability** of a class.
  404. => How often does this class occur in total?
  405.  
  406. E.g. in the case of classes **positive** and **negative**, we would be calculating the probability that any given review is **positive** or **negative**, without actually analyzing the current input document.
  407.  
  408. This is calculated by counting the relative frequencies of each class in a corpus.
  409.  
  410. E.g. out of 10 reviews we have seen, 3 have been classified as **positive**.
  411.  
  412. => P ( positive ) = 3 / 10
  413.  
  414. Now let's go back to the first term in the Naive Bayes equation:
  415.  
  416. **P( d | c )**, or **P( x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ... , x<sub>n</sub> | c )**.
  417.  
  418. How do we actually calculate this?
  419. ----------------------------------
  420.  
  421. We use some assumptions to simplify the computation of this probability:
  422.  
  423. - the **Bag of Words assumption** => assume the position of the words in the document doesn't matter.
  424. - **Conditional Independence** => Assume the feature probabilities **P( x<sub>i</sub> | c<sub>j</sub> )** are independent given the class **c**.
  425.  
  426. It is important to note that neither of these assumptions is actually correct - of course, the order of words matters, and word probabilities are not independent. A phrase like `this movie was incredibly terrible` shows an example of how both of these assumptions break down in regular English.
  427.  
  428. However, these assumptions greatly simplify the complexity of calculating the classification probability. And in practice, we can calculate probabilities with a reasonable level of accuracy given these assumptions.
  429.  
  430. So..
  431. ------
  432. To calculate the Naive Bayes probability, **P( d | c ) x P( c )**, we calculate P( x<sub>i</sub> | c ) for each x<sub>i</sub> in **d**, and multiply them together.
  433.  
  434. Then we multiply the result by P( c ) for the current class. We do this for each of our classes, and choose the class that has the maximum overall value.
  435.  
  436. ###How do we learn the values of **P ( c )** and **P ( x<sub>i</sub> | c )** ?
  437. ------------------------------------------------------------------------------
  438.  
  439. => We can use **Maximum Likelihood estimates**.
  440.  
  441. Simply put, we look at frequency counts.
  442.  
  443. P ( c<sub>i</sub> ) = [ Num documents that have been classified as c<sub>i</sub> ] / [ Num documents ]
  444.  
  445. In English..
  446.  
  447. Out of all the documents, how many of them were in class **i** ?
  448.  
  449. P ( w<sub>i</sub> | c<sub>j</sub> ) = [ count( w<sub>i</sub>, c<sub>j</sub> ) ] / [ &Sigma;<sub>w&isin;V</sub> count ( w, c<sub>j</sub> ) ]
  450.  
  451. In English...
  452.  
  453. The probability of word **i** given class **j** is the count that the word occurred in documents of class **j**, divided by the sum of the counts of each word in our vocabulary in class **j**.
  454.  
  455. So for the denominator, we iterate through each word in our vocabulary, look up the frequency with which it has occurred in class **j**, and add these up.
  456.  
  457. ####Problems with the Maximum-Likelihood Estimate
  458.  
  459. What if we haven't seen any training documents with the word **fantastic** in our class **positive** ?
  460.  
  461. In this case, P ( **fantastic** | **positive** ) = 0
  462.  
  463. => This is **BAD**
  464.  
  465. Since we are calculating the overall probability of the class by multiplying individual probabilities for each word, we would end up with an overall probability of **0** for the **positive** class.
  466.  
  467. So how do we fix this issue?
  468.  
  469. We can use a Smoothing Algorithm, for example **Add-one smoothing** (or **Laplace smoothing**).
  470.  
  471. ####Laplace Smoothing
  472.  
  473. We modify our conditional word probability by adding 1 to the numerator and modifying the denominator as such:
  474.  
  475. P ( w<sub>i</sub> | c<sub>j</sub> ) = [ count( w<sub>i</sub>, c<sub>j</sub> ) + 1 ] / [ &Sigma;<sub>w&isin;V</sub>( count ( w, c<sub>j</sub> ) + 1 ) ]
  476.  
  477. This can be simplified to
  478.  
  479. P ( w<sub>i</sub> | c<sub>j</sub> ) = [ count( w<sub>i</sub>, c<sub>j</sub> ) + 1 ] / [ &Sigma;<sub>w&isin;V</sub>( count ( w, c<sub>j</sub> ) ) + |V| ]
  480.  
  481. where |V| is our vocabulary size (we can do this since we are adding 1 for each word in the vocabulary in the previous equation).
  482.  
  483. ####So in Summary, to Machine-Learn your Naive-Bayes Classifier:
  484.  
  485. Given:
  486.  
  487. - an input document
  488. - the category that this document belongs to
  489.  
  490. We do:
  491.  
  492. - increment **N**, the count of total documents we have learned from.
  493. - increment **N<sub>c</sub>**, the count of documents that have been mapped to this category.
  494. - if we encounter new words in this document, add them to our vocabulary, and update our vocabulary size **|V|**.
  495. - update count( w, c ) => the frequency with which each word in the document has been mapped to this category.
  496. - update count( c ) => the total count of all words that have been mapped to this class.
  497.  
  498. So when we are confronted with a new document, we calculate for each class:
  499. ---------------------------------------------------------------------------
  500.  
  501. **P( c )** = N<sub>c</sub> / N
  502.  
  503. => how many documents were mapped to class **c**, divided by the total number of documents we have ever looked at. This is the **overall**, or **prior** probability of this class.
  504.  
  505. Then we iterate through each word in the document, and calculate:
  506.  
  507. **P( w | c )** = [ count( w, c ) + 1 ] / [ count( c ) + |V| ]
  508.  
  509. => the count of how many times this word has appeared in class **c**, plus 1, divided by the total count of all words that have ever been mapped to class **c**, plus the vocabulary size.
  510. This uses Laplace smoothing, so we don't get tripped up by words we've never seen before. This equation is used both for words we **have** seen, as well as words we **haven't** seen.
  511.  
  512. => we multiply each **P( w | c )** for each word **w** in the new document, then multiply by **P( c )**, and the result is the **probability that this document belongs to this class**.
  513.  
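Here is a minimal Python sketch of the whole recipe: the training updates listed above plus Laplace-smoothed classification. One practical tweak not mentioned above: log probabilities are summed instead of multiplying raw probabilities, to avoid numerical underflow on long documents.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    def __init__(self):
        self.total_docs = 0                        # N
        self.docs_per_class = Counter()            # N_c
        self.word_counts = defaultdict(Counter)    # count( w, c )
        self.words_per_class = Counter()           # count( c )
        self.vocabulary = set()                    # V

    def train(self, words, c):
        self.total_docs += 1
        self.docs_per_class[c] += 1
        for w in words:
            self.vocabulary.add(w)
            self.word_counts[c][w] += 1
            self.words_per_class[c] += 1

    def classify(self, words):
        V = len(self.vocabulary)
        scores = {}
        for c in self.docs_per_class:
            score = math.log(self.docs_per_class[c] / self.total_docs)      # log P( c )
            for w in words:
                # Laplace-smoothed P( w | c ), used for seen and unseen words alike.
                p = (self.word_counts[c][w] + 1) / (self.words_per_class[c] + V)
                score += math.log(p)
            scores[c] = score
        return max(scores, key=scores.get)

nb = NaiveBayesClassifier()
nb.train("i love this amazing hilarious movie".split(), "positive")
nb.train("i hate this terrible movie".split(), "negative")
print(nb.classify("an amazing movie".split()))   # positive
```
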
  514. ####Some Ways that we can tweak our Naive Bayes Classifier
  515.  
  516. Depending on the domain we are working with, we can do things like
  517.  
  518. - Collapse Part Numbers or Chemical Names into a single token
  519. - Upweighting (counting a word as if it occurred twice)
  520. - Feature selection (since not all words in the document are usually important in assigning it a class, we can look for specific words in the document that are good indicators of a particular class, and drop the other words - those that are viewed to be **semantically empty**)
  521.  
  522. => If we have a sentence that contains a **title** word, we can upweight the sentence (multiply all the words in it by 2 or 3 for example), or we can upweight the **title** word itself (multiply it by a constant).
  523.  
  524. ##Sentiment Analysis
  525.  
  526. ###Scherer Typology of Affective States
  527.  
  528. - **Emotion**
  529.  
  530. Brief, organically synchronized.. evaluation of a major event
  531. => angry, sad, joyful, fearful, ashamed, proud, elated
  532.  
  533. - **Mood**
  534.  
  535. diffuse non-caused low-intensity long-duration change in subjective feeling
  536. => cheerful, gloomy, irritable, listless, depressed, buoyant
  537.  
  538. - **Interpersonal Stances**
  539.  
  540. Affective stance towards another person in a specific interaction
  541. => friendly, flirtatious, distant, cold, warm, supportive, contemptuous
  542.  
  543. - **Attitudes**
  544.  
  545. Enduring, affectively colored beliefs, disposition towards objects or persons
  546. => liking, loving, hating, valuing, desiring
  547.  
  548. - **Personality Traits**
  549.  
  550. Stable personality dispositions and typical behavior tendencies
  551. => nervous, anxious, reckless, morose, hostile, jealous
  552.  
  553.  
  554. **Sentiment Analysis** is the detection of **attitudes** (2nd from the bottom of the above list).
  555.  
  556. We want to know:
  557. ---------------
  558.  
  559. - The **Holder** (source) of the attitude
  560.  
  561. - The **Target** (aspect) of the attitude
  562.  
  563. - The **Type** of the attitude from a set of types (like, love, hate, value, desire, etc.).
  564. Or, more commonly, simply the weighted polarity (positive, negative, neutral, together with **strength**).
  565.  
  566. ###Baseline Algorithm for Sentiment Analysis
  567.  
  568. Given a piece of text, we perform:
  569.  
  570. - Tokenization
  571. - Feature Extraction
  572. - Classification using different classifiers:
  573.     - Naive Bayes
  574.     - MaxEnt
  575.     - SVM
  576.  
  577. ####Tokenization Issues
  578.  
  579. Depending on what type of text we're dealing with, we can have the following issues:
  580.  
  581. - Dealing with HTML or XML markup
  582. - Twitter Markup (names, hash tags)
  583. - Capitalization
  584. - Phone Numbers, dates, emoticons
  585.  
  586. Some useful code for tokenizing:
  587.  
  588. - Christopher Potts Sentiment Tokenizer
  589. - Brendan O'Connor Twitter Tokenizer
  590.  
  591. ####Classification
  592.  
  593. We will have to deal with handling negation:
  594.  
  595. `I didn't like this movie` **vs** `I really like this movie`
  596.  
  597. ####So, how do we handle negation?
  598.  
  599. One way is to prepend **NOT_** to every word between the negation and the next punctuation mark.
  600.  
  601. E.g. I didn't really like this movie, but ...
  602.  
  603. => I didn't NOT_really NOT_like NOT_this NOT_movie, but ...
  604.  
  605. This doubles our vocabulary, but helps in tokenizing negative sentiments and classifying them.
  606.  
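A minimal sketch of that transformation; the set of negation triggers and the punctuation list are just simple assumptions for illustration:

```python
import re

NEGATIONS = {"not", "no", "never"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def mark_negation(text):
    """Prepend NOT_ to every token between a negation and the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,!?;:]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATIONS or tok.endswith("n't"):
                negating = True
    return " ".join(out)

print(mark_negation("I didn't really like this movie, but ..."))
# i didn't NOT_really NOT_like NOT_this NOT_movie , but . . .
```
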
  607. ####Hatzivassiloglou and McKeown intuition for identifying word polarity
  608.  
  609. - Adjectives conjoined by **and** have the same polarity
  610.  
  611. => Fair **and** legitimate, corrupt **and** brutal
  612.  
  613. - Adjectives conjoined by **but** do not
  614.  
  615. => Fair **but** brutal
  616.  
  617. We can use this intuition to **learn** new adjectives.
  618.  
  619. Imagine we have a set of adjectives, and we have identified the polarity of each adjective. Whenever we see a new word we haven't seen before, and it is joined to an adjective we have seen before by an **and**, we can assign it the same polarity.
  620.  
  621. For example, say we know the polarity of **nice**.
  622.  
  623. When we see the phrase `nice and helpful`, we can learn that the word **helpful** has the same polarity as the word **nice**. In this way, we can learn the polarity of new words we haven't encountered before.
  624.  
  625. So we can expand our **seed set** of adjectives using these rules. Then, as we count the frequency that **but** has occurred between a pair of words versus the frequency with which **and** has occurred between the pair, we can start to build a ratio of **but**s to **and**s, and thus establish a degree of polarity for a given word.
  626.  
  627. ####What about learning the polarity of phrases?
  628.  
  629. - Take a corpus, and divide it up into phrases.
  630.  
  631. Then run through the corpus, and extract the **first two words of every phrase** that matches one of these rules:
  632.  
  633. - 1st word is adjective, 2nd word is noun_singular or noun_plural, 3rd word is **anything**
  634. - 1st word is adverb, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
  635. - 1st word is adjective, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
  636. - 1st word is noun_singular or noun_plural, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
  637. - 1st word is adverb, 2nd word is verb, 3rd word is anything
  638.  
  639. Note: To do this, we'd have to run each phrase through a Part-of-Speech tagger.
  640.  
  641. Then, we can look at how often they co-occur with positive words.
  642.  
  643. - Positive phrases co-occur more with **excellent**
  644. - Negative phrases co-occur more with **poor**
  645.  
  646. But how do we measure co-occurrence?
  647.  
  648. We can use Pointwise Mutual Information:
  649.  
  650. How much more do events **x** and **y** occur than if they were independent?
  651.  
  652. PMI( word<sub>1</sub>, word<sub>2</sub> ) = log<sub>2</sub> { [ P( word<sub>1</sub>, word<sub>2</sub> ) ] / [ P( word<sub>1</sub> ) x P( word<sub>2</sub> ) ] }
  653.  
  654. Then we can determine the polarity of the phrase as follows:
  655.  
  656. **Polarity( phrase )** = PMI( phrase, **excellent** ) - PMI( phrase, **poor** )
  657.  
  658. = log<sub>2</sub> { [ P( phrase, **excellent** ) ] / [ P( phrase ) x P( **excellent** ) ] } - log<sub>2</sub> { [ P( phrase, **poor** ) ] / [ P( phrase ) x P( **poor** ) ] }
  659.  
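Written as code, the polarity score is just a difference of two PMI values; the probabilities are assumed to come from co-occurrence counts (in the original formulation of this method they were estimated from search-engine hit counts):

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: how much more often x and y co-occur
    than they would if they were independent."""
    return math.log2(p_xy / (p_x * p_y))

def phrase_polarity(p_phrase, p_excellent, p_poor, p_phrase_excellent, p_phrase_poor):
    """Polarity( phrase ) = PMI( phrase, excellent ) - PMI( phrase, poor )."""
    return (pmi(p_phrase_excellent, p_phrase, p_excellent)
            - pmi(p_phrase_poor, p_phrase, p_poor))
```
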
  660. Another way to learn polarity (of words)
  661. ----------------------------------------
  662.  
  663. Start with a seed set of **positive** and **negative** words.
  664.  
  665. Then, look them up in a thesaurus, and:
  666.  
  667. - add synonyms of each of the **positive** words to the **positive** set
  668. - add antonyms of each of the **positive** words to the **negative** set
  669.  
  670. - add synonyms of each of the **negative** words to the **negative** set
  671. - add antonyms of each of the **negative** words to the **positive** set
  672.  
  673. and.. repeat.. with the new set of words we have discovered, to build out our lexicon.
  674.  
  675. ###Summary on learning Lexicons
  676.  
  677. - Start with a **seed set** of words ( `good`, `poor`, ... )
  678. - Find other words that have similar polarity:
  679.     - using **and** and **but**
  680.     - using words that appear nearby in the same document
  681.     - using synonyms and antonyms
  682.  
  683. ###Sentiment Aspect Analysis
  684.  
  685. What happens if we get the following phrase:
  686.  
  687. `The food was great, but the service was awful.`
  688.  
  689. This phrase doesn't really have an overall sentiment; it has two separate sentiments: **great food** and **awful service**. So sometimes, instead of trying to tackle the problem of figuring out the overall sentiment of a phrase, we can instead look at finding the **target** of each sentiment.
  690.  
  691. How do we do this?
  692. ------------------
  693.  
  694. => We look at frequent phrases, and rules
  695.  
  696. - Find all **highly frequent** phrases across a set of reviews (e.g. `fish tacos`) => this can help identify the targets of different sentiments.
  697. - Filter these highly frequent phrases by rules like **occurs right after a sentiment word**
  698.  
  699. => `... great fish tacos ...` means that **fish tacos** is a likely target of sentiment, since we know **great** is a sentiment word.
  700.  
  701. Let's say we already know the important aspects of a piece of text. For example, if we are analyzing restaurant reviews, we know that aspects we will come across include **food**, **decor**, **service**, **value**, ...
  702.  
  703. Then we can train our classifier to assign an aspect to a given sentence or phrase.
  704.  
  705. "Given this sentence, is it talking about **food** or **decor** or ..."
  706.  
  707. => This only applies to text where we KNOW what we will come across.
  708.  
  709. So overall, our flow could look like:
  710.  
  711. **Text (e.g. reviews)** --> **Text extractor (extract sentences/phrases)** --> **Sentiment Classifier (assign a sentiment to each sentence/phrase)** --> **Aspect Extractor (assign an aspect to each sentence/phrase)** --> **Aggregator** --> **Final Summary**
  712.  
  713. ##Conditional Models
  714.  
  715. Naive Bayes Classifiers use a joint probability model. We evaluate probabilities P( d, c ) and try to maximize this joint likelihood.
  716.  
  717. => maximizing P( text, class )
  718.  
  719. rather than a conditional probability model
  720.  
  721. -> maximizing P( class | text )
  722.  
  723. If we instead try to maximize the conditional probability of P( class | text ), we can achieve higher accuracy in our classifier.
  724.  
  725. A **conditional** model gives probabilities **P( c | d )**. It takes the data as given and models only the conditional probability of the class.
  726.  
  727. ##MaxEnt Classifiers (Maximum Entropy Classifiers)
  728.  
  729. We define a **feature** as an elementary piece of evidence that links aspects of what we observe ( **d** ), with a category ( **c** ) that we want to predict.
  730.  
  731. So a feature is a function that maps from the space of **classes** and **data** onto a **Real Number** (it has a bounded, real value).
  732.  
  733. &fnof;: C x D --> R
  734.  
  735. Models will assign a **weight** to each feature:
  736.  
  737. - A **positive** weight votes that the configuration is likely correct
  738. - A **negative** weight votes that the configuration is likely incorrect
  739.  
  740. What do features look like?
  741. ---------------------------
  742. Here is an example feature:
  743.  
  744. - &fnof;<sub>1</sub>(c,d) &equiv; [ c = LOCATION & w<sub>-1</sub>="in" & isCapitalized(w) ]
  745.  
  746. This feature picks out from the data cases where the **class** is **LOCATION**, the previous word is "in" and the current word is capitalized.
  747.  
  748. This feature would match the following scenarios:
  749.  
  750. - class = LOCATION, data = "in Quebec"
  751. - class = LOCATION, data = "in Arcadia"
  752.  
  753. Another example feature:
  754.  
  755. - &fnof;<sub>2</sub>(c,d) &equiv; [ c = DRUG & ends(w, "c") ]
  756.  
  757. This feature picks out from the data cases where the **class** is **DRUG** and the current word ends with the letter **c**.
  758.  
  759. This feature would match:
  760.  
  761. - class = DRUG, data = "taking Zantac"
  762.  
  763. Features generally use both the **bag of words**, as we saw with the Naive-Bayes Classifier, as well as looking at adjacent words (like the example features above).
  764.  
  765. ###Feature-Based Linear Classifiers
  766.  
  767. Feature-Based Linear Classifiers:
  768.  
  769. - Are a linear function from feature sets {&fnof;<sub>i</sub>} to classes {c}.
  770. - Assign a weight &lambda;<sub>i</sub> to each feature &fnof;<sub>i</sub>
  771. - We consider each class for an observed datum **d**
  772. - For a pair **(c,d)**, features vote with their weights:
  773.  
  774. **vote(c)** = &Sigma; &lambda;<sub>i</sub>&fnof;<sub>i</sub>(c,d)
  775.  
  776. - Choose the class **c** which maximizes **vote(c)**
  777.  
  778. As you can see in the equation above, the vote is just a weighted sum of the features; each feature has its own weight. So we try to find the class that maximizes the weighted sum of all the features.
  779.  
  780. MaxEnt Models make a probabilistic model from the linear combination &Sigma; &lambda;<sub>i</sub>&fnof;<sub>i</sub>(c,d).
  781.  
  782. Since the weights can be negative, we need to convert the weighted sum into a positive value, because we want to calculate a non-negative probability for a given class. So we exponentiate it:
  783.  
  784. **exp &Sigma; &lambda;<sub>i</sub>&fnof;<sub>i</sub>(c,d)**
  785.  
  786. This way we will always have a positive value.
  787.  
  788. We turn this value into a probability by dividing by the sum of the exponentiated scores over all classes:
  789.  
  790. **[ exp &Sigma; &lambda;<sub>i</sub>&fnof;<sub>i</sub>(c,d) ] / [ &Sigma;<sub>c'&isin;C</sub> exp &Sigma; &lambda;<sub>i</sub>&fnof;<sub>i</sub>(c',d) ]**
  791.  
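A minimal sketch of that computation: sum the weighted features for each class, exponentiate, and normalize (a softmax). The `featurize` function and the example weights are hypothetical stand-ins, not learned values:

```python
import math

def maxent_probs(featurize, weights, classes, datum):
    """P( c | datum ): exp of the weighted feature sum, normalized over all classes."""
    votes = {c: sum(weights.get(name, 0.0) * value
                    for name, value in featurize(c, datum).items())
             for c in classes}
    z = sum(math.exp(v) for v in votes.values())
    return {c: math.exp(v) / z for c, v in votes.items()}

# Indicator features in the spirit of f1 and f2 above.
def featurize(c, datum):
    prev_word, word = datum
    return {
        "f1": int(c == "LOCATION" and prev_word == "in" and word[:1].isupper()),
        "f2": int(c == "DRUG" and word.endswith("c")),
    }

weights = {"f1": 1.8, "f2": 0.3}   # hypothetical lambdas
print(maxent_probs(featurize, weights, ["LOCATION", "DRUG", "OTHER"], ("in", "Arcadia")))
```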
  792.  
  793. ##Named Entity Recognition
  794.  
  795. **Named Entity Recognition (NER)** is the task of extracting entities (people, organizations, dates, etc.) from text.
  796.  
  797. ###Machine-Learning sequence model approach to NER
  798.  
  799. ####Training
  800.  
  801. - Collect a set of representative Training Documents
  802. - Label each token for its entity class, or Other (O) if no match
  803. - Design feature extractors appropriate to the text and classes
  804. - Train a sequence classifier to predict the labels from the data
  805.  
  806. ####Testing
  807.  
  808. - Get a set of testing documents
  809. - Run the model on the document to label each token
  810. - Output the recognized entities