SageTheWizard

final notes cmpsc441

Dec 7th, 2018
MODERN AI
- Machine learning
- LEARNING: adapting / improving one's behavior through experience
-- To measure an improvement.. you need.. ya know.. a way to measure performance
-- You also need a bunch of things to experience
-- Behavior = behavior in a given task
- ex. AI
- put an Agent (soft/hardware) in an environment.. and watch the actions
- environment ---> Agent --<--> action
- ex. Neural Net
- ALVINN - an early self-driving vehicle
- Neural networks mimic the human brain
- problem with this.. is it predicts really well.. but no one really knows why it does
- ex. KDD (Knowledge Discovery from Data)
- OG data mining
- ex. Bayesian Learning
- Spam mail filters, an algorithm that learns what is spam
- Bayes: P(A|X) = (P(A) * P(X|A)) / [Summation over all A'](P(A') * P(X|A'))
- SVM and HMM
- This is a poorly understood field.. no one really knows how it works
- Large Search Space


How Machine Learning Can Be Used
- ex. Netflix - even when it first came out and was a mail system... their recommendations were really good
- Dot Product
- Gets the customer preferences.. Action? Comedy? Sci-fi?
- Gets an expert review of the movie
- uses the dot product to determine if the person would like the movie
- checks a threshold.. if the dot product is below the threshold.. don't recommend (sketch below)
- What if the expert has different preferences compared to the individual?
- you just get the customer to rate it... but without the customer rating it.. how?
- you learn what the customer likes based on the movies they watched
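
A minimal Python sketch of that dot-product idea. The genre axes, numbers, and THRESHOLD are made up for illustration:

import numpy as np

# Hypothetical genre axes: <action, comedy, scifi>
customer_prefs = np.array([0.9, 0.2, 0.7])   # learned from viewing history
movie_profile  = np.array([0.8, 0.1, 0.9])   # expert's rating of the movie
THRESHOLD = 1.0                              # tuning knob, made up here

score = customer_prefs @ movie_profile       # dot product
if score > THRESHOLD:
    print(f"recommend (score={score:.2f})")
else:
    print(f"don't recommend (score={score:.2f})")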

Designing a Learning System
- Learning: improve one's performance (P) at a task (T) through a bunch of experience (E)
- ex. Checkers
  * T: playing checkers
  * P: # wins / # pieces captured (can vary)
  * E: playing games (usually with itself)
-> The experience must represent the future distribution of data
- ie. it must be similar to the real-world environment you're sticking it in
- <instance, label> = "example"
- such as.. attributes of the weather in a day (instance) && whether it rained (label)
- an instance is a vector
h([instanceVector]) = 1 or 0 ... 1 means yes rain, 0 means no rain
DATA MUST BE ACCURATE / REPRESENT THE FUTURE DISTRIBUTION OF DATA OR IT WILL FAIL

HOW TO DESIGN A LEARNING SYSTEM
- Task (T)
- Performance measure (P)
- Experience (E)

ex. <<wind, temp, humidity>, Sunny>
     <---- instance ---->   label
whole thing: EXAMPLE
We assume there exists a function that generates a label from an instance.
That function is called the "target concept".
- You can't get that target concept exactly right.. but you can get a "target approximation"
- the only way to see how close the target approximation is to the target concept
  is to test it against examples
So you need...
a Target Function to learn
a Representation of the Target Function
a Learning Mechanism

ex. Checkers Player
- TargetFunction(BoardConfig) -> BestMove
- This isn't a good idea.. the "best" move now might lead to a bad position later
- TargetFunction(BoardConfig) -> Score
- a board configuration that guarantees a win gets the highest score
- a board configuration that guarantees a loss gets the lowest score
- a board configuration that leads to a winning configuration gets highest - 1

- This is better.. it's more forward-thinking, whereas the previous one isn't

From here.. you choose the TargetFunction that outputs the score as the one you learn :)
- we can't guarantee it's the best.. but we can know it's pretty good

- So.. how to represent the target function?
- Easiest way = a lookup table containing the score of a bunch of configurations
- Decision Tree
- Neural Net - learns the scores

- Let's pick a simple one.. a linear combination - the "hypothesis"
TargetFunction(BoardConfig) = (numKings)*(weightKing) + (numQueens)*(weightQueen) +
                              (numRooks)*(weightRook) ... and so on
- The easy part is counting the pieces.. the hard part is learning the weights
- Everything is learning those weight values
- Clearly, learning the TargetConcept has been reduced to learning the weights
- So how do we do that?

- Learning Mechanisms
- They are going to learn from examples: "training examples"
- in this example.. a training example is a <boardConfig, realLabel>
ex. <<0 1 2 ...>, -100>
      K Q R ...  lost, so low score
- The task has been reduced to finding the weights that best fit the examples
- E = [Summation over all examples](targetConcept - targetFunction)^2
  the summation runs over the set of all examples
- Bigger E = more error.. then you pick new weights to try to reduce the error
But.. how do you figure out how to do that? (sketch below)
  Wi <- Wi + Y * (targetConcept - targetFunction) * numPiecesI
  Y = learning rate
- used to increase / decrease the weights little by little to avoid overshooting
- really small #.. like .001
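
A minimal sketch of that weight-update rule in Python; the feature vector and target score are the made-up checkers example above:

import numpy as np

def lms_update(w, x, target, lr=0.001):
    """One update: w_i <- w_i + lr * (target - prediction) * x_i."""
    prediction = w @ x                 # linear evaluation of the board
    return w + lr * (target - prediction) * x

# Hypothetical board features: piece counts <kings, queens, rooks>
w = np.zeros(3)
w = lms_update(w, x=np.array([0.0, 1.0, 2.0]), target=-100.0)
print(w)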

What we have so far...
- Unknown target function: C : X -> Y
- Training examples generated by the unknown target function: (vector, label)
- Hypothesis space: all possible values for the weights
- The goal of learning is to find the best hypothesis from that space

A learning algorithm takes the examples and the hypothesis space
- searches the hypothesis space to find the best hypothesis
  "FINAL HYPOTHESIS"

LEARNING ALGORITHM: for concept learning using linear combinations
Instance space: X = vectors of attributes, in R^d
Label: Y = {1, -1} <- "CONCEPT LEARNING" is true/false-style learning
Hypothesis space (H) - assuming linear functions
- weighted linear combinations of attributes
  [Summation i=1 to d] Wi*Xi > Q : yes
                             < Q : no
  where Q is a threshold
  h(x>) = yes if ([Summation i=1 to d] Wi*Xi - Q) > 0, no if < 0

H = {w> | w> in R^(d+1)}
x> = <x1, x2, ...> gets extended to <x0 = 1, x1, x2, ...> so the threshold Q folds into w0
h(x>) = sign(w>TRANSPOSE * x>)

Okay.. so how do we use this?

Implementing it using an iterative method

1. Init the weight vector to the 0 vector (w>(1) = 0>) *or some small random vector close to 0
2. Do until learned
     for t = 1 ... |D|   *where D is the example set*
       make a prediction on example t: ^y(t) = sign(w>TRANSPOSE * x>(t))
       w>(t+1) <- w>(t) + z*(y(t) - ^y(t)) * x>(t)   *z = learning rate* *mistake bound*
   Since it's mistake-driven, you can make it more efficient with an if statement:
     if (y(t) != ^y(t)) --> w>(t+1) <- w>(t) + z*y(t) * x>(t)

This is the Perceptron! Works really well if the data is linearly separable! (sketch below)
w0 moves the line up and down, the other weights change the angle of the line
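
A minimal sketch of that perceptron loop in Python, on a made-up toy dataset; lr and epochs are arbitrary:

import numpy as np

def perceptron(X, y, lr=0.1, epochs=50):
    """Mistake-driven perceptron on examples X (rows) with labels y in {1,-1}."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 for the threshold
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xt, yt in zip(X, y):
            if np.sign(w @ xt) != yt:          # only update on a mistake
                w += lr * yt * xt
                mistakes += 1
        if mistakes == 0:                      # separates all examples: learned
            break
    return w

# Toy linearly separable data (made up): label is 1 only for (1,1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
print(perceptron(X, y))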

2 things left to cover this semester
1. PRINCIPLES OF LEARNING
2. ALGORITHMS OF LEARNING


LEARNING AS A GAME -> game = something with precisely well-defined rules
- Supervised Learning
- Perceptron is supervised learning: y(t) (the actual label) is the teacher
- Teacher gives x> (attributes of a problem), learner guesses ^y (label),
  and the teacher returns the error Sum(y - ^y)^2 so the learner can adjust
- Unsupervised Learning
- Clustering: if you don't have labels, you just slam similar things into a category
- How do you use a binary classification algorithm for 3 different classes?
- you can use 3 different Perceptrons!
- one for A vs !A, B vs !B, C vs !C
CONCEPT LEARNING - binary classification
- Concept (C): C : X -> {-1, 1}
- INSTANCE SPACE: X = {x1>, x2>, x3> ...} where the vectors are lists of attributes
- EXAMPLE SPACE: D = {<x1>, y1>, <x2>, y2> ...}
- HYPOTHESIS SPACE: H <- representations of C *the neural net or line that best separates the data*
  *learning is literally just a search algorithm to find the line that best separates the data*

daily reminder: if your assumption is totally wrong, your solution will be really messed up
Based on the search criteria you can have...
- Supervised Learning - learning from labeled examples (classification problems)
- Unsupervised Learning - learning w/o labels (clustering)
- Reinforcement - an agent is in an environment and reacts.. if it's a good reaction
  it is rewarded, if not it is punished

All algorithms / searching can be...
- Online Learning
- ex. it learns on the job.. makes mistakes on the job but learns from them
- Offline Learning
- ex. fraud detection for CC companies.. you have to train it before you put it in use
If mistakes aren't fatal, you can use online.. otherwise offline is the way to go
- Hybrids exist btw.. but still

***IMPORTANT CONCEPT***
OVERFITTING vs. GENERALIZATION
- Remember, the goal of the algorithm is to classify new instances
- The neural network could have a bias towards the examples it learned from
- to test how the net will generalize, hold out 20% of your data to test the final algorithm on
- if it screws up the held-out (20%) data, your algorithm is overfitted
- N data points: 20% - test set, 20% - validation set, 60% - training set
  error types:   test error    validation error    training error
TEST ERROR IS THE IMPORTANT ONE: only measure it once, at the very, very end


DECISION TREE LEARNING
- most medical stuff is this
- learning a discrete-valued target concept
- the learned hypothesis is represented as a tree (called a decision tree)
- How the tree is set up
- root node = attribute to test
- branches are possible values for the tested attribute

ex. Play Tennis?

                Outlook
              /    |     \
         sunny  overcast  rainy
          /        |        \
     Humidity    [YES]      Wind
      /    \               /    \
   high    low        strong    weak
    |       |            |        |
  [NO]    [YES]        [NO]     [YES]

you can look at it logically like this...
(outlook=sunny /\ humidity=low) = YES
(outlook=overcast) = YES
... and so on

QUINLAN
construct the tree from top to bottom
- the root node needs to be the most useful attribute
- the attribute that classifies the most data
- but uh.. how do we find that?
finding the best attribute: Entropy (read: RANDOMNESS)
- Entropy: purity / impurity of the set
- choose the attribute that leaves the lowest entropy
Entropy(S) = -[Summation from i=1 to C] Pi*lg(Pi)   : where C is # of classes
*note, Pi isn't pi, it's P[sub i], the fraction of S in class i
Entropy(S)'s graph is basically an upside-down parabola...
- so if the classes are split 1/2 and 1/2.. entropy becomes 1
  and if there is only 1 class, entropy becomes 0
Basically, you use the attribute that gets the largest entropy reduction

ROOT: Entropy(S)
       /   |   \
     S1    S2   S3
   E(S1) E(S2) E(S3)   weighted average of the E(Si)... information gain = difference between root and here

High entropy at the root = mixed bag; if the next level is low.. then that means the lower attributes
are more pure... higher entropy reduction = better root attribute
So at each level, you pick the attribute with the maximum information gain...

So.. how do you calculate information gain?
Information Gain: expected reduction in entropy caused by partitioning the examples
on an attribute (sketch below)
G(S,A) = E(S) - [Summation over v in values(A)](|Sv|/|S|)*E(Sv)
G = info gain, S = set, A = attribute
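
A minimal Python sketch of those two formulas, checked against the tennis numbers worked below:

import math

def entropy(pos, neg):
    """Entropy(S) = -sum_i Pi*lg(Pi) for a two-class set."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:                              # 0*lg(0) is taken as 0
            p = c / total
            e -= p * math.log2(p)
    return e

def gain(pos, neg, splits):
    """G(S,A) = E(S) - sum_v |Sv|/|S| * E(Sv); splits = [(pos_v, neg_v), ...]."""
    total = pos + neg
    return entropy(pos, neg) - sum((p + n) / total * entropy(p, n) for p, n in splits)

print(round(entropy(9, 5), 3))                          # 0.940
print(round(gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))   # G(S, Outlook) ~ 0.247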

EXAMPLE: Tennis
Day  Outlook   Temp  Humidity  Wind    PlayTennis?
1    Sunny     Hot   High      Weak    No
2    Sunny     Hot   High      Strong  No
3    Overcast  Hot   High      Weak    Yes
4    Rain      Mild  High      Weak    Yes
5    Rain      Cool  Normal    Weak    Yes
6    Rain      Cool  Normal    Strong  No
7    Overcast  Cool  Normal    Strong  Yes
8    Sunny     Mild  High      Weak    No
9    Sunny     Cool  Normal    Weak    Yes
10   Rain      Mild  Normal    Weak    Yes
11   Sunny     Mild  Normal    Strong  Yes
12   Overcast  Mild  High      Strong  Yes
13   Overcast  Hot   Normal    Weak    Yes
14   Rain      Mild  High      Strong  No

E(S) = -9/14*lg(9/14) - 5/14*lg(5/14) = .94   (9 and 5 because of 9 yes and 5 no)
G(S, Outlook)
- values: Sunny, Overcast, Rainy
- Sunny: 2 yes, 3 no --> E(Outlook=Sunny) = -2/5*lg(2/5) - 3/5*lg(3/5) = .971
- Overcast: 4 yes, 0 no --> E(Outlook=Overcast) = -4/4*lg(4/4) - 0/4*lg(0/4) = 0.0
- Rainy: 3 yes, 2 no --> same equation --> .971
- so... G(S, Outlook) = .94 - 5/14*.971 - 0 - 5/14*.971
  G(S, Outlook) reduces entropy by... .246
G(S, Wind)
- values: Weak and Strong
- Weak: 6 yes, 2 no --> E(Wind=Weak) = -6/8*lg(6/8) - 2/8*lg(2/8) = .811
- Strong: 3 yes, 3 no --> E(Wind=Strong) = 1 (no need to calculate since entropy of a 50/50 split = 1)
- so... G(S, Wind) = .94 - 8/14*.811 - 6/14*1
  = .048 .. so G(S, Wind) only reduces entropy by .048 .. not too good
G(S, Humidity)
- values: High and Normal ----> gives .152
G(S, Temperature) = .029

So.. out of all of those, Outlook gains the most entropy

so now.. E(S1) = .971 (S1 = the Sunny branch) .. and you just apply the same process as above to pick the next attribute
G(S1, Wind)   (when outlook is sunny, how much does wind help)
- values: Strong, Weak
- Weak = 1 yes, 2 no .. Strong = 1 yes, 1 no
  E(Wind=Weak) = -1/3*lg(1/3) - 2/3*lg(2/3) = .918
  E(Wind=Strong) = 1 (since it's a 50/50 split)
- so... G(S1, Wind) = .019
G(S1, Humidity)
- values: High and Normal
- High = 0 yes, 3 no .. Normal = 2 yes, 0 no
  E(Humidity=High) = 0
  E(Humidity=Normal) = 0
  so.. G(S1, Humidity) = .971 - 0 - 0 = .971 .. hey, that's a good gain
G(S1, Temp) = .57

So.. we know G(S1, Humidity) is the best, so we choose Humidity....
And you keep repeating the process.. if Chang keeps listing more stuff I'm not typing it, you get the idea

And that's how you build a tree from top to bottom
- just keep in mind.. just because an attribute gives you the best gain under one node...
  doesn't mean it will under others (ie. G(S1, Humidity) is best.. but G(S3, Humidity) might not be)

Train Fact: Temperature doesn't matter in this set
More related Train Fact: these trees can pretty easily get overfit
- So how do you deal with that?
- As the tree grows.. you gotta check if it's overfitted
- but while it's growing.. that's almost impossible
- so how do we do it?
- just let it overfit on all the training data ... and then prune it
- So how do you prune it?
- you use validation data and measure the error
- then you remove a leaf node
- if the error improves .. remove it forever
- else .. keep it forever
- and you just kinda keep doing that until further pruning hurts the error (sketch below)
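
A minimal Python sketch of that reduced-error pruning loop, over a made-up toy tree type (everything here is invented for illustration):

class Node:
    """Tiny tree: a leaf if children is empty; majority = majority class at this node."""
    def __init__(self, attr=None, children=None, majority=0):
        self.attr, self.children, self.majority = attr, children or {}, majority

    def predict(self, x):
        if not self.children:
            return self.majority
        return self.children[x[self.attr]].predict(x)

def error(tree, data):
    return sum(tree.predict(x) != y for x, y in data) / len(data)

def prune(node, root, val):
    """Try collapsing each subtree to a leaf; keep the collapse if validation
    error doesn't get worse, otherwise restore the subtree."""
    for child in node.children.values():
        prune(child, root, val)
    if node.children:
        before = error(root, val)
        saved = node.children
        node.children = {}                 # tentatively collapse to a majority leaf
        if error(root, val) > before:      # got worse on validation: undo
            node.children = saved

# Toy tree splitting on attribute 0; the validation set says the split doesn't help
tree = Node(attr=0, children={0: Node(majority=0), 1: Node(majority=1)}, majority=0)
val = [((0,), 0), ((1,), 0)]
prune(tree, tree, val)
print("pruned to a leaf:", not tree.children)   # True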

EXAM: NEXT THURSDAY: Game -> |algorithms .. so up to but not including algorithms
NEXT ALGORITHM
Bayesian Learning - Bayes' theorem
Chang's story
- Participants.. ~10000 people want to check if they have brain cancer by knocking on their heads
- Fact: 1% of participants actually had cancer
- With Chang's method: 80% of the participants that actually have cancer came up positive
  10% of participants w/o cancer came up positive (false positive)
- Then the FDA approved this method
- so then.. you get a new patient: and they get diagnosed as positive for brain cancer
- but guess what.. the odds are only about 7.5% that they actually have the cancer (sketch below)
P(BrainCancer) = .01        <->  P(~BrainCancer) = .99
P(+ | BrainCancer) = .8     <->  P(- | BrainCancer) = .2
P(+ | ~BrainCancer) = .1    <->  P(- | ~BrainCancer) = .9
so... P(BrainCancer | +) = ?
- note how these look similar to what we have above.. but they're not the same thing!
- P(B [intersect] +) = P(B) * P(+|B) .. or .. P(+) * P(B|+)
- P(B|+) = (P(B) * P(+|B)) / P(+)
*** AND THIS IS BAYES' THEOREM *** .. that equation
Using that equation, you can learn something
P(B)      ~> P(~B)      = 1 - P(B)
P(+ | B)  ~> P(- | B)   = 1 - P(+|B)
P(+ | ~B) ~> P(- | ~B)  = 1 - P(+|~B)
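
A quick Python check of that 7.5% claim, expanding P(+) over both hypotheses:

p_b = 0.01                     # P(BrainCancer)
p_pos_b = 0.8                  # P(+ | BrainCancer)
p_pos_not_b = 0.1              # P(+ | ~BrainCancer)

p_pos = p_pos_b * p_b + p_pos_not_b * (1 - p_b)   # P(+) by total probability
p_b_pos = p_pos_b * p_b / p_pos                   # Bayes' theorem
print(f"P(BrainCancer | +) = {p_b_pos:.3f}")      # ~0.075, ie. 7.5%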

IN TERMS OF MACHINE LEARNING.. BAYES' THEOREM IS...
P(h | D) = (P(D | h) * P(h)) / P(D)
- given particular data (D), what is the probability this hypothesis (h) is correct,
  according to specific examples
P(h) = prior probability of h ,, P(D) = prior probability of D
P(D | h) = probability that D is consistent w/ h

Bayesian learning:
Question: what is the most probable hypothesis, given a particular dataset?
Answer: the hMAP (maximum a posteriori) hypothesis
- the h with maximum probability given the data: P(h | D)
- argmax over h in H .. which h maximizes P(h | D) .. that's the ML solution
  argmax = the argument that maximizes a function
- ex. argmax of (x + 1)^2 - 1 for x in [0, 2] ... x = 2
  max of (same func) ... = 8

the issue is.... you gotta know the 3 values: P(D | h) .. P(h) ... P(D)
- P(D | h) - given an h, what is the probability it classifies D correctly
  Train fact: you can ignore P(D) since it's the same for all h (doesn't affect the argmax over h)
- P(h) - prior probability of a certain hypothesis
  :: gotta have some degree of domain knowledge for this to work out
  ::: creates problems if you don't know the domain .. no prior knowledge.. you have a problem
- if you don't know.. just assume a uniform probability

Another way::: hML (maximum likelihood hypothesis)
- assumes P(h) = CONSTANT for all h in H
- this is what happens when you have no prior knowledge
- in this situation...
- hML = argmax over h of P(h | D) = argmax over h of P(D | h)   ***since P(D) is constant.. and P(h) is constant***


Back to Chang's brain cancer fun
P(B) = .01  =====>  P(~B) = .99
P(+|B) = .8
P(+|~B) = .1
in this example... you have 2 hypotheses .. yes brain cancer.. no brain cancer
P(h) ~~~~> P(B) = .01, P(~B) = .99
So.. we shouldn't use hML.. we should use hMAP because we have domain knowledge
so... compare P(D | h) * P(h)
h = B  ::  P(D | B) * P(B)
  - P(+ | B) * P(B) ~~> .8 * .01 = .008
h = ~B ::  P(D | ~B) * P(~B)
  - P(+ | ~B) * P(~B) ~~> .1 * .99 = .099
.... So what this is saying.. is that a positive label is more likely to come from someone
who doesn't have brain cancer.. meaning the method is bad
More information is better.. so use hMAP over hML whenever you can

So.. instead of trying to ask what the best hypothesis is... why can't we ask..
given a new instance x>, what is the most likely label v in V?
P(v | D) = [Summation over all h in H] P(v|h) * P(h|D)
           = a weighted average of the predictions of all h
- better hypothesis.. bigger weight
this is called hOPT = the Bayes-optimal hypothesis...
- argmax over v of [Summation over all h in H] P(v|h) * P(h | D)
- which v maximizes that expression
so back to Chang's example...
- argmax over v in {+, -} of [Summation over h in {B, ~B}] P(v|h) * P(h | D)
... so
argmax{ [Summation over B, ~B] P(+|h) * P(h | D) ,
        [Summation over B, ~B] P(-|h) * P(h | D) }
so.. using the unnormalized posteriors from above (.008 and .099):
  top (+) = .0163 .. bottom (-) = .0907 (sketch below)
- so the bottom has the higher probability.. so odds are, they are going to be negative for
  brain cancer
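
The same hOPT computation in a few lines of Python, using the unnormalized P(h|D) values from above:

post = {"B": 0.008, "~B": 0.099}          # P(D|h)*P(h), ie. unnormalized P(h|D)
p_pos = {"B": 0.8, "~B": 0.1}             # P(+ | h)

score_pos = sum(p_pos[h] * post[h] for h in post)            # .0163
score_neg = sum((1 - p_pos[h]) * post[h] for h in post)      # .0907
print("prediction:", "+" if score_pos > score_neg else "-")  # "-"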

ANOTHER CLASSIFICATION PROBLEM...
Assume an instance is a conjunction of attribute values...
a1 /\ a2 /\ a3 .... /\ aN
where the ai are (assumed) independent given the class
Q: what is the most probable class of a new instance, given the values of its attributes?
P(v | a1, a2, a3 ... aN)
argmax over v of P(v | a1, a2, a3 ...) ~~~> JUST APPLY BAYES
called vMAP .. LABEL max as opposed to HYPOTHESIS max
Since the attributes are independent.. you can do this in the Bayes:
argmax over v of P(a1 | v) * P(a2 | v) ... * P(aN | v) * P(v) = vNB
- vNB = naive Bayes (used in spam filters)


vNB = argmax over v of P(v) * [Product over i] P(ai | v)
- ex. predicting if a car is stolen (sketch after the table)

COLOR  TYPE  ORIGIN  STOLEN?
R      S     D       Y
R      S     D       N
R      S     D       Y
Y      S     D       N
Y      S     I       Y
Y      V     I       N
Y      V     I       Y
Y      V     D       N
R      V     I       N
R      S     I       Y
S = Sedan, V = SUV, D = Domestic, I = Import
V = {Yes, No} ... we're looking at a red domestic SUV
vNB = argmax over v of P(v) * P(Red | v) * P(Dom | v) * P(SUV | v)
v = y: P(Y) * P(R | Y) * P(D | Y) * P(SUV | Y) = 1/2 * 3/5 * 2/5 * 1/5 = .024
v = n: same w/ N = 1/2 * 2/5 * 3/5 * 3/5 = .072
Answer: NO... it's more likely to not get stolen (ie.. ARGMAX(.024, .072) picks the biggest)
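
A minimal naive Bayes sketch in Python that counts the conditional probabilities straight from that table:

# (color, type, origin, stolen?) rows from the table above
rows = [("R","S","D","Y"), ("R","S","D","N"), ("R","S","D","Y"), ("Y","S","D","N"),
        ("Y","S","I","Y"), ("Y","V","I","N"), ("Y","V","I","Y"), ("Y","V","D","N"),
        ("R","V","I","N"), ("R","S","I","Y")]

def v_nb(instance):
    scores = {}
    for v in ("Y", "N"):
        cls = [r for r in rows if r[3] == v]
        score = len(cls) / len(rows)                 # P(v)
        for i, a in enumerate(instance):             # * product of P(ai | v)
            score *= sum(r[i] == a for r in cls) / len(cls)
        scores[v] = score
    return max(scores, key=scores.get), scores

print(v_nb(("R", "V", "D")))   # red domestic SUV -> 'N' (scores ~ .024 vs .072)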

SELF-DELIMITING PREFIX-FREE ENCODING
- encodes messages
- expected length relates to -[Summation] Pi*lg(Pi)
say we have .. Z1 = Hi, Z2 = Hungry?, Z3 = Thirsty?, Z4 = Bye
this way .. if both parties agree on the encoding method you can use very few bits
Z1 = 0, Z2 = 1, Z3 = 10, Z4 = 11 ..
- creates problems.. when you receive the bit "1", you have to wait for the next message
- we want a PREFIX-FREE encoding to avoid this problem
- if the receiver gets a 1.. they've gotta wait .. a 0 dictates the end of a codeword
- so.. Z1 = 0, Z2 = 10, Z3 = 110, Z4 = 111
- so you know to stop listening when you get a 0 (111 is the only 3-bit code, so it ends there)
- this is the ideal method to achieve the minimum number of bits
- we need to determine which messages are used most...
  .. so if Z4 (Bye) is used most, make Z4 = 0
- "Description Length": Li (length of message i) = -lg(Pi), where Pi = probability of message i
- if the probability is low, you can use a long code,
  if it's high.. use few bits (sketch below)
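
A tiny Python sketch of that description-length idea Li = -lg(Pi), with made-up message probabilities:

import math

probs = {"Bye": 0.5, "Hi": 0.25, "Hungry?": 0.125, "Thirsty?": 0.125}  # made up
for msg, p in probs.items():
    print(f"{msg:9s} P={p:<6} ideal length = {-math.log2(p):.0f} bits")
# The most probable message gets the shortest code; expected length = entropy = 1.75 bits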

hMAP = argmax over h in H of P(D | h) * P(h)
... we use this 'cause we know the prior probability of each hypothesis

NEURAL NETS:
(this is our last topic)
*Basically a bunch of Perceptrons*
They're trying to simulate how a brain actually works
A Perceptron works well as long as the data is linearly separable
- ie.. 1 or 0 as the response, with weights on x0, x1, x2
PERCEPTRON RULE: Wi <- Wi + Y*(t - o)*xi
  where Y is the learning rate
well.. what happens when the data isn't linearly separable?
- the line keeps moving.. (if the data is non-separable) it doesn't really converge to correct
  weights

Well, how do we solve this?
- calculate an error for a particular set of W's (weights)
  E(w>) = 1/2 * [Summation over d in D](td - od)^2
so we want to find the w> at the global minimum of the error (picture a 3D parabola / bowl)
- so we travel along the error surface until we get the lowest error (downhill climbing)
- this is called the "GRADIENT DESCENT ALGORITHM" (sketch below)
*we need the gradient vector of E(w>),
  which is ▽E(w>) = [dE/dw1, dE/dw2, ..., dE/dwn]
so..
  w> <- w> - Y * ▽E(w>)
---- the DELTA RULE *training via gradient descent*
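
A minimal gradient-descent / delta-rule sketch in Python for a linear unit, on a made-up dataset; lr and epochs are arbitrary:

import numpy as np

def delta_rule(X, t, lr=0.05, epochs=200):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2, linear unit o = Xw."""
    X = np.hstack([np.ones((len(X), 1)), X])   # x0 = 1 absorbs the threshold
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                              # outputs for every example
        grad = -(t - o) @ X                    # gradient of E(w) over all examples
        w -= lr * grad                         # w <- w - Y * gradient
    return w

# Made-up targets generated by roughly t = 1 + 2*x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.1, 2.9, 5.2, 6.9])
print(delta_rule(X, t))                        # roughly [1, 2]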

But.. we have a problem.. if we actually wanna separate some weirdly distributed data
we can't just have lines (the ▽E(w>) method will still just give us a bunch of lines)
How do we solve THAT problem?
- remember the threshold unit (the steppy-looking wave)
- we can't have that.. it's not a continuous function
Well.. we have the SIGMOID FUNCTION
***** S(y) = 1/(1 + e^-y) *****
.. we can control the stiffness of the function to get it very close to a step function
The sigmoid does something neat when you take the derivative:
dS(y)/dy = d/dy( 1/(1+e^-y) )
         = d/dy( (1 + e^-y)^-1 )
         = -(1 + e^-y)^-2 * (-e^-y)
         = e^-y / (1 + e^-y)^2
         = 1/(1+e^-y) * ((1 + e^-y - 1) / (1+e^-y))
         = S(y) * (1 - S(y))   <--- so the derivative is really easy to calculate
so .. ex... *remember.. x0 is always 1*
w0 = -30, w1 = 20, w2 = 20
  x1  x2   w>*x>   S(w>*x>)
  0   0    -30     0
  0   1    -10     0
  1   0    -10     0
  1   1     10     1
note.. S(w>*x>) is effectively an AND function

ex 2.
w0 = 10, w1 = -20, w2 = -20
  x1  x2   w>*x>   S(w>*x>)
  0   0     10     1
  0   1    -10     0
  1   0    -10     0
  1   1    -30     0
this is effectively a NOR function
btw, if you change the weights from the -20s to 20s (and w0 to -10).. you get an OR function

ex 3. XNOR (so.. note, XNOR is an OR of an AND and a NOR)
  x1  x2   XOR  XNOR  AND  NOR
  0   0    0    1     0    1
  0   1    1    0     0    0
  1   0    1    0     0    0
  1   1    0    1     1    0

AND: w0 = -30, w1 = 20, w2 = 20
NOR: w0 = 10, w1 = -20, w2 = -20
OR layer on top: feed AND's output in as x1 and NOR's output in as x2, with w0 = -10, w1 = 20, w2 = 20
and that gets you an XNOR which will work (sketch below)
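
A Python sketch of those sigmoid gates and the two-layer XNOR composition, using exactly the weights above:

import math

def sigmoid(y):
    return 1 / (1 + math.exp(-y))

def unit(w0, w1, w2):
    """A single sigmoid unit over inputs x1, x2 (x0 is always 1)."""
    return lambda x1, x2: sigmoid(w0 + w1*x1 + w2*x2)

AND = unit(-30, 20, 20)
NOR = unit(10, -20, -20)
OR  = unit(-10, 20, 20)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    xnor = OR(AND(x1, x2), NOR(x1, x2))   # second layer: OR of AND and NOR
    print(x1, x2, round(xnor))            # 1, 0, 0, 1 -- an XNOR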

More NN
- first is the input layer, then the next layer (all fully connected), and you keep stacking layers
- each node is a summation fed through a sigmoid function.. and the last layer is the output layer
- layers in between the input and output layers are "hidden layers"

ex. Finding a cat / dog / horse in a picture
- each pixel is fed into the input layer and flows to the output layer
- the output layer spits out 0-1 for each node, eg. (.8, .1, .1) as a vector of results
- so since cat would be .8 .. odds are there's a cat in the picture
- BUT WAIT! there are a lot of pixels
- ya gotta teach the NN to compress the picture first
- the first layer has 1 million nodes
- the next layer has, say, 100k nodes
- and the next has 10k
- and then it starts to go back up
- so basically.. it reduces the pixels, and then expands them back; once it gets good at compressing
  in a relatively lossless way
- you cut that network in half and just use the compression half to feed the small image into
  the cat / dog / horse NN

FEED-FORWARD + BACK PROPAGATION
x --w1-->[Sigma]-- x*w1 = n1 -->[sigmoid]-- S(n1) = y -->[Sigma]-- y*w2 = n2 -->[sigmoid]-- S(n2) = O(x)
where O(x) = z <--- the prediction, and t is the target value
E = 1/2 * (z - t)^2   <-------- error value, so ya gotta minimize this
... so now you have this error value, and you back-propagate
now you need to measure the rate of change of E w.r.t. w2 (since we touch w2 first in back prop)
d(E)/d(w2) .. but you can't directly measure the change of E with respect to w2
.. but you can with n2
.. d(n2)/d(w2) .. if you change w2.. how much will n2 change?
... soooo .. d(E)/d(w2) = d(E)/d(S(n2)) * d(S(n2))/d(n2) * d(n2)/d(w2)
             d(E)/d(w2) = (z - t) * S(n2)*(1 - S(n2)) * S(n1)
so now.. instead of crazy computations.. we just have some simple math
to get the error change with respect to changing w2
..now for w1 ..
d(E)/d(w1) = d(E)/d(z) * d(z)/d(n2) * d(n2)/d(y) * d(y)/d(n1) * d(n1)/d(w1)
           = (z - t)   * S(n2)*(1 - S(n2)) * w2 * S(n1)*(1 - S(n1)) * x
the first two factors were already calculated for d(E)/d(w2).. the rest is the only part that
actually needs calculating

as you add layers / more nodes in each layer, you start needing to
take derivatives of vectors with respect to the error value.. but it's the same concept

NN ALGORITHM:
1. Init WEIGHTS to a random matrix: WEIGHTS = [w1> w2> ... wN>]
   (this represents the whole NN's weights)
2. Do until learned:
   2.1 -> feed forward
   2.2 -> calculate E (1/2 * (z - t)^2)
   2.3 -> back-prop E and adjust WEIGHTS (sketch below)
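
A minimal Python sketch of that two-weight chain (one input, one hidden sigmoid, one output sigmoid), hand-coding the derivatives derived above; the training pair and learning rate are made up:

import math

def S(y):                                   # sigmoid
    return 1 / (1 + math.exp(-y))

w1, w2, lr = 0.5, 0.5, 0.5                  # made-up initial weights / learning rate
x, t = 1.0, 0.0                             # one made-up training pair

for _ in range(1000):
    n1 = x * w1; y = S(n1)                  # feed forward
    n2 = y * w2; z = S(n2)
    dE_dw2 = (z - t) * S(n2)*(1 - S(n2)) * S(n1)                      # as derived above
    dE_dw1 = (z - t) * S(n2)*(1 - S(n2)) * w2 * S(n1)*(1 - S(n1)) * x
    w1 -= lr * dE_dw1; w2 -= lr * dE_dw2    # gradient step on both weights
print(f"E = {0.5 * (z - t)**2:.4f}")        # error shrinks toward 0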

btw.. no one implements their own NN unless they're just trying to be cool.. just use TensorFlow
BUT WAIT... what does "learned" mean.. when do you stop?
- Now remember kids! split your data into 60% training, 20% validation, 20% test
- and you know when to stop based on the 20% validation

ERROR
|*                          # E_validation
| *                  #    #
|  *             #
|   *        # <------- d(E_validation)/d(Epoch) = 0 ... so stop at the EPOCH here
|     *
|         *   E_train
-------------------*------- EPOCH

so.. once your E_validation is at a minimum.. record the epoch where it occurs
... then retrain until that epoch
... then once it's done that.. test it on the 20% test set.. and boom, you have your
learned AI (the % correct on the 20% test data is how good it is)...
oh, and if your test % sucks, you gotta rethink your model



Support Vector Machine (SVM)
- originally made by some Russian dude (Vapnik) and kinda got jacked for general ML use
- Suppose you have a bunch of linearly separable data
- there are an infinite number of lines that will work.. perceptron will just find one of them
BUT I WANT THE BEST ONE!
- you need to choose the line that will play nicely with potentially new data
- and new data that fits a certain classification will be near data of that same
  classification (like k-Nearest Neighbors)
- you want a large margin between the line of separation and the data,
  that's the ideal method
- so effectively, on top of looking at error, you look for the line with the largest margin
FINDING THAT LINE
- you need 2 points on one side (to make the first line)
  and a 3rd point on the other side (of the other data) to make the parallel line
- and the middle of the 2 parallel lines is the good line
.. so top line == Wx + b = 1, middle (good line) == Wx + b = 0, bottom line == Wx + b = -1
so to get the width of the margin ... width = (W> / ||W>||) . (x2> - x1>)
so you wanna find the W that maximizes that width (. = dot product btw)
.. which works out to minimizing ||w>||^2
..... SO! SVM: minimize(||w>||^2)
  subject to.. Yi*(<w, xi> + b) >= 1 for all i
.. btw no one implements this by hand.. to optimize it you need a modified Lagrangian.. "KKT"
but yeah, just use a library

SVM: classification: F(x) = sign(<w, x> + b)
Training: minimize 1/2*||w||^2
  over w, b <-- trying to get the w and b to use in the classification
  s.t. Yi*(<w, xi> + b) >= 1

USING KKT -> classification now equals ===== f(x) = sign([Summation over i] ai*yi*<x, xi> + b)
xi = a training example... x is the new instance (dot-product them), then multiply by yi and ai, and add b
a (alpha) is the Lagrange multiplier ... and remember...
there could be millions of xi's .. but... how is this better? .. a lot of the ai's are 0,
meaning the calculation is pretty quick
only the points on the gutter (the margin lines) have non-zero ai
- those points are called "support vectors"
hence the name... SUPPORT VECTOR MACHINE (sketch below)
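
Since the notes say to just use a library, a minimal scikit-learn sketch (assuming sklearn is installed; the toy data is made up):

from sklearn import svm
import numpy as np

# Made-up linearly separable 2D data
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = svm.SVC(kernel="linear", C=1e6)     # very large C ~ hard margin
clf.fit(X, y)
print(clf.support_vectors_)               # only the points on the gutter
print(clf.predict([[2, 2], [0.5, 0.5]]))  # sign(<w,x> + b) for new instances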

KERNEL TRICKS
- very powerful and efficient compared to NNs
.. suppose you wanna calculate [Phi](a) . [Phi](b) explicitly <-- disgustingly slow in high dimensions
- you can classify potentially non-linearly-separable data using a transformation going from, say, R2 --> R3
- you just gotta figure out how the dot product behaves in the new feature space
- once you do that .. you don't even need to calculate what [Phi](x) exactly is .. just dot-product that stuff with a kernel (sketch below)
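
A classic Python sketch of the trick, under a standard textbook assumption: for the map [Phi](<a1,a2>) = <a1^2, sqrt(2)*a1*a2, a2^2> from R2 to R3, the feature-space dot product equals (a . b)^2, so you never have to build [Phi] at all:

import numpy as np

def phi(v):                       # explicit R2 -> R3 feature map
    return np.array([v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])

def kernel(a, b):                 # the trick: same value straight from R2
    return (a @ b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(phi(a) @ phi(b))            # 121.0
print(kernel(a, b))               # 121.0 -- identical, no phi needed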