- MODERN AI
- - Machine Learning
- - LEARNING: Adapting / improving one's behavior through experience
- -- To measure an improvement.. you need .. ya know.. a way to measure performance
- -- You also need a bunch of things to experience
- -- Behavior = behavior in a given task
- - ex. AI
- - put an Agent (soft/hardware) in an environment .. and see the action
- - environment ---> Agent --<--> action
- - ex. Neural Net
- - ALVINN - first self driving vehicle
- - Neural network mimics the human brain
- - problem with this.. is it predicts really well.. but no one really knows why it does
- - ex. KDD (Knowledge Discovery from Data)
- - OG datamining
- - ex. Bayesian Learning
- - Spam mail filters, algorithm that learns what is spam
- - Bayes rule: P(A|X) = (P(A) * P(X|A)) / [Summation over all A'](P(A') * P(X|A'))
- - SVM and HMM
- - This is a poorly understood field.. no one really knows how it works
- - Large Search Space
- How Machine Learning Can Be Used
- - ex. Netflix - even when it first came out and was a mail system... their recommendations were really good
- - Dot Product
- - Gets the customer preferences.. Action? Comedy? Scifi?
- - Gets an expert review of the movie..
- - used the dot product to determine if the person would like the movie
- - checks a threshold.. if the threshold is > the dot product.. don't recommend
- - What if the expert has different preferences compared to the individual?
- - ideally you'd use the customer's own ratings ... but without the customer rating everything .. how?
- - you learn what the customer likes based off of the movies they watched
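- a minimal Python sketch of that dot-product recommender (the genre names, preference numbers, and threshold are made up for illustration):

    def recommend(customer_prefs, movie_features, threshold):
        """Recommend the movie when the dot product clears the threshold."""
        score = sum(p * f for p, f in zip(customer_prefs, movie_features))
        return score > threshold

    # genres: <action, comedy, sci-fi>
    customer = [0.9, 0.1, 0.7]   # learned from the movies they watched
    movie    = [0.8, 0.0, 0.9]   # expert's review of the movie
    print(recommend(customer, movie, threshold=1.0))  # True -> recommend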
- Designing a Learning System
- - Learning: Improve one's performance(P) at a task(T) through a bunch of experience(E)
- - ex. Checkers
- * T: Playing Checkers
- * P: # Wins / # pieces captured (can vary)
- * E: Playing a game (usually with itself)
- -> Must Represent future distribution of data
- - ie. it must be similar to the real world environment you're sticking it in
- - <instance, label> = "Example"
- - such as.. attributes of the weather in a day (instance) && whether it rained (label)
- - instances are vectors
- h([instanceVector]) = 1 or 0 ... 1 means yes rain, 0 means no rain
- DATA MUST BE ACCURATE / REPRESENT FUTURE DISTRIBUTION OF DATA OR IT WILL FAIL
- HOW TO DESIGN A LEARNING SYSTEM
- - Task (T)
- - Performance measure (P)
- - Experience (E)
- ex. < <wind,temp,humidity>, Sunny >
-          instance            label
- the whole thing: EXAMPLE
- we assume there exists a function, that generates a label based on an instance
- that function is called "Target Concept"
- - You can't get that target concept correct.. but you can get a "Target Approximation"
- - only way we can see how close the Target Approximation is to the Target Concept ..
- you need to test it against an example
- So you need....
- Target Function to Learn
- Representation of Target Function
- Learning Mechanism
- ex. Checker Player
- - TargetFunction(BoardConfig) -> BestMove
- - This isn't a good idea .. best move might result in a bad move later
- - TargetFunction(BoardConfig) -> Score
- - find a board configuration that guarantees a win ... that gets the highest score
- - find a board configuration that guarantees a loss .. that gets the lowest score
- - finding a board configuration that guarantees the winning configuration .. gets highest - 1
- - This is better .. it's more forward thinking.. whereas the previous one isn't
- From here.. you choose the TargetFunction that outputs the score to be the one you learn :)
- - we can't guarantee it's the best.. but we can know it's pretty good
- - So.. how to represent the target function?
- - Easiest Way = A Look Up Table containing the score of a bunch of configurations
- - Decision Tree
- - Neural Net - Learns the scores
- - Lets pick the easiest way.. a Linear Combination - "Hypothesis"
- TargetFunction(BoardConfig) = (numKings)*(weightKing) + (numQueens)*(weightQueen) +
- (numRooks)*(weightRook) .... and so on
- - Easy part is counting the pieces.. the hard part is learning the weights
- - Everything is learning those weight values
- - Learning the TargetConcept has been reduced to learning the Weights
- - So how do we do that?
- - Learning Mechanisms
- - They are going to learn from Examples: "Training Examples"
- - in this example.. it's a <boardConfig, realLabel>
- ex. < <0 1 2 ....>, -100 >
-        K Q R ....   Lost Score
- - Task has been reduced to finding the Weights that best fit the examples
- - E = [Summation over all examples](targetConcept - targetFunction)^2
- - Bigger E.. more error, then you make new Weights to try to reduce the Error
- But.. how do you figure out how to do that?
- WeightI <- WeightI + Y * (TargetConcept - TargetFunction) * numPieceI
- Y = Learning Rate
- - Used to increase / decrease each weight little by little to avoid overshooting
- - Really small # .. like .001
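- a minimal Python sketch of that weight-update rule (the starting weights and the example values are made up for illustration):

    def evaluate(weights, piece_counts):
        """TargetFunction(BoardConfig): weighted linear combination of piece counts."""
        return sum(w * n for w, n in zip(weights, piece_counts))

    def update(weights, piece_counts, target_score, learning_rate=0.001):
        """One step: nudge each weight to shrink (target - prediction)."""
        error = target_score - evaluate(weights, piece_counts)
        return [w + learning_rate * error * n for w, n in zip(weights, piece_counts)]

    weights = [0.0, 0.0, 0.0]        # <weightKing, weightQueen, weightRook>
    config, label = [0, 1, 2], -100  # the <boardConfig, realLabel> example above
    weights = update(weights, config, label)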
- What we have so far...
- Unknown Target Function: C : X -> Y
- Training Examples generated by the Unknown Target Function: (vector, label)
- Hypothesis Space: all possible values for the weight values
- -The goal of the learning is to find the best values from the space
- A Learning Algorithm will take the Examples and the Hypothesis space
- - Searches Hypothesis space to find the best hypothesis
- "FINAL HYPOTHESIS"
- LEARNING ALGORITHM: for Concept Learning using Linear Combinations
- Instance Space (X) = <vectors of attributes> in R^d
- Label: Y = {1,-1} <- "CONCEPT LEARNING" true/false sorta learning
- Hypothesis Space (H) - Assuming Linear Functions
- - Weighted Linear Combinations of attributes
- [Summation i=1..d] Wi*Xi > Q : yes
-                          < Q : no
- where Q is a threshold
- H(x) = ([Summation i=1..d] Wi*Xi - Q) > 0 : yes
-                                       < 0 : no
- H = {w> | w> in R^(d+1)}
- x> = <x1, x2> ... augmented: x> = <x0 = 1, x1, x2> (so w0 can absorb the threshold Q)
- H(x>) = sign(w>TRANSPOSE * x>)
- Okay.. so how do we use this?
- Implementing it using an Iterative Method
- 1. Init weight vector 1 as the 0 vector (w>(1) = 0>) *or some other small random vector close to 0
- 2. Do Until Learned
-      for t = 1 .... |D|   *where D is the example set*
-        Make a prediction on example t: ^y(t) = sign(w>(t)TRANSPOSE * x>(t))
-        w>(t+1) <- w>(t) + z*(y(t) - ^y(t)) * x>(t)   *z = learning rate  *t+1 is the updated w>  *mistake bound*
- Since it's mistake bound, you can make it more efficient with an if statement
-   if (y(t) != ^y(t)) --> w>(t+1) <- w>(t) + z*y(t) * x>(t)
- This is Perceptron! Works really well if the solution is linearly separable!
- w0 moves the line up and down, the other weights change the angle of the line
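- a minimal Python sketch of that perceptron loop (the toy OR dataset and learning rate are made up for illustration):

    def sign(v):
        return 1 if v > 0 else -1

    def perceptron(examples, z=0.1, max_epochs=100):
        d = len(examples[0][0])
        w = [0.0] * (d + 1)                      # w[0] is the bias weight (x0 = 1)
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in examples:
                xa = [1.0] + list(x)             # augmented instance
                y_hat = sign(sum(wi * xi for wi, xi in zip(w, xa)))
                if y_hat != y:                   # mistake-bound update
                    w = [wi + z * y * xi for wi, xi in zip(w, xa)]
                    mistakes += 1
            if mistakes == 0:                    # converged (linearly separable)
                break
        return w

    examples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # OR
    print(perceptron(examples))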
- 2 Things left to cover this semester
- 1. PRINCIPLE OF LEARNING
- 2. ALGORITHM OF LEARNING
- LEARNING AS A GAME -> Game = something with precisely well defined rules
- - Supervised Learning
- - Perceptron is Supervised Learning: the y(t), (actual label) is the teacher
- - Teacher gives x> (attributes of a problem), Learner guesses ^y (label)
- and Teacher returns error with Sum(y - ^y)^2 so the learner can adjust
- - Unsupervised Learning
- - Clustering: If you don't have a label, you just slam similar things into a category
- - How do you use Binary classification algorithm for 3 different classifications?
- - you can use 3 different Perceptrons!
- - one for A and !A, B and !B, C and !C
- CONCEPT LEARNING - Binary Classifications
- - Concept(C) C:X -> {-1,1}
- - INSTANCE SPACE: X = {x1>, x2>, x3> ....} vectors are a list of attributes
- - Example Space: D = {<x1>, y1> , <x2>, y2> ...}
- - HYPOTHESIS SPACE: H <- representation of C *the Neural net or line to best separate data*
- *literally just a search algorithm to find that line that best separates the data*
- daily reminder: if your assumption is totally wrong, your solution will be really fucked
- Based on Search Criteria you can have....
- - Supervised Learning - Learning from Examples (classification problems)
- - Unsupervised learning - Learning w/o Examples (Clustering)
- - Reinforcement - Agent is in an Environment, then it reacts .. if it's a good reaction
-   it is rewarded, if not it is punished
- All algorithms / searching can be...
- - Online Learning
- - ex. it learns on the job.. makes mistakes on the job but learns from it
- - Offline Learning
- - ex. Fraud Detection for CC companies.. you have to train it before you put it in use
- if Mistakes aren't fatal, you can use Online.. otherwise offline is the way to go
- - Hybrids Exist btw.. but still
- ***IMPORTANT CONCEPT***
- OVERFITTING vs. GENERALIZATION
- - Remember, the goal of the algorithm is to classify instances
- - The Neural Network could have a bias towards the examples it learned from
- - to test how the net will work, exclude 20% of your data to test the final algorithm on
- - if it fucks up on the new (20%) data, your algorithm is overfitted..
- - N Data: 20% - Test Set, 20% - Validation Set, 60% - Training Set
-   each split gets its own error: test error, validation error, training error
- TEST ERROR IS IMPORTANT ONE: only test it once at the very very end
- DECISION TREE LEARNING
- - most medical stuff is this
- - Learning discrete valued target concept
- - learned hypothesis is represented as a tree (called decision tree)
- - How the tree is set up
- - Root Node = attribute to test
- - Branches are possible values for tested attribute
- ex. Play Tennis?
- Outlook
- / | \
- sunny overcast rainy
- / | \
- Humidity [yes] wind
- / \ / \
- high low strong weak
- | | | |
- [NO] [yes] [NO] [YES]
- you can look at it logically like this...
- (outlook=sunny /\ humidity=Low) = Yes
- (outlook=overcast) = YES
- ... and so on
- QUINLAN
- construct a tree from the top to bottom
- - the root node, needs to be the most useful attribute
- - the attribute classifies the most data
- - but uh.. how do we do that?
- finding the best attribute: Entropy (read: RANDOMNESS)
- - Entropy: Purity / impurity of the set
- - choose attribute that gets the lowest entropy
- Entropy(S) = -[Summation from i=1 to C] Pi*lg(Pi)  :where C is # of classes
- *note, Pi isn't pi, it's P[sub i] = the proportion of class i in S
- Entropy(S)'s graph (for 2 classes) looks like an upside-down parabola...
- - so if each class is 1/2 and 1/2 .. entropy becomes 1
-   and if there is only 1 class, entropy becomes 0
- Basically, you use the attribute that gives the largest entropy reduction (information gain)
- D ROOT: Entropy(Set)
- / | \
- S1 S2 S3
- E(S1) E(S2) E(S3)   weighted average of the E(Si): Information gain = difference between root and here
- High entropy at Root = mixed bag; if the next level is low .. then that means the lower attributes
-   are more pure... Higher entropy reduction = better Root Attribute
- So at each level, you need to ensure you get the maximum information Gain...
- So.. How do you calculate information gain?
- Information Gain: Expected reduction in entropy caused by partitioning the examples
- based on an attribute
- G(S,A) = E(S) - [Summation over v in values(A)] (|Sv|/|S|) * E(Sv)
- G = info gain, S=Set, A = attribute
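- a minimal Python sketch of those two formulas (the attribute indexing is made up; rows would follow the tennis table below):

    from math import log2
    from collections import Counter

    def entropy(labels):
        """E(S) = -sum over classes of Pi * lg(Pi)."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """G(S,A) = E(S) - sum over values v of (|Sv|/|S|) * E(Sv)."""
        n = len(labels)
        gain = entropy(labels)
        for value in set(row[attr] for row in rows):
            sv = [y for row, y in zip(rows, labels) if row[attr] == value]
            gain -= (len(sv) / n) * entropy(sv)
        return gain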
- EXAMPLE: Tennis
- Day  Outlook   Temp  Humidity  Wind    PlayTennis?
- 1    Sunny     Hot   High      Weak    No
- 2    Sunny     Hot   High      Strong  No
- 3    Overcast  Hot   High      Weak    Yes
- 4    Rain      Mild  High      Weak    Yes
- 5    Rain      Cool  Normal    Weak    Yes
- 6    Rain      Cool  Normal    Strong  No
- 7    Overcast  Cool  Normal    Strong  Yes
- 8    Sunny     Mild  High      Weak    No
- 9    Sunny     Cool  Normal    Weak    Yes
- 10   Rain      Mild  Normal    Weak    Yes
- 11   Sunny     Mild  Normal    Strong  Yes
- 12   Overcast  Mild  High      Strong  Yes
- 13   Overcast  Hot   Normal    Weak    Yes
- 14   Rain      Mild  High      Strong  No
- E(S) = -9/14*lg(9/14) - 5/14*lg(5/14) = .94 (9 and 5 because of 9 yes and 5 no)
- G(S, Outlook)
- - Sunny, Overcast, Rainy
- - Sunny: 2 yes, 3 no --> E(Outlook=Sunny)= -2/5log(2/5) - 3/5log(3/5)
- = .971
- - OverCast: 4 yes, 0 no --> E(outlook=overcast) = -4/4log(4/4) - 0/4log(0/4)
- = 0.0
- - Rainy: 3 yes, 2 no --> equation --> .971
- - so ... G(S,outlook) = .94 - 5/14*.971 - 4/14*0 - 5/14*.971
-   G(S,outlook) = reduces entropy by... .246
- G(S, Wind)
- - Weak and Strong
- - Weak: 6 yes, 2 no --> E(Wind=Weak) = -6/8*lg(6/8) - 2/8*lg(2/8)
-                                      = 0.811
- - Strong: 3 yes, 3 no --> E(Wind=Strong)
-     = 1 (no need for calculation since we know entropy of a 1/2, 1/2 split = 1)
- - so... G(S, Wind) = .94 - 8/14*.811 - 6/14*1
-     = .048 .. so G(S, Wind) reduces entropy by .048 .. not too good
- G(S, Humidity)
- - High and Normal ----> produces .152
- G(S, Temperature) = .029
- So.. out of all of those, Outlook gives the largest information gain
- so now.. E(S1) = .971 .. and you just apply the same process as above to pick the next attribute
- G(S1, Wind) (when outlook is sunny, what does wind impact)
- - Strong, Weak
- - Weak = 1 yes 2 no .. Strong = 1 yes 1 no
-   E(Wind=Weak) = -1/3*lg(1/3) - 2/3*lg(2/3) = .918
-   E(Wind=Strong) = 1 (since it's a 1/2, 1/2 split)
- - so... G(S1, Wind) = .971 - 3/5*.918 - 2/5*1 ≈ .02
- G(S1, Humidity)
- - High and Normal
- - High = 0 yes 3 no .. Normal = 2 yes 0 no
- E(Humidity=high) = 0
- E(humidity=normal) = 0
- so.. G(S1, humidity) = .971 - 0 - 0 = .971 .. hey that's a good gain
- G(S1, Temp) = .57
- So.. We know G(S1,Humidity) is the best so we choose Humidity....
- And you keep repeating the process.. if Chang keeps listing more stuff I'm not typing it, you get the idea
- And that's how you build a tree from top to bottom
- - just keep in mind.. just because an attribute gives you the best gain under one subset...
-   doesn't mean it will under others (ie. G(S1, Humidity) is best .. but G(S3, Humidity) might not be)
- Train Fact: Temperature doesn't matter in this set
- More Related Train Fact: These trees can pretty easily get over fit
- - So how the fuck do you deal with that?
- - As the tree grows.. you gotta check if it's overfitted
- - but while it's growing.. that's almost impossible
- - so how do we do it?
- - Just let it overfit with all the training data ... and then prune it
- - So how do you prune it?
- - you use validation data and measure the error
- - then you tentatively remove a leaf node
- - if the error improves .. remove it forever
- - else .. keep it forever
- - and you just kinda keep doing it until pruning any further fucks the error
- EXAM: NEXT THURSDAY: Game -> algorithms .. so up to but not including algorithms
- NEXT ALGORITHM
- Bayesian Learning - Bayes theorem
- Chang's story
- - Participants.. ~10000 people want to check if they have brain cancer by knocking on their head
- - Result: 1% of participants actually had cancer
- - But with Chang's method: 80% of participants that actually have cancer came up positive
-   10% of participants w/o cancer came up positive (false positive)
- - Then the FDA approved this method
- - so then.. you get a new patient: and they get diagnosed as positive for brain cancer
- - but guess what.. odds are 7.5% that he actually has the cancer
- P(BrainCancer) = .01 <-> P(~BrainCancer) = .99
- P(+ | BrainCancer) = .8 <-> P(- | BrainCancer) = .2
- P(+ | ~BrainCancer) = .1 <-> P(- | ~BrainCancer) = .9
- so... P(BrainCancer | +) =
- - note how they look similar to what we have above it.. but they're not!
- - P(B [intersection] +) = P(B) * P(+|B) .. or .. P(+) * P(B | +)
- - P(B|+) = (P(B) * P(+|B)) / P(+)
- *** AND THIS IS BAYES' THEOREM ***** .. that equation
- Using that equation, you can learn something
- P(B) ~> P(!B) = 1 - P(B)
- P(+ | B) ~> P(- | B) = 1 - P(+|B)
- P(+ | !B) ~> P(- | !B) = 1 - P(+ | !B)
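- a minimal Python sketch checking the story's 7.5% with Bayes' theorem:

    p_b = 0.01          # P(BrainCancer)
    p_pos_b = 0.80      # P(+ | BrainCancer)
    p_pos_not_b = 0.10  # P(+ | ~BrainCancer)

    # P(+) by total probability, then P(B | +) by Bayes' theorem
    p_pos = p_b * p_pos_b + (1 - p_b) * p_pos_not_b
    p_b_pos = p_b * p_pos_b / p_pos
    print(round(p_b_pos, 3))  # ~0.075 .. the 7.5% from the story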
- IN TERMS OF MACHINE LEARNING.. BAYES THEORUM IS...
- P(h | D) = (P(D | h) P(h) ) / P(D)
- - given a particular dataset (D), what is the probability this hypothesis (h) is correct
-   according to those specific examples
- P(h) = prior probability of h ,, P(D) = prior probability of D
- P(D | h) = probability that D is consistent w/ h
- Baysian learning:
- Question: What is the most probable hypothesis, given a particular Dataset
- Answer: the hMAP (maximum a posteriori) hypothesis
- - the h with maximum probability given the D: P(h | D)
- - argmax over h in H .. which h maximizes P(h | D) .. that's the ML solution
- argmax = the argument value that maximizes a function
- - argmax of (x + 1)^2 - 1 for x in [0,2] ... x = 2
-   max (same func) ... = 8
- issue is.... you gotta know the 3 values P(D | h) .. P(h) ... P(D)
- - P(D | h) - given an h, what is the probability it classifies D correctly
- Trainfact: You can ignore P(D) since it's the same for all h (doesn't affect argmax over h)
- - P(h) - probability of a certain hypothesis being picked
- :: gotta have some degree of domain knowledge for this to work out
- ::: creates problems if you don't know the domain .. no prior knowledge.. you have a problem
- - if you don't know.. just assume uniform probability
- Another Way::: hML (maximum likelihood hypothesis)
- - assumes P(h) = CONSTANT for all h in H
- - this is what happens when you have no prior knowledge
- - in this situation...
- - hML = argmax over h of P(h|D) = argmax over h of P(D | h) ***since P(D) is constant.. and P(h) is assumed constant***
- Back to Chang's brain cancer fun
- P(B) = .01 =====> P(~B) = .99
- P(+|B) = .8
- P(+|~B) = .1
- in this example... you have 2 hypotheses .. Yes Brain Cancer.. No Brain Cancer
- P(h) ~~~~> P(B) = .01   P(~B) = .99
- So.. we shouldn't use hML.. we should use hMAP because we have domain knowledge
- so... P(D | h) * P(h)
- h = B :: P(D | B) * P(B)
-   - P(+ | B) * P(B) ~~> .8 * .01 = .008
- h = ~B :: P(D | ~B) * P(~B)
-   - P(+ | ~B) * P(~B) ~~> .1 * .99 = .099
- .... So what this is saying.. is a positive label is more likely to come from someone
-   who doesn't have brain cancer.. meaning the method is bad
- More information is better.. so use HMAP over HML whenever you can
- So.. instead of trying to ask what is the best hypothesis... why can't we ask..
-   Given a new instance x>, what is the most likely label v in V?
- P(v | D) = [Summation over all h in H] P(v|h) * P(h|D)
-   = a weighted average of the predictions made by all h
- - better hypothesis.. bigger weight
- this is called hOPT = the optimal (Bayes optimal) classification...
- - argmax over v: [Summation over all h in H] P(v|h) * P(h | D)
- - which v maximizes that expression
- so back to Chang's example...
- - argmax over v in {+,-}: [Summation over B, ~B] P(v|h) * P(h | D)
- ... so
- argmax{ [Summation over B, ~B] P(+|h) * P(h | D) ,
-          [Summation over B, ~B] P(-|h) * P(h | D) }
- so.. Top (+) = .0163 .. Bottom (-) = .0907
- - so Bottom has the higher probability.. so odds are, they are going to be negative for
-   brain cancer
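- a minimal Python sketch of that Bayes optimal pick, reusing the unnormalized P(h|D) values (.008 and .099) from the hMAP step:

    post = {"B": 0.8 * 0.01, "notB": 0.1 * 0.99}     # P(D|h)*P(h): .008, .099
    p_pos = {"B": 0.8, "notB": 0.1}                  # P(+ | h)

    score_pos = sum(p_pos[h] * post[h] for h in post)        # .0163
    score_neg = sum((1 - p_pos[h]) * post[h] for h in post)  # .0907
    print("+" if score_pos > score_neg else "-")             # "-"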
- ANOTHER CLASSIFICATION PROBLEM...
- Assume conjunctive hypothesis...
- h = a1 /\ a2 /\ a3 .... /\ aN
-   where the ai are assumed independent (given the class)
- Q: What is the most probable class of a new instance, given values of attributes
-   P(v | a1,a2,a3 ... aN)
-   argmax over v: P(v | a1,a2,a3 ......) ~~~> JUST APPLY BAYES
- called vMAP .. LABEL max as opposed to HYPOTHESIS max
- Since the attributes are assumed independent.. Bayes simplifies to
-   argmax over v: P(a1 | v) * P(a2 | v) ... * P(aN | v) * P(v) = vNB
- - vNB = naive Bayes (used in spam filters)
- vNB = argmax over v: P(v) * [Product over i] P(ai | v)
- - predicting if a car is stolen
- COLOR  TYPE  ORIGIN  STOLEN?
- R      S     D       Y
- R      S     D       N
- R      S     D       Y
- Y      S     D       N
- Y      S     I       Y
- Y      V     I       N
- Y      V     I       Y
- Y      V     D       N
- R      V     I       N
- R      V     I       Y
- R = Red, Y = Yellow, S = Sedan, V = SUV, D = Domestic, I = Import
- V = {Yes, No} ... we're looking at a red domestic SUV
- = argmax over v: P(v) * P(Red | v) * P(Dom | v) * P(SUV | v)
- v = y: P(Y) * P(R | Y) * P(D | Y) * P(SUV | Y) = 1/2 * 3/5 * 2/5 * 2/5 = .048
- v = n: same with N: 1/2 * 2/5 * 3/5 * 3/5 = .072
- Answer: NO... it's more likely to not get stolen (ie.. ARGMAX(.048, .072) .. so pick the biggest)
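- a minimal naive Bayes sketch over that table (no smoothing, matching the hand counts):

    from collections import Counter

    rows = [("R","S","D","Y"), ("R","S","D","N"), ("R","S","D","Y"),
            ("Y","S","D","N"), ("Y","S","I","Y"), ("Y","V","I","N"),
            ("Y","V","I","Y"), ("Y","V","D","N"), ("R","V","I","N"),
            ("R","V","I","Y")]

    def v_nb(instance):
        """vNB = argmax over v of P(v) * product over i of P(ai | v)."""
        best_v, best_score = None, -1.0
        for v, n_v in Counter(r[-1] for r in rows).items():
            score = n_v / len(rows)                  # P(v)
            for i, a in enumerate(instance):         # each P(ai | v)
                score *= sum(1 for r in rows if r[-1] == v and r[i] == a) / n_v
            if score > best_score:
                best_v, best_score = v, score
        return best_v

    print(v_nb(("R", "V", "D")))  # red SUV domestic -> "N"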
- SELF DELIMITING PREFIX-TREE ENCODING
- - encodes messages
- expected message length = -[Summation]Pi*lg(Pi) (the entropy again)
- say we have .. Z1 = Hi, Z2 = Hungry?, Z3 = Thirsty?, Z4 = Bye
- this way .. if both parties agree on the encoding method you can use very few bits
- say Z1 = 0, Z2 = 1, Z3 = 10, Z4 = 11 ..
- - creates problems.. when you receive the bit "1", you have to wait for the next bit
-   (is it Z2, or the start of Z3/Z4?)
- - we don't want any codeword to be a prefix of another one (a prefix-free code) to avoid this problem
- - so.. Z1 = 0, Z2 = 10, Z3 = 110, Z4 = 111
- - so you know to stop listening when you get a 0 (or three 1s)
- - this is the ideal method to achieve the minimum number of bits
- - we need to determine which messages are used most...
- .. so if Z4 (Bye) is used most, make Z4 = 0
- - "Description Length" Li (length of message i) = -lg(Pi) Pi= probablity of messsage i
- - if the probability is low, you can use a long length,
- if its high.. use small bits
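- a minimal Python sketch of those code lengths and of decoding the prefix code above (the message probabilities are made up):

    from math import log2

    probs = {"Z1": 0.5, "Z2": 0.25, "Z3": 0.125, "Z4": 0.125}
    for msg, p in probs.items():
        print(msg, "ideal length =", -log2(p))   # 1, 2, 3, 3 bits

    code = {"0": "Z1", "10": "Z2", "110": "Z3", "111": "Z4"}

    def decode(bits):
        out, buf = [], ""
        for b in bits:         # since no codeword is a prefix of another,
            buf += b           # we can emit as soon as the buffer matches
            if buf in code:
                out.append(code[buf])
                buf = ""
        return out

    print(decode("0101100"))   # ['Z1', 'Z2', 'Z3', 'Z1']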
- hMAP = argmax over h in H of P(D | h) * P(h)
- ... we use this cus we know the probability of each hypothesis
-     (take -lg and "most probable" becomes "shortest description length")
- NEURAL NET:
- (this is our last topic)
- *Basically a bunch of Perceptrons*
- They're trying to simulate how a brain actually works
- Perceptron works well as long as the data is linearly separable
- - ie.. 1 or 0 as the response, and inputs x0,x1,x2 each with their own weight
- PERCEPTRON RULE: Wi <- Wi + Y*(t - o)*xi
- where Y is the learning speed, t the target, o the output
- well.. what happens when data isn't linearly separable?
- - the line is always moving.. (if data is non linear) it doesn't really converge to a correct
-   weighting
- Well how to solve this?
- - Calculate an Error per particular set of W's (weights)
- E(w>) = 1/2 * [Summation over d in D](td - od)^2
- so we want to find the w> where there's a global minimum for errors (the error surface is a
-   paraboloid.. note the 3d parabola)
- - so we travel down along that until we get the lowest error (hill climbing, but downhill)
- - this is called the "GRADIENT DESCENT ALGORITHM"
- *we need to find the gradient vector of E(w>)
-   which is ▽E(w>) = [dE/dw1  dE/dw2  ...  dE/dwn]
- so..
-   w> <- w> - Y * ▽E(w>)
- ---- DELTA RULE *using gradient descent*
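- a minimal Python sketch of gradient descent with that delta rule on a linear unit (the toy data, learning rate, and epoch count are made up):

    def delta_rule(examples, lr=0.05, epochs=200):
        d = len(examples[0][0])
        w = [0.0] * d
        for _ in range(epochs):
            grad = [0.0] * d
            for x, t in examples:
                o = sum(wi * xi for wi, xi in zip(w, x))             # linear output
                grad = [g - (t - o) * xi for g, xi in zip(grad, x)]  # dE/dwi
            w = [wi - lr * g for wi, g in zip(w, grad)]              # w <- w - Y*▽E(w)
        return w

    # learn t = 2*x1 + 1, using augmented inputs <x0 = 1, x1>
    data = [((1, 0), 1), ((1, 1), 3), ((1, 2), 5)]
    print(delta_rule(data))   # approaches [1.0, 2.0]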
- But.. We have a problem.. if we actually wanna classify some weirdly separable data
-   we can't just have lines (the ▽E(w>) method will still just give a bunch of lines)
- How to solve THAT problem?
- - remember the threshold unit (the steppy looking wave)
- - we can't use that.. it's not a continuous (differentiable) function
- Well.. we have a SIGMOID FUNCTION
- *****S(y) = 1/(1+ e^-y)*******
- .. we can control the stiffness of the function of it to get it very close to a step function
- Sigmoid does some weird shit when you take the derivative
- dS(y)/dy = d/dy(1/(1+e^-y))
- dS(y)/dy = d/dy(1 + e^-y)^-1
- = -(1 + e^-y)^-2 * (e^-y)
- = e^-y / (1 + e^-y)^2
- = 1/(1+e^-y) * ((1 + e^-y - 1) / (1+e^-y))
- = S(y) * (1 - S(y)) <--- so the derivative is really easy to calculate
- so .. ex... *remember.. x0 is always 1*
- w0 = -30, w1 = 20, w2 = 20
- x1 x2 w>*x> S(w>*x>)
- 0 0 -30 0
- 0 1 -10 0
- 1 0 -10 0
- 1 1 10 1
- note.. S(w>*x>) is effectively an AND function
- ex 2.
- w0 = 10, w1 = -20, w2 = -20
- x1 x2 w>*x> S(w>*x>)
- 0 0 10 1
- 0 1 -10 0
- 1 0 -10 0
- 1 1 -30 0
- this is effectively a NOR function, which makes it easy to build a simple OR
- if you flip all the signs (w0 = -10, w1 = w2 = 20).. you get an OR function btw
- ex 3. XNOR (so.. note, XNOR is an OR of an AND and a NOR)
- x1 x2 XOR XNOR AND NOR
- 0 0 0 1 0 1
- 0 1 1 0 0 0
- 1 0 1 0 0 0
- 1 1 0 1 1 0
- AND w0 = -30, w1 = 20, w2 = 20
- NOR w0 = 10, w1 = -20, w2 = -20
- OR unit: feed the AND output in as x1 and the NOR output in as x2, with w0 = -10, w1 = 20, w2 = 20
- to get an XNOR which will work
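- a minimal Python sketch of that XNOR network, using the weights above:

    from math import exp

    def sigmoid(y):
        return 1 / (1 + exp(-y))

    def unit(w0, w1, w2, x1, x2):
        return sigmoid(w0 + w1 * x1 + w2 * x2)   # x0 is always 1

    def xnor(x1, x2):
        a = unit(-30, 20, 20, x1, x2)    # AND
        n = unit(10, -20, -20, x1, x2)   # NOR
        return unit(-10, 20, 20, a, n)   # OR of (AND, NOR) = XNOR

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, round(xnor(x1, x2)))   # 1, 0, 0, 1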
- More NN
- - first is the input layer, then the next layer (all fully connected) and you keep stacking layers
- - each node does a weighted sum fed through a Sigmoid.. and the last layer is the output layer
- - layers in between the input and output layer are ''hidden layers''
- ex. Finding a cat / dog / horse in a picture
- - each pixel is fed into the input layer and flows through to the output layer
- - the output layer spits out 0-1 for each node (.8, .1, .1) as a vector for the 3 classes
- - so since cat would be .8 .. odds are there's a cat in the picture
- - BUT WAIT! there's a lot of pixels
- - ya gotta teach the NN to compress the picture
- - the first layer has 1 million nodes
- - next layer has say 100k nodes
- - and the next has 10k
- - and then it starts to go back up
- - so basically.. it squeezes the image down, and then expands it back until it gets good at
-   compressing in a relatively lossless way
- - So then you cut that network in half and just use the compression half to feed the small
-   representation into the cat / dog / horse NN
- FEED FORWARD BACK PROPAGATION
- x --w1-->[SIGMA]-- x*w1 = n1--->[sigmoid]- S(n1) = y --->[Sigma]---y*w2 = n2 --->[sigmoid]-S(n2)->O(x)
- where O(x) = z<--- prediction, and t is target value
- E = 1/2 * (z-t)^2 <-------- error value so ya gotta minimize this
- ... so now you have this error value, you back propagate
- now you need to measure the rate of change of E with w2 (since we touch w2 first in back prop)
- d(E)/d(w2) .. but you can't directly measure the change of E with respect to w2
- .. but you can with n2
- .. d(n2)/d(w2) .. if you change w2.. how much will n2 change
- ... sooooo .. d(E)/d(w2) = d(E)/d(Sigmoid(n2)) * d(Sigmoid(n2))/d(n2) * d(n2)/d(w2)
-     d(E)/d(w2) = (z - t) * Sigmoid(n2)*(1 - Sigmoid(n2)) * Sigmoid(n1)
- so now.. instead of crazy computations.. we just have some simple math
-   to get the error change with respect to changing w2
- ..now for w1 ..
- d(E)/d(w1) = d(E)/d(z) * d(z)/d(n2) * d(n2)/d(y) * d(y)/d(n1) * d(n1)/d(w1)
-               known        known       = w2      = Sigmoid(n1)*    = x
-                                                    (1 - Sigmoid(n1))
-   "known" = already calculated in d(E)/d(w2).. the last three are the only new work
- as you add layers / more nodes in each layer you start needing to
- take derivatives of vectors with respect to the error value.. but it's the same concept
- NN ALGORITHM:::::
- 1. Init WEIGHTS to random matrix WEIGHTS = [w1> w2> ... wN>]
- (this represents the whole NN's weights)
- 2. DO until Learned
- 2.1 -> Feed Forward
- 2.2 -> Calculate E (1/2 * (z-t)^2)
- 2.3 -> Back Prop E and Adjust WEIGHTS
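- a minimal Python sketch of that loop for the 1-1-1 chain above (x -> w1 -> sigmoid -> w2 -> sigmoid -> z); the input, target, learning rate, and iteration count are made up:

    from math import exp

    def S(y):
        return 1 / (1 + exp(-y))

    def train_step(w1, w2, x, t, lr=0.5):
        # 2.1 feed forward
        n1 = x * w1; y = S(n1)
        n2 = y * w2; z = S(n2)
        # 2.2 / 2.3 back prop E using the derivatives derived above
        dz = (z - t) * z * (1 - z)            # (z-t) * S(n2)*(1-S(n2))
        dw2 = dz * y                          # d(E)/d(w2)
        dw1 = dz * w2 * y * (1 - y) * x       # d(E)/d(w1)
        return w1 - lr * dw1, w2 - lr * dw2

    w1, w2 = 0.5, -0.5
    for _ in range(1000):
        w1, w2 = train_step(w1, w2, x=1.0, t=0.9)
    print(S(S(1.0 * w1) * w2))   # output drifts toward the target 0.9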
- btw.. no one implements their own NN unless they're just trying to be cool.. just use TensorFlow
- BUT WAIT... What the fuck does "Learned" mean.. when do you stop?
- - Now remember Kids! split your data into 60% Training, 20% Validation, 20% Test
- - and you know when to stop based on the 20% validation
- ERROR
- |* # Evalidation
- |* # #
- | * # #
- | * #<------- d(Evalidation)/d(Epoch) = 0 ... so stop at EPOCH val here
- | * Etrain
- -------------*-------EPOCH
- so.. once your Evalidation is at a minimum.. record the Epoch where it occurs
- ... then retrain it until that Epoch
- ... Then once its done that.. test it on the 20% test.. and boom you have your
- learned AI (% of correct in 20% test data is how good it is)...
- oh and if your test data % sucks ass you gotta rethink your model
- Support Vector Machine (SVM)
- - originally made by some Russian dude (Vapnik) and just kinda jacked to use like NNs
- - Suppose you have a bunch of linearly separable data
- - and there are an infinite number of lines that will work.. perceptron will just find one of them
- BUT I WANT THE BEST ONE!
- - you need to choose a line that will play nicely with potentially new data
- - and new data that fits a certain classification will be near data of that same
-   classification (like k-Nearest Neighbor)
- - you need to have a large margin between the line of separation and the data,
-   that's the ideal method
- - so effectively, on top of looking at error, you look for the line with the largest margin
- FINDING THAT LINE
- - you need 2 points on one side (to make the first line)
-   and a 3rd point on the other side (of the other data) to make the parallel line
- - and the middle of the 2 parallel lines is the good line
- .. so Top line == Wx + b = 1, middle (good line) == Wx + b = 0, bottom line == Wx + b = -1
- so to get the best .. you need ... width = (W> / ||W>||) . (x2 - x1)
- so you wanna find the W that maximizes that width (. = dot product btw)
- .. which is the same as minimizing ||W>||^2
- ..... SO! SVM: minimize(||W>||^2) ..
-   subject to.. Yi(<w,xi> + b) >= 1
- .. btw no one implements this .. to optimize it you need to use a modified Lagrangian.. "KKT"
-   but yea just use a library
- SVM Classification: F(x) = sign(<w,x> + b)
- Training: minimize 1/2*||w||^2
-   over w, b <-- trying to get the w and b to use in the classification
-   s.t. Yi(<w, xi> + b) >= 1
- USING KKT -> Classification now equals ===== f(x) = sign([Summation i]ai*yi*<x,xi> + b)
-   xi = training example... x is the new instance (dot product them, then multiply by yi and ai, and add b)
-   a (alpha) is the Lagrange multiplier ... and remember...
-   there could be millions of xi's .. but... how is this better? .. a lot of the ai's are 0
-   meaning the calculation is pretty quick
-   only the points on the gutter (the margin lines) have non-zero alphas
- - those points are called "support vectors"
-   hence the name... SUPPORT VECTOR MACHINE
- KERNAL TRICKS
- - Very powerful and efficient compared to NN
- .. suppose you wanna map each point through some [Phi] into a higher-dimensional space
-   computing [Phi](a) . [Phi](b) explicitly <-- Disgustingly slow
- you can classify potentially non-linearly-separable data using a transformation going from say R2 --> R3
- - just gotta figure out how the dot product behaves in the future (feature) space
- - once you do that .. you don't even need to calculate what [Phi](x) exactly is .. just dot product that shit
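- a minimal Python sketch of that kernel trick: for [Phi](x) = (x1^2, sqrt(2)*x1*x2, x2^2), the kernel K(a,b) = (a . b)^2 equals [Phi](a) . [Phi](b) without ever computing [Phi]:

    from math import sqrt

    def phi(x):                          # explicit map R^2 -> R^3
        return (x[0]**2, sqrt(2) * x[0] * x[1], x[1]**2)

    def K(a, b):                         # kernel: dot product in the future space
        return (a[0]*b[0] + a[1]*b[1]) ** 2

    a, b = (1.0, 2.0), (3.0, 0.5)
    explicit = sum(p * q for p, q in zip(phi(a), phi(b)))
    print(explicit, K(a, b))             # same number (16.0), computed two ways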