- MODERN AI
- - Machine Learning
- - LEARNING: Adapting / improving one's behavior through experience
- -- To measure an improvement.. you need .. ya know.. a way to measure performance
- -- You also need a bunch of things to experience
- -- Behavior = behavior in a given task
- - ex. AI
- - put an Agent (soft/hardware) in an environment .. and see the action
- - environment ---> Agent --<--> action
- - ex. Neural Net
- - ALVINN - first self driving vehicle
- - Neural network mimics the human brain
- - problem with this.. is it predicts really well.. but no one really knows why it does
- - ex. KDD (Knowledge Discovery from Data)
- - OG datamining
- - ex. Bayesian Learning
- - Spam mail filters, algorithm that learns what is spam
- - Bayes rule: P(A|X) = (P(A) * P(X|A)) / [Summation over all A'](P(A') * P(X|A'))
- - SVM and HMM
- - This is a poorly understood field.. no one really knows how it works
- - Large Search Space
- How Machine Learning Can Be Used
- - ex. Netflix - even when it first came out and was a mail system... their recommendations were really good
- - Dot Product
- - Gets the customer preferences.. Action? Comedy? Scifi?
- - Gets an expert review of the movie..
- - used the dot product to determine if the person would like the movie
- - checks a threshold.. if the threshold is > the dot product.. don't recommend
- - What if the expert has different preferences compared to the individual?
- - ideally you'd use the customer's own ratings ... but without the customer rating everything .. how?
- - you learn what the customer likes based off of the movies they watched
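- a minimal Python sketch of that dot-product recommender (the genre names, preference numbers, and threshold are made up for illustration):

    def recommend(customer_prefs, movie_features, threshold):
        """Recommend the movie when the dot product clears the threshold."""
        score = sum(p * f for p, f in zip(customer_prefs, movie_features))
        return score > threshold

    # genres: <action, comedy, sci-fi>
    customer = [0.9, 0.1, 0.7]   # learned from the movies they watched
    movie    = [0.8, 0.0, 0.9]   # expert's review of the movie
    print(recommend(customer, movie, threshold=1.0))  # True -> recommend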
- Designing a Learning System
- - Learning: Improve one's performance(P) at a task(T) through a bunch of experience(E)
- - ex. Checkers
- * T: Playing Checkers
- * P: # Wins / # pieces captured (can vary)
- * E: Playing a game (usually with itself)
- -> Must Represent future distribution of data
- - ie. it must be similar to the real world environment you're sticking it in
- - <instance, label> = "Example"
- - such as.. attributes of the weather in a day (instance) && whether it rained (label)
- - instances are vectors
- h([instanceVector]) = 1 or 0 ... 1 means yes rain, 0 means no rain
- DATA MUST BE ACCURATE / REPRESENT FUTURE DISTRIBUTION OF DATA OR IT WILL FAIL
- HOW TO DESIGN A LEARNING SYSTEM
- - Task (T)
- - Performance measure (P)
- - Experience (E)
- ex. < <wind,temp,humidity>, Sunny >
-          instance            label
- the whole thing: EXAMPLE
- we assume there exists a function, that generates a label based on an instance
- that function is called "Target Concept"
- - You can't get that target concept correct.. but you can get a "Target Approximation"
- - only way we can see how close the Target Approximation is to the Target Concept ..
- you need to test it against an example
- So you need....
- Target Function to Learn
- Representation of Target Function
- Learning Mechanism
- ex. Checker Player
- - TargetFunction(BoardConfig) -> BestMove
- - This isn't a good idea .. best move might result in a bad move later
- - TargetFunction(BoardConfig) -> Score
- - find a board configuration that guarantees a win ... that gets the highest score
- - find a board configuration that guarantees a loss .. that gets the lowest score
- - finding a board configuration that guarantees the winning configuration .. gets highest - 1
- - This is better .. it's more forward thinking.. whereas the previous one isn't
- From here.. you choose the TargetFunction that outputs the score to be the one you learn :)
- - we can't guarantee it's the best.. but we can know it's pretty good
- - So.. how to represent the target function?
- - Easiest Way = A Look Up Table containing the score of a bunch of configurations
- - Decision Tree
- - Neural Net - Learns the scores
- - Lets pick the easiest way.. a Linear Combination - "Hypothesis"
- TargetFunction(BoardConfig) = (numKings)*(weightKing) + (numQueens)*(weightQueen) +
- (numRooks)*(weightRook) .... and so on
- - Easy part is counting the pieces.. the hard part is learning the weights
- - Everything is learning those weight values
- - Learning the TargetConcept has been reduced to learning the Weights
- - So how do we do that?
- - Learning Mechanisms
- - They are going to learn from Examples: "Training Examples"
- - in this example.. it's a <boardConfig, realLabel>
- ex. < <0 1 2 ....>, -100 >
-        K Q R ....   Lost Score
- - Task has been reduced to finding the Weights that best fit the examples
- - E = [Summation over all examples](targetConcept - targetFunction)^2
- - Bigger E.. more error, then you make new Weights to try to reduce the Error
- But.. how do you figure out how to do that?
- WeightI <- WeightI + Y * (TargetConcept - TargetFunction) * numPieceI
- Y = Learning Rate
- - Used to increase / decrease each weight little by little to avoid overshooting
- - Really small # .. like .001
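- a minimal Python sketch of that weight-update rule (the starting weights and the example values are made up for illustration):

    def evaluate(weights, piece_counts):
        """TargetFunction(BoardConfig): weighted linear combination of piece counts."""
        return sum(w * n for w, n in zip(weights, piece_counts))

    def update(weights, piece_counts, target_score, learning_rate=0.001):
        """One step: nudge each weight to shrink (target - prediction)."""
        error = target_score - evaluate(weights, piece_counts)
        return [w + learning_rate * error * n for w, n in zip(weights, piece_counts)]

    weights = [0.0, 0.0, 0.0]        # <weightKing, weightQueen, weightRook>
    config, label = [0, 1, 2], -100  # the <boardConfig, realLabel> example above
    weights = update(weights, config, label)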
- What we have so far...
- Unknown Target Function: C : X -> Y
- Training Examples generated by the Unknown Target Function: (vector, label)
- Hypothesis Space: all possible values for the weight values
- -The goal of the learning is to find the best values from the space
- A Learning Algorithm will take the Examples and the Hypothesis space
- - Searches Hypothesis space to find the best hypothesis
- "FINAL HYPOTHESIS"
- LEARNING ALGORITHM: for Concept Learning using Linear Combinations
- Instance Space (X) = <vectors of attributes> in R^d
- Label: Y = {1,-1} <- "CONCEPT LEARNING" true/false sorta learning
- Hypothesis Space (H) - Assuming Linear Functions
- - Weighted Linear Combinations of attributes
- [Summation i=1..d] Wi*Xi > Q : yes
-                          < Q : no
- where Q is a threshold
- H(x) = ([Summation i=1..d] Wi*Xi - Q) > 0 : yes
-                                       < 0 : no
- H = {w> | w> in R^(d+1)}
- x> = <x1, x2> ... augmented: x> = <x0 = 1, x1, x2> (so w0 can absorb the threshold Q)
- H(x>) = sign(w>TRANSPOSE * x>)
- Okay.. so how do we use this?
- Implementing it using an Iterative Method
- 1. Init weight vector 1 as the 0 vector (w>(1) = 0>) *or some other small random vector close to 0
- 2. Do Until Learned
-      for t = 1 .... |D|   *where D is the example set*
-        Make a prediction on example t: ^y(t) = sign(w>(t)TRANSPOSE * x>(t))
-        w>(t+1) <- w>(t) + z*(y(t) - ^y(t)) * x>(t)   *z = learning rate  *t+1 is the updated w>  *mistake bound*
- Since it's mistake bound, you can make it more efficient with an if statement
-   if (y(t) != ^y(t)) --> w>(t+1) <- w>(t) + z*y(t) * x>(t)
- This is Perceptron! Works really well if the solution is linearly separable!
- w0 moves the line up and down, the other weights change the angle of the line
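- a minimal Python sketch of that perceptron loop (the toy OR dataset and learning rate are made up for illustration):

    def sign(v):
        return 1 if v > 0 else -1

    def perceptron(examples, z=0.1, max_epochs=100):
        d = len(examples[0][0])
        w = [0.0] * (d + 1)                      # w[0] is the bias weight (x0 = 1)
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in examples:
                xa = [1.0] + list(x)             # augmented instance
                y_hat = sign(sum(wi * xi for wi, xi in zip(w, xa)))
                if y_hat != y:                   # mistake-bound update
                    w = [wi + z * y * xi for wi, xi in zip(w, xa)]
                    mistakes += 1
            if mistakes == 0:                    # converged (linearly separable)
                break
        return w

    examples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # OR
    print(perceptron(examples))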
- 2 Things left to cover this semester
- 1. PRINCIPLE OF LEARNING
- 2. ALGORITHM OF LEARNING
- LEARNING AS A GAME -> Game = something with precisely well defined rules
- - Supervised Learning
- - Perceptron is Supervised Learning: the y(t), (actual label) is the teacher
- - Teacher gives x> (attributes of a problem), Learner guesses ^y (label)
- and Teacher returns error with Sum(y - ^y)^2 so the learner can adjust
- - Unsupervised Learning
- - Clustering: If you don't have a label, you just slam similar things into a category
- - How do you use Binary classification algorithm for 3 different classifications?
- - you can use 3 different Perceptrons!
- - one for A and !A, B and !B, C and !C
- CONCEPT LEARNING - Binary Classifications
- - Concept(C) C:X -> {-1,1}
- - INSTANCE SPACE: X = {x1>, x2>, x3> ....} vectors are a list of attributes
- - Example Space: D = {<x1>, y1> , <x2>, y2> ...}
- - HYPOTHESIS SPACE: H <- representation of C *the Neural net or line to best separate data*
- *literally just a search algorithm to find that line that best separates the data*
- daily reminder: if your assumption is totally wrong, your solution will be really fucked
- Based on Search Criteria you can have....
- - Supervised Learning - Learning from Examples (classification problems)
- - Unsupervised learning - Learning w/o Examples (Clustering)
- - Reinforcement - Agent is in an Environment, then it reacts .. if it's a good reaction
-   it is rewarded, if not it is punished
- All algorithms / searching can be...
- - Online Learning
- - ex. it learns on the job.. makes mistakes on the job but learns from it
- - Offline Learning
- - ex. Fraud Detection for CC companies.. you have to train it before you put it in use
- if Mistakes aren't fatal, you can use Online.. otherwise offline is the way to go
- - Hybrids Exist btw.. but still
- ***IMPORTANT CONCEPT***
- OVERFITTING vs. GENERALIZATION
- - Remember, the goal of the algorithm is to classify instances
- - The Neural Network could have a bias towards the examples it learned from
- - to test how the net will work, exclude 20% of your data to test the final algorithm on
- - if it fucks up on the new (20%) data, your algorithm is overfitted..
- - N Data: 20% - Test Set, 20% - Validation Set, 60% - Training Set
-   each split gets its own error: test error, validation error, training error
- TEST ERROR IS IMPORTANT ONE: only test it once at the very very end
- DECISION TREE LEARNING
- - most medical stuff is this
- - Learning discrete valued target concept
- - learned hypothesis is represented as a tree (called decision tree)
- - How the tree is set up
- - Root Node = attribute to test
- - Branches are possible values for tested attribute
- ex. Play Tennis?
- Outlook
- / | \
- sunny overcast rainy
- / | \
- Humidity [yes] wind
- / \ / \
- high low strong weak
- | | | |
- [NO] [yes] [NO] [YES]
- you can look at it logically like this...
- (outlook=sunny /\ humidity=Low) = Yes
- (outlook=overcast) = YES
- ... and so on
- QUINLAN
- construct a tree from the top to bottom
- - the root node, needs to be the most useful attribute
- - the attribute classifies the most data
- - but uh.. how do we do that?
- finding the best attribute: Entropy (read: RANDOMNESS)
- - Entropy: Purity / impurity of the set
- - choose attribute that gets the lowest entropy
- Entropy(S) = -[Summation from i=1 to C] Pi*lg(Pi)  :where C is # of classes
- *note, Pi isn't pi, it's P[sub i] = the proportion of class i in S
- Entropy(S)'s graph (for 2 classes) looks like an upside-down parabola...
- - so if each class is 1/2 and 1/2 .. entropy becomes 1
-   and if there is only 1 class, entropy becomes 0
- Basically, you use the attribute that gives the largest entropy reduction (information gain)
- D ROOT: Entropy(Set)
- / | \
- S1 S2 S3
- E(S1) E(S2) E(S3)   weighted average of the E(Si): Information gain = difference between root and here
- High entropy at Root = mixed bag; if the next level is low .. then that means the lower attributes
-   are more pure... Higher entropy reduction = better Root Attribute
- So at each level, you need to ensure you get the maximum information Gain...
- So.. How do you calculate information gain?
- Information Gain: Expected reduction in entropy caused by partitioning the examples
- based on an attribute
- G(S,A) = E(S) - [Summation over v in values(A)] (|Sv|/|S|) * E(Sv)
- G = info gain, S=Set, A = attribute
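- a minimal Python sketch of those two formulas (the attribute indexing is made up; rows would follow the tennis table below):

    from math import log2
    from collections import Counter

    def entropy(labels):
        """E(S) = -sum over classes of Pi * lg(Pi)."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """G(S,A) = E(S) - sum over values v of (|Sv|/|S|) * E(Sv)."""
        n = len(labels)
        gain = entropy(labels)
        for value in set(row[attr] for row in rows):
            sv = [y for row, y in zip(rows, labels) if row[attr] == value]
            gain -= (len(sv) / n) * entropy(sv)
        return gain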
- EXAMPLE: Tennis
- Day  Outlook   Temp  Humidity  Wind    PlayTennis?
- 1    Sunny     Hot   High      Weak    No
- 2    Sunny     Hot   High      Strong  No
- 3    Overcast  Hot   High      Weak    Yes
- 4    Rain      Mild  High      Weak    Yes
- 5    Rain      Cool  Normal    Weak    Yes
- 6    Rain      Cool  Normal    Strong  No
- 7    Overcast  Cool  Normal    Strong  Yes
- 8    Sunny     Mild  High      Weak    No
- 9    Sunny     Cool  Normal    Weak    Yes
- 10   Rain      Mild  Normal    Weak    Yes
- 11   Sunny     Mild  Normal    Strong  Yes
- 12   Overcast  Mild  High      Strong  Yes
- 13   Overcast  Hot   Normal    Weak    Yes
- 14   Rain      Mild  High      Strong  No
- E(S) = -9/14*lg(9/14) - 5/14*lg(5/14) = .94 (9 and 5 because of 9 yes and 5 no)
- G(S, Outlook)
- - Sunny, Overcast, Rainy
- - Sunny: 2 yes, 3 no --> E(Outlook=Sunny)= -2/5log(2/5) - 3/5log(3/5)
- = .971
- - OverCast: 4 yes, 0 no --> E(outlook=overcast) = -4/4log(4/4) - 0/4log(0/4)
- = 0.0
- - Rainy: 3 yes, 2 no --> equation --> .971
- - so ... G(S,outlook) = .94 - 5/14*.971 - 4/14*0 - 5/14*.971
-   G(S,outlook) = reduces entropy by... .246
- G(S, Wind)
- - Weak and Strong
- - Weak: 6 yes, 2 no --> E(Wind=Weak) = -6/8*lg(6/8) - 2/8*lg(2/8)
-                                      = 0.811
- - Strong: 3 yes, 3 no --> E(Wind=Strong)
-     = 1 (no need for calculation since we know entropy of a 1/2, 1/2 split = 1)
- - so... G(S, Wind) = .94 - 8/14*.811 - 6/14*1
-     = .048 .. so G(S, Wind) reduces entropy by .048 .. not too good
- G(S, Humidity)
- - High and Normal ----> produces .152
- G(S, Temperature) = .029
- So.. out of all of those, Outlook gives the largest information gain
- so now.. E(S1) = .971 .. and you just apply the same process as above to pick the next attribute
- G(S1, Wind) (when outlook is sunny, what does wind impact)
- - Strong, Weak
- - Weak = 1 yes 2 no .. Strong = 1 yes 1 no
-   E(Wind=Weak) = -1/3*lg(1/3) - 2/3*lg(2/3) = .918
-   E(Wind=Strong) = 1 (since it's a 1/2, 1/2 split)
- - so... G(S1, Wind) = .971 - 3/5*.918 - 2/5*1 ≈ .02
- G(S1, Humidity)
- - High and Normal
- - High = 0 yes 3 no .. Normal = 2 yes 0 no
- E(Humidity=high) = 0
- E(humidity=normal) = 0
- so.. G(S1, humidity) = .971 - 0 - 0 = .971 .. hey that's a good gain
- G(S1, Temp) = .57
- So.. We know G(S1,Humidity) is the best so we choose Humidity....
- And you keep repeating the process.. if Chang keeps listing more stuff I'm not typing it, you get the idea
- And that's how you build a tree from top to bottom
- - just keep in mind.. just because an attribute gives you the best gain under one subset...
-   doesn't mean it will under others (ie. G(S1, Humidity) is best .. but G(S3, Humidity) might not be)
- Train Fact: Temperature doesn't matter in this set
- More Related Train Fact: These trees can pretty easily get over fit
- - So how the fuck do you deal with that?
- - As the tree grows.. you gotta check if it's overfitted
- - but while it's growing.. that's almost impossible
- - so how do we do it?
- - Just let it overfit with all the training data ... and then prune it
- - So how do you prune it?
- - you use validation data and measure the error
- - then you tentatively remove a leaf node
- - if the error improves .. remove it forever
- - else .. keep it forever
- - and you just kinda keep doing it until pruning any further fucks the error
- EXAM: NEXT THURSDAY: Game -> algorithms .. so up to but not including algorithms
- NEXT ALGORITHM
- Bayesian Learning - Bayes theorem
- Chang's story
- - Participants.. ~10000 people want to check if they have brain cancer by knocking on their head
- - Result: 1% of participants actually had cancer
- - But with Chang's method: 80% of participants that actually have cancer came up positive
-   10% of participants w/o cancer came up positive (false positive)
- - Then the FDA approved this method
- - so then.. you get a new patient: and they get diagnosed as positive for brain cancer
- - but guess what.. odds are 7.5% that he actually has the cancer
- P(BrainCancer) = .01 <-> P(~BrainCancer) = .99
- P(+ | BrainCancer) = .8 <-> P(- | BrainCancer) = .2
- P(+ | ~BrainCancer) = .1 <-> P(- | ~BrainCancer) = .9
- so... P(BrainCancer | +) =
- - note how they look similar to what we have above it.. but they're not!
- - P(B [intersection] +) = P(B) * P(+|B) .. or .. P(+) * P(B | +)
- - P(B|+) = (P(B) * P(+|B)) / P(+)
- *** AND THIS IS BAYES' THEOREM ***** .. that equation
- Using that equation, you can learn something
- P(B) ~> P(!B) = 1 - P(B)
- P(+ | B) ~> P(- | B) = 1 - P(+|B)
- P(+ | !B) ~> P(- | !B) = 1 - P(+ | !B)
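- a minimal Python sketch checking the story's 7.5% with Bayes' theorem:

    p_b = 0.01          # P(BrainCancer)
    p_pos_b = 0.80      # P(+ | BrainCancer)
    p_pos_not_b = 0.10  # P(+ | ~BrainCancer)

    # P(+) by total probability, then P(B | +) by Bayes' theorem
    p_pos = p_b * p_pos_b + (1 - p_b) * p_pos_not_b
    p_b_pos = p_b * p_pos_b / p_pos
    print(round(p_b_pos, 3))  # ~0.075 .. the 7.5% from the story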
- IN TERMS OF MACHINE LEARNING.. BAYES THEORUM IS...
- P(h | D) = (P(D | h) P(h) ) / P(D)
- - given a particular dataset (D), what is the probability this hypothesis (h) is correct
-   according to those specific examples
- P(h) = prior probability of h ,, P(D) = prior probability of D
- P(D | h) = probability that D is consistent w/ h
- Baysian learning:
- Question: What is the most probable hypothesis, given a particular Dataset
- Answer: the hMAP (maximum a posteriori) hypothesis
- - the h with maximum probability given the D: P(h | D)
- - argmax over h in H .. which h maximizes P(h | D) .. that's the ML solution
- argmax = the argument value that maximizes a function
- - argmax of (x + 1)^2 - 1 for x in [0,2] ... x = 2
-   max (same func) ... = 8
- issue is.... you gotta know the 3 values P(D | h) .. P(h) ... P(D)
- - P(D | h) - given an h, what is the probability it classifies D correctly
- Trainfact: You can ignore P(D) since it's the same for all h (doesn't affect argmax over h)
- - P(h) - probability of a certain hypothesis being picked
- :: gotta have some degree of domain knowledge for this to work out
- ::: creates problems if you don't know the domain .. no prior knowledge.. you have a problem
- - if you don't know.. just assume uniform probability
- Another Way::: hML (maximum likelihood hypothesis)
- - assumes P(h) = CONSTANT for all h in H
- - this is what happens when you have no prior knowledge
- - in this situation...
- - hML = argmax over h of P(h|D) = argmax over h of P(D | h) ***since P(D) is constant.. and P(h) is assumed constant***
- Back to Chang's brain cancer fun
- P(B) = .01 =====> P(~B) = .99
- P(+|B) = .8
- P(+|~B) = .1
- in this example... you have 2 hypotheses .. Yes Brain Cancer.. No Brain Cancer
- P(h) ~~~~> P(B) = .01   P(~B) = .99
- So.. we shouldn't use hML.. we should use hMAP because we have domain knowledge
- so... P(D | h) * P(h)
- h = B :: P(D | B) * P(B)
-   - P(+ | B) * P(B) ~~> .8 * .01 = .008
- h = ~B :: P(D | ~B) * P(~B)
-   - P(+ | ~B) * P(~B) ~~> .1 * .99 = .099
- .... So what this is saying.. is a positive label is more likely to come from someone
-   who doesn't have brain cancer.. meaning the method is bad
- More information is better.. so use HMAP over HML whenever you can
- So.. instead of trying to ask what is the best hypothesis... why can't we ask..
-   Given a new instance x>, what is the most likely label v in V?
- P(v | D) = [Summation over all h in H] P(v|h) * P(h|D)
-   = a weighted average of the predictions made by all h
- - better hypothesis.. bigger weight
- this is called hOPT = the optimal (Bayes optimal) classification...
- - argmax over v: [Summation over all h in H] P(v|h) * P(h | D)
- - which v maximizes that expression
- so back to Chang's example...
- - argmax over v in {+,-}: [Summation over B, ~B] P(v|h) * P(h | D)
- ... so
- argmax{ [Summation over B, ~B] P(+|h) * P(h | D) ,
-          [Summation over B, ~B] P(-|h) * P(h | D) }
- so.. Top (+) = .0163 .. Bottom (-) = .0907
- - so Bottom has the higher probability.. so odds are, they are going to be negative for
-   brain cancer
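- a minimal Python sketch of that Bayes optimal pick, reusing the unnormalized P(h|D) values (.008 and .099) from the hMAP step:

    post = {"B": 0.8 * 0.01, "notB": 0.1 * 0.99}     # P(D|h)*P(h): .008, .099
    p_pos = {"B": 0.8, "notB": 0.1}                  # P(+ | h)

    score_pos = sum(p_pos[h] * post[h] for h in post)        # .0163
    score_neg = sum((1 - p_pos[h]) * post[h] for h in post)  # .0907
    print("+" if score_pos > score_neg else "-")             # "-"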
- ANOTHER CLASSIFICATION PROBLEM...
- Assume conjunctive hypothesis...
- h = a1 /\ a2 /\ a3 .... /\ aN
-   where the ai are assumed independent (given the class)
- Q: What is the most probable class of a new instance, given values of attributes
-   P(v | a1,a2,a3 ... aN)
-   argmax over v: P(v | a1,a2,a3 ......) ~~~> JUST APPLY BAYES
- called vMAP .. LABEL max as opposed to HYPOTHESIS max
- Since the attributes are assumed independent.. Bayes simplifies to
-   argmax over v: P(a1 | v) * P(a2 | v) ... * P(aN | v) * P(v) = vNB
- - vNB = naive Bayes (used in spam filters)
- vNB = argmax over v: P(v) * [Product over i] P(ai | v)
- - predicting if a car is stolen
- COLOR  TYPE  ORIGIN  STOLEN?
- R      S     D       Y
- R      S     D       N
- R      S     D       Y
- Y      S     D       N
- Y      S     I       Y
- Y      V     I       N
- Y      V     I       Y
- Y      V     D       N
- R      V     I       N
- R      V     I       Y
- R = Red, Y = Yellow, S = Sedan, V = SUV, D = Domestic, I = Import
- V = {Yes, No} ... we're looking at a red domestic SUV
- = argmax over v: P(v) * P(Red | v) * P(Dom | v) * P(SUV | v)
- v = y: P(Y) * P(R | Y) * P(D | Y) * P(SUV | Y) = 1/2 * 3/5 * 2/5 * 2/5 = .048
- v = n: same with N: 1/2 * 2/5 * 3/5 * 3/5 = .072
- Answer: NO... it's more likely to not get stolen (ie.. ARGMAX(.048, .072) .. so pick the biggest)
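- a minimal naive Bayes sketch over that table (no smoothing, matching the hand counts):

    from collections import Counter

    rows = [("R","S","D","Y"), ("R","S","D","N"), ("R","S","D","Y"),
            ("Y","S","D","N"), ("Y","S","I","Y"), ("Y","V","I","N"),
            ("Y","V","I","Y"), ("Y","V","D","N"), ("R","V","I","N"),
            ("R","V","I","Y")]

    def v_nb(instance):
        """vNB = argmax over v of P(v) * product over i of P(ai | v)."""
        best_v, best_score = None, -1.0
        for v, n_v in Counter(r[-1] for r in rows).items():
            score = n_v / len(rows)                  # P(v)
            for i, a in enumerate(instance):         # each P(ai | v)
                score *= sum(1 for r in rows if r[-1] == v and r[i] == a) / n_v
            if score > best_score:
                best_v, best_score = v, score
        return best_v

    print(v_nb(("R", "V", "D")))  # red SUV domestic -> "N"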
- SELF DELIMITING PREFIX-TREE ENCODING
- - encodes messages
- expected message length = -[Summation]Pi*lg(Pi) (the entropy again)
- say we have .. Z1 = Hi, Z2 = Hungry?, Z3 = Thirsty?, Z4 = Bye
- this way .. if both parties agree on the encoding method you can use very few bits
- say Z1 = 0, Z2 = 1, Z3 = 10, Z4 = 11 ..
- - creates problems.. when you receive the bit "1", you have to wait for the next bit
-   (is it Z2, or the start of Z3/Z4?)
- - we don't want any codeword to be a prefix of another one (a prefix-free code) to avoid this problem
- - so.. Z1 = 0, Z2 = 10, Z3 = 110, Z4 = 111
- - so you know to stop listening when you get a 0 (or three 1s)
- - this is the ideal method to achieve the minimum number of bits
- - we need to determine which messages are used most...
- .. so if Z4 (Bye) is used most, make Z4 = 0
- - "Description Length" Li (length of message i) = -lg(Pi) Pi= probablity of messsage i
- - if the probability is low, you can use a long length,
- if its high.. use small bits
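- a minimal Python sketch of those code lengths and of decoding the prefix code above (the message probabilities are made up):

    from math import log2

    probs = {"Z1": 0.5, "Z2": 0.25, "Z3": 0.125, "Z4": 0.125}
    for msg, p in probs.items():
        print(msg, "ideal length =", -log2(p))   # 1, 2, 3, 3 bits

    code = {"0": "Z1", "10": "Z2", "110": "Z3", "111": "Z4"}

    def decode(bits):
        out, buf = [], ""
        for b in bits:         # since no codeword is a prefix of another,
            buf += b           # we can emit as soon as the buffer matches
            if buf in code:
                out.append(code[buf])
                buf = ""
        return out

    print(decode("0101100"))   # ['Z1', 'Z2', 'Z3', 'Z1']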
- hMAP = argmax over h in H of P(D | h) * P(h)
- ... we use this cus we know the probability of each hypothesis
-     (take -lg and "most probable" becomes "shortest description length")
- NEURAL NET:
- (this is our last topic)
- *Basically a bunch of Perceptrons*
- They're trying to simulate how a brain actually works
- Perceptron works well as long as the data is linearly separable
- - ie.. 1 or 0 as the response, and inputs x0,x1,x2 each with their own weight
- PERCEPTRON RULE: Wi <- Wi + Y*(t - o)*xi
- where Y is the learning speed, t the target, o the output
- well.. what happens when data isn't linearly separable?
- - the line is always moving.. (if data is non linear) it doesn't really converge to a correct
-   weighting
- Well how to solve this?
- - Calculate an Error per particular set of W's (weights)
- E(w>) = 1/2 * [Summation over d in D](td - od)^2
- so we want to find the w> where there's a global minimum for errors (the error surface is a
-   paraboloid.. note the 3d parabola)
- - so we travel down along that until we get the lowest error (hill climbing, but downhill)
- - this is called the "GRADIENT DESCENT ALGORITHM"
- *we need to find the gradient vector of E(w>)
-   which is ▽E(w>) = [dE/dw1  dE/dw2  ...  dE/dwn]
- so..
-   w> <- w> - Y * ▽E(w>)
- ---- DELTA RULE *using gradient descent*
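- a minimal Python sketch of gradient descent with that delta rule on a linear unit (the toy data, learning rate, and epoch count are made up):

    def delta_rule(examples, lr=0.05, epochs=200):
        d = len(examples[0][0])
        w = [0.0] * d
        for _ in range(epochs):
            grad = [0.0] * d
            for x, t in examples:
                o = sum(wi * xi for wi, xi in zip(w, x))             # linear output
                grad = [g - (t - o) * xi for g, xi in zip(grad, x)]  # dE/dwi
            w = [wi - lr * g for wi, g in zip(w, grad)]              # w <- w - Y*▽E(w)
        return w

    # learn t = 2*x1 + 1, using augmented inputs <x0 = 1, x1>
    data = [((1, 0), 1), ((1, 1), 3), ((1, 2), 5)]
    print(delta_rule(data))   # approaches [1.0, 2.0]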
- But.. We have a problem.. if we actually wanna classify some weirdly separable data
-   we can't just have lines (the ▽E(w>) method will still just give a bunch of lines)
- How to solve THAT problem?
- - remember the threshold unit (the steppy looking wave)
- - we can't use that.. it's not a continuous (differentiable) function
- Well.. we have a SIGMOID FUNCTION
- *****S(y) = 1/(1+ e^-y)*******
- .. we can control the stiffness of the function of it to get it very close to a step function
- Sigmoid does some weird shit when you take the derivative
- dS(y)/dy = d/dy(1/(1+e^-y))
- dS(y)/dy = d/dy(1 + e^-y)^-1
- = -(1 + e^-y)^-2 * (e^-y)
- = e^-y / (1 + e^-y)^2
- = 1/(1+e^-y) * ((1 + e^-y - 1) / (1+e^-y))
- = S(y) * (1 - S(y)) <--- so the derivative is really easy to calculate
- so .. ex... *remember.. x0 is always 1*
- w0 = -30, w1 = 20, w2 = 20
- x1 x2 w>*x> S(w>*x>)
- 0 0 -30 0
- 0 1 -10 0
- 1 0 -10 0
- 1 1 10 1
- note.. S(w>*x>) is effectively an AND function
- ex 2.
- w0 = 10, w1 = -20, w2 = -20
- x1 x2 w>*x> S(w>*x>)
- 0 0 10 1
- 0 1 -10 0
- 1 0 -10 0
- 1 1 -30 0
- this is effectively a NOR function, which makes it easy to build a simple OR
- if you flip all the signs (w0 = -10, w1 = w2 = 20).. you get an OR function btw
- ex 3. XNOR (so.. note, XNOR is an OR of an AND and a NOR)
- x1 x2 XOR XNOR AND NOR
- 0 0 0 1 0 1
- 0 1 1 0 0 0
- 1 0 1 0 0 0
- 1 1 0 1 1 0
- AND w0 = -30, w1 = 20, w2 = 20
- NOR w0 = 10, w1 = -20, w2 = -20
- OR unit: feed the AND output in as x1 and the NOR output in as x2, with w0 = -10, w1 = 20, w2 = 20
- to get an XNOR which will work
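- a minimal Python sketch of that XNOR network, using the weights above:

    from math import exp

    def sigmoid(y):
        return 1 / (1 + exp(-y))

    def unit(w0, w1, w2, x1, x2):
        return sigmoid(w0 + w1 * x1 + w2 * x2)   # x0 is always 1

    def xnor(x1, x2):
        a = unit(-30, 20, 20, x1, x2)    # AND
        n = unit(10, -20, -20, x1, x2)   # NOR
        return unit(-10, 20, 20, a, n)   # OR of (AND, NOR) = XNOR

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, round(xnor(x1, x2)))   # 1, 0, 0, 1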
- More NN
- - first is the input layer, then the next layer (all fully connected) and you keep stacking layers
- - each node does a weighted sum fed through a Sigmoid.. and the last layer is the output layer
- - layers in between the input and output layer are ''hidden layers''
- ex. Finding a cat / dog / horse in a picture
- - each pixel is fed into the input layer and flows through to the output layer
- - the output layer spits out 0-1 for each node (.8, .1, .1) as a vector for the 3 classes
- - so since cat would be .8 .. odds are there's a cat in the picture
- - BUT WAIT! there's a lot of pixels
- - ya gotta teach the NN to compress the picture
- - the first layer has 1 million nodes
- - next layer has say 100k nodes
- - and the next has 10k
- - and then it starts to go back up
- - so basically.. it squeezes the image down, and then expands it back until it gets good at
-   compressing in a relatively lossless way
- - So then you cut that network in half and just use the compression half to feed the small
-   representation into the cat / dog / horse NN
- FEED FORWARD BACK PROPAGATION
- x --w1-->[SIGMA]-- x*w1 = n1--->[sigmoid]- S(n1) = y --->[Sigma]---y*w2 = n2 --->[sigmoid]-S(n2)->O(x)
- where O(x) = z<--- prediction, and t is target value
- E = 1/2 * (z-t)^2 <-------- error value so ya gotta minimize this
- ... so now you have this error value, you back propagate
- now you need to measure the rate of change of E with w2 (since we touch w2 first in back prop)
- d(E)/d(w2) .. but you can't directly measure the change of E with respect to w2
- .. but you can with n2
- .. d(n2)/d(w2) .. if you change w2.. how much will n2 change
- ... sooooo .. d(E)/d(w2) = d(E)/d(Sigmoid(n2)) * d(Sigmoid(n2))/d(n2) * d(n2)/d(w2)
-     d(E)/d(w2) = (z - t) * Sigmoid(n2)*(1 - Sigmoid(n2)) * Sigmoid(n1)
- so now.. instead of crazy computations.. we just have some simple math
-   to get the error change with respect to changing w2
- ..now for w1 ..
- d(E)/d(w1) = d(E)/d(z) * d(z)/d(n2) * d(n2)/d(y) * d(y)/d(n1) * d(n1)/d(w1)
-               known        known       = w2      = Sigmoid(n1)*    = x
-                                                    (1 - Sigmoid(n1))
-   "known" = already calculated in d(E)/d(w2).. the last three are the only new work
- as you add layers / more nodes in each layer you start needing to
- take derivatives of vectors with respect to the error value.. but it's the same concept
- NN ALGORITHM:::::
- 1. Init WEIGHTS to random matrix WEIGHTS = [w1> w2> ... wN>]
- (this represents the whole NN's weights)
- 2. DO until Learned
- 2.1 -> Feed Forward
- 2.2 -> Calculate E (1/2 * (z-t)^2)
- 2.3 -> Back Prop E and Adjust WEIGHTS
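- a minimal Python sketch of that loop for the 1-1-1 chain above (x -> w1 -> sigmoid -> w2 -> sigmoid -> z); the input, target, learning rate, and iteration count are made up:

    from math import exp

    def S(y):
        return 1 / (1 + exp(-y))

    def train_step(w1, w2, x, t, lr=0.5):
        # 2.1 feed forward
        n1 = x * w1; y = S(n1)
        n2 = y * w2; z = S(n2)
        # 2.2 / 2.3 back prop E using the derivatives derived above
        dz = (z - t) * z * (1 - z)            # (z-t) * S(n2)*(1-S(n2))
        dw2 = dz * y                          # d(E)/d(w2)
        dw1 = dz * w2 * y * (1 - y) * x       # d(E)/d(w1)
        return w1 - lr * dw1, w2 - lr * dw2

    w1, w2 = 0.5, -0.5
    for _ in range(1000):
        w1, w2 = train_step(w1, w2, x=1.0, t=0.9)
    print(S(S(1.0 * w1) * w2))   # output drifts toward the target 0.9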
- btw.. no one implements their own NN unless they're just trying to be cool.. just use TensorFlow
- BUT WAIT... What the fuck does "Learned" mean.. when do you stop?
- - Now remember Kids! split your data into 60% Training, 20% Validation, 20% Test
- - and you know when to stop based on the 20% validation
- ERROR
- |* # Evalidation
- |* # #
- | * # #
- | * #<------- d(Evalidation)/d(Epoch) = 0 ... so stop at EPOCH val here
- | * Etrain
- -------------*-------EPOCH
- so.. once your Evalidation is at a minimum.. record the Epoch where it occurs
- ... then retrain it until that Epoch
- ... Then once its done that.. test it on the 20% test.. and boom you have your
- learned AI (% of correct in 20% test data is how good it is)...
- oh and if your test data % sucks ass you gotta rethink your model
- Support Vector Machine (SVM)
- - originally made by some Russian dude (Vapnik) and just kinda jacked to use like NNs
- - Suppose you have a bunch of linearly separable data
- - and there are an infinite number of lines that will work.. perceptron will just find one of them
- BUT I WANT THE BEST ONE!
- - you need to choose a line that will play nicely with potentially new data
- - and new data that fits a certain classification will be near data of that same
-   classification (like k-Nearest Neighbor)
- - you need to have a large margin between the line of separation and the data,
-   that's the ideal method
- - so effectively, on top of looking at error, you look for the line with the largest margin
- FINDING THAT LINE
- - you need 2 points on one side (to make the first line)
-   and a 3rd point on the other side (of the other data) to make the parallel line
- - and the middle of the 2 parallel lines is the good line
- .. so Top line == Wx + b = 1, middle (good line) == Wx + b = 0, bottom line == Wx + b = -1
- so to get the best .. you need ... width = (W> / ||W>||) . (x2 - x1)
- so you wanna find the W that maximizes that width (. = dot product btw)
- .. which is the same as minimizing ||W>||^2
- ..... SO! SVM: minimize(||W>||^2) ..
-   subject to.. Yi(<w,xi> + b) >= 1
- .. btw no one implements this .. to optimize it you need to use a modified Lagrangian.. "KKT"
-   but yea just use a library
- SVM Classification: F(x) = sign(<w,x> + b)
- Training: minimize 1/2*||w||^2
-   over w, b <-- trying to get the w and b to use in the classification
-   s.t. Yi(<w, xi> + b) >= 1
- USING KKT -> Classification now equals ===== f(x) = sign([Summation i]ai*yi*<x,xi> + b)
-   xi = training example... x is the new instance (dot product them, then multiply by yi and ai, and add b)
-   a (alpha) is the Lagrange multiplier ... and remember...
-   there could be millions of xi's .. but... how is this better? .. a lot of the ai's are 0
-   meaning the calculation is pretty quick
-   only the points on the gutter (the margin lines) have non-zero alphas
- - those points are called "support vectors"
-   hence the name... SUPPORT VECTOR MACHINE
- KERNAL TRICKS
- - Very powerful and efficient compared to NN
- .. suppose you wanna map each point through some [Phi] into a higher-dimensional space
-   computing [Phi](a) . [Phi](b) explicitly <-- Disgustingly slow
- you can classify potentially non-linearly-separable data using a transformation going from say R2 --> R3
- - just gotta figure out how the dot product behaves in the future (feature) space
- - once you do that .. you don't even need to calculate what [Phi](x) exactly is .. just dot product that shit
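- a minimal Python sketch of that kernel trick: for [Phi](x) = (x1^2, sqrt(2)*x1*x2, x2^2), the kernel K(a,b) = (a . b)^2 equals [Phi](a) . [Phi](b) without ever computing [Phi]:

    from math import sqrt

    def phi(x):                          # explicit map R^2 -> R^3
        return (x[0]**2, sqrt(2) * x[0] * x[1], x[1]**2)

    def K(a, b):                         # kernel: dot product in the future space
        return (a[0]*b[0] + a[1]*b[1]) ** 2

    a, b = (1.0, 2.0), (3.0, 0.5)
    explicit = sum(p * q for p, q in zip(phi(a), phi(b)))
    print(explicit, K(a, b))             # same number (16.0), computed two ways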