- # Machine Learning Quick Notes.
- `One plus 2 is 4 minus 1 that's 3 quick maths`
- # Conjugate Distributions:
- ### Beta Distribution:
- A family of continuous distributions, parameterised by two shape parameters, `alpha` and `beta`. These appear as exponents of the random variable and are the main factors determining the shape of the distribution.
- The Beta distribution acts as the **conjugate prior** for the following (see the sketch after this list):
- 1. Bernoulli Distribution
- 2. Binomial Distribution
- 3. Geometric Distribution.
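- A minimal sketch of the resulting conjugate update (with assumed pseudo-counts): observing data just adds counts to the prior.
```python
# Minimal sketch: Beta-Binomial conjugate update.
# Prior Beta(alpha, beta); after observing k successes in n trials,
# the posterior is Beta(alpha + k, beta + n - k).
alpha, beta = 2.0, 2.0        # assumed prior pseudo-counts
n, k = 10, 7                  # assumed data: 7 successes in 10 trials
alpha_post, beta_post = alpha + k, beta + (n - k)
print(alpha_post, beta_post)  # -> 9.0 5.0
```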
- ### Dirichlet Distributions:
- The multivariate generalisation of the Beta distribution, parameterised by a vector `alpha`.
- Dirichlet distributions act as the **conjugate prior** for the following (sketch after the list):
- 1. Categorical Distributions
- 2. Multinomial distributions
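- A minimal sketch of the conjugate update (with an assumed symmetric prior): again, observed counts are simply added.
```python
import numpy as np

# Minimal sketch: Dirichlet-Multinomial conjugate update.
# Prior Dirichlet(alpha); after observing per-category counts c,
# the posterior is Dirichlet(alpha + c).
alpha = np.array([1.0, 1.0, 1.0])  # assumed symmetric prior, 3 categories
counts = np.array([5, 2, 3])       # assumed observed counts
alpha_post = alpha + counts
print(alpha_post)                  # -> [6. 3. 4.]
```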
- # Normal / Gaussian Distributions:
- 1. Gaussians are great due to the central limit theorem (sums of many independent random variables tend towards a Gaussian)
- 2. Gaussians are **self-conjugate**
- 3. The conjugate prior for the `mean` is a `Gaussian` (with the variance known)
- 4. The conjugate prior for the `variance/covariance` is the `Inverse-Wishart`, a distribution defined over real-valued (positive-definite) matrices; in the scalar-variance case this reduces to the `Inverse-Gamma`. A sketch of the conjugate mean update follows this list.
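- A minimal sketch of the self-conjugate mean update (assuming a known noise variance and toy numbers):
```python
import numpy as np

# Minimal sketch: conjugate update for a Gaussian mean with known variance.
# Prior: mu ~ N(mu0, tau0^2); likelihood: y_i ~ N(mu, sigma^2).
# Precisions add; the posterior mean is a precision-weighted average.
mu0, tau0 = 0.0, 1.0             # assumed prior mean and std
sigma = 0.5                      # assumed known observation noise std
y = np.array([0.9, 1.1, 1.0])    # toy observations

n = len(y)
prec_post = 1 / tau0**2 + n / sigma**2
mu_post = (mu0 / tau0**2 + y.sum() / sigma**2) / prec_post
print(mu_post, (1 / prec_post) ** 0.5)   # posterior mean and std
```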
- # Exponential Family:
- The exponential family defines a _natural_ parametrisation of distributions that we can work with; a sketch using the Bernoulli as an example follows.
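- As a minimal sketch, here is the Bernoulli written in its natural (exponential-family) parametrisation `p(x|eta) = h(x) exp(eta*T(x) - A(eta))`:
```python
import numpy as np

# Minimal sketch: the Bernoulli in exponential-family form, with
#   natural parameter eta = log(p / (1 - p)) (the log-odds),
#   sufficient statistic T(x) = x, base measure h(x) = 1,
#   and log-partition A(eta) = log(1 + e^eta).
p = 0.3
eta = np.log(p / (1 - p))
A = np.log1p(np.exp(eta))
for x in (0, 1):
    print(np.exp(eta * x - A))   # recovers p(x): 0.7 and 0.3
```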
- # Linear Regression
- So, linear regression is a method of finding the relation between a set of data points.
- Suppose you have the model `Y = w X + e`, where `e` is noise. Linear regression is used to
- * Find the values of `w` in such a way that the model matches the data well
- * Predict values of `Y` based on the `w` found above.
- The steps for doing this involve Bayes' theorem:
- 1. Create the likelihood function p(Y|X,w).
- 2. Place a prior belief over the values of w -> p(w)
- 3. Obtain a posterior belief over the values of w, so you can find a closer approximation that matches the data -> p(w|Y,X)
- > p(w|Y,X) ∝ p(Y|X,w) p(w)
- Use conjugate priors to ensure that the posterior distribution has a tractable, same-family form.
- This system, however, isn't the best for prediction of values, especially as the dimensionality of w, X, and Y increases. A runnable sketch of the closed-form posterior follows.
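- A minimal sketch (assuming a zero-mean isotropic Gaussian prior with precision `alpha` and Gaussian noise with precision `beta`), where the posterior over `w` has a closed-form mean and covariance:
```python
import numpy as np

# Minimal sketch: Bayesian linear regression with prior p(w) = N(0, alpha^-1 I)
# and Gaussian noise of precision beta. The posterior p(w | Y, X) is Gaussian:
#   S_N = (alpha*I + beta * X^T X)^-1   (posterior covariance)
#   m_N = beta * S_N @ X^T @ y          (posterior mean)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # toy design matrix
w_true = np.array([1.5, -0.7])               # assumed true weights
y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy targets

alpha, beta = 1.0, 100.0                     # assumed precisions
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y
print(m_N)                                   # posterior mean, near w_true
```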
- ## Dual Linear Regression
- Dual linear regression rewrites the model in terms of the data points themselves (a dual representation), allowing better predictions in higher dimensionalities.
- This is done using Kernel Regression / kernel methods.
- ## Kernel Regression
- Basically the same as Dual Linear Regression / Linear Regression, but it includes **mapping the data from its original N dimensions into a feature space where the relationship is easier to model, using a function `phi(.)`**; the kernel `k(x, x') = phi(x)^T phi(x')` lets you work in that space without ever computing `phi` explicitly.
- The steps for kernel regression / dual linear regression are:
- 1. Formulate the posterior
- 2. Find the stationary point of the posterior
- 3. Rewrite the parameters as a linear combination of the data points (the dual representation, `w = phi(X)^T a`)
- 4. Predict using the resulting kernel regression formula
- The general kernel regression formula is:
- > y(x*) = k(x*, X) (K + λI)^-1 t
- Here, `x*` is the new point being predicted, `k(x*, X)` is the vector of kernel values between `x*` and every training point, and `K` is the Gram matrix of kernel values between _all_ pairs of training points. `λ` acts as the noise parameter and `t` is the vector of target values.
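- A minimal runnable sketch of this formula (assuming an RBF kernel and toy `sin` data):
```python
import numpy as np

# Minimal sketch of kernel regression, implementing
# y(x*) = k(x*, X) (K + lam*I)^-1 t from above, with an assumed RBF kernel.
def rbf(A, B, length=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    return np.exp(-d2 / (2 * length ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))                # training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)     # noisy targets

lam = 0.1                                           # noise parameter (assumed)
K = rbf(X, X)                                       # Gram matrix of all data
a = np.linalg.solve(K + lam * np.eye(len(X)), t)    # (K + lam*I)^-1 t

x_star = np.array([[0.5]])                          # new point to predict
y_star = rbf(x_star, X) @ a                         # weighted by k(x*, X)
print(y_star[0], np.sin(0.5))                       # prediction vs. truth
```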
- # Blurb on Unsupervised Learning
- So basically, when you have a function `y = f(x) + noise`, with normal linear regression you get the parameters of the function `f` from the pairs of values `(y, x)`.
- This is called supervised learning.
- Now for unsupervised learning, you're supposed to infer the parameters of `f` and the latent variables `x` just from looking at the observed values `y`.
- Normally this would involve marginalising out the values of both `f` and `x`, thereby obtaining a likelihood estimate for the formula / parameters.
- However, marginalising out both `f` and `x` is computationally intractable.
- So instead we use the method of `maximum likelihood type II`, which is essentially a compromise over the fact that you can't marginalise out both of these things.
- What we do instead is marginalise out the values with the higher dimensionality, which in our case are the latent variables `x`. After this we obtain a formula for p(y | mu, W, sigma^2).
- The next step in ML type II is to maximise this function / formula with respect to the parameters.
- This gives a `point estimate` of the parameters of the function, i.e. the most optimal values of the parameters given your data `y`.
- In simple words
- > you've been given only y values
- >
- > you know there are some values `x` that produced it after some function `f` was applied
- >
- > you take all the possible values of `x`, push them through the formula, and obtain a relation between `f`'s parameters and the `y` values.
- >
- > now you've got the probability function that relates these `y` values to `f`'s parameter values.
- >
- > to find a final solution, take the maximum likelihood of this probability function, which finds the point estimate of the parameters of `f`
- >
- > this means that you now have the best estimate of these parameters given the values for `y` and the assumed values for `x`.
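- A minimal runnable sketch of ML type II for a linear-Gaussian latent model (probabilistic PCA, assuming Gaussian latents and noise): marginalising out `x` gives `y ~ N(mu, W W^T + sigma^2 I)`, and maximising that marginal likelihood has a closed form.
```python
import numpy as np

# Minimal sketch: type II maximum likelihood for probabilistic PCA.
# Marginalising out the latents x in y = W x + mu + noise gives
# y ~ N(mu, C) with C = W W^T + sigma^2 I; we maximise this marginal
# likelihood. The closed-form solution (Tipping & Bishop) takes W from
# the top eigenvectors of the sample covariance and sigma^2 from the
# average of the discarded eigenvalues.
rng = np.random.default_rng(2)
W_true = rng.normal(size=(5, 2))                  # assumed true loadings
x = rng.normal(size=(500, 2))                     # latents (never observed)
Y = x @ W_true.T + 0.1 * rng.normal(size=(500, 5))

mu = Y.mean(axis=0)                               # ML estimate of the mean
S = np.cov(Y.T)                                   # sample covariance
vals, vecs = np.linalg.eigh(S)                    # eigenvalues, ascending
sigma2 = vals[:-2].mean()                         # noise from discarded dims
W_ml = vecs[:, -2:] * np.sqrt(vals[-2:] - sigma2) # point estimate of W
print(round(sigma2, 4))                           # close to the true 0.01
```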
- # Random Points
- * The **assumption over the prior in linear regression** is a **zero-mean isotropic Gaussian**, i.e. a normal distribution `p(w) = N(w | 0, alpha^-1 I)`, where `alpha` is a single precision parameter shared across all dimensions.
- * **Parameter distributions** refer to the method of placing Gaussian priors over parameters and inferring their final values from the data. This differs from individually tuning parameters to fit the given data when building a model, which would cause _overfitting_.
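- A minimal sketch (with an assumed `alpha`) of drawing weights from the zero-mean isotropic Gaussian prior above:
```python
import numpy as np

# Minimal sketch: drawing w from the zero-mean isotropic Gaussian prior
# p(w) = N(w | 0, alpha^-1 I) described above.
alpha = 2.0                                  # assumed precision parameter
rng = np.random.default_rng(3)
w = rng.normal(0.0, alpha ** -0.5, size=4)   # per-dim std = sqrt(1/alpha)
print(w)
```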