- # Machine Learning Quick Notes.
- `One plus 2 is 4 minus 1 that's 3 quick maths`
- # Conjugate Distributions:
- ### Beta Distribution:
- A family of continuous distributions, parameterised by two shape parameters, `alpha` and `beta`. These appear as exponents of the random variable and are the main factors determining the shape of the distribution.
- The Beta distribution acts as the **conjugate prior** for the following (see the sketch after this list):
- 1. Bernoulli Distribution
- 2. Binomial Distribution
- 3. Geometric Distribution.
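- A minimal sketch of the resulting conjugate update (with assumed pseudo-counts): observing data just adds counts to the prior.
```python
# Minimal sketch: Beta-Binomial conjugate update.
# Prior Beta(alpha, beta); after observing k successes in n trials,
# the posterior is Beta(alpha + k, beta + n - k).
alpha, beta = 2.0, 2.0        # assumed prior pseudo-counts
n, k = 10, 7                  # assumed data: 7 successes in 10 trials
alpha_post, beta_post = alpha + k, beta + (n - k)
print(alpha_post, beta_post)  # -> 9.0 5.0
```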
- ### Dirichlet Distributions:
- The multivariate generalisation of the Beta distribution, parameterised by a vector `alpha`.
- Dirichlet distributions act as the **conjugate prior** for the following (sketch after the list):
- 1. Categorical Distributions
- 2. Multinomial distributions
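- A minimal sketch of the conjugate update (with an assumed symmetric prior): again, observed counts are simply added.
```python
import numpy as np

# Minimal sketch: Dirichlet-Multinomial conjugate update.
# Prior Dirichlet(alpha); after observing per-category counts c,
# the posterior is Dirichlet(alpha + c).
alpha = np.array([1.0, 1.0, 1.0])  # assumed symmetric prior, 3 categories
counts = np.array([5, 2, 3])       # assumed observed counts
alpha_post = alpha + counts
print(alpha_post)                  # -> [6. 3. 4.]
```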
- # Normal / Gaussian Distributions:
- 1. Gaussians are great due to the central limit theorem (sums of many independent random variables tend towards a Gaussian)
- 2. Gaussians are **self-conjugate**
- 3. The conjugate prior for the `mean` is a `Gaussian` (with the variance known)
- 4. The conjugate prior for the `variance/covariance` is the `Inverse-Wishart`, a distribution defined over real-valued (positive-definite) matrices; in the scalar-variance case this reduces to the `Inverse-Gamma`. A sketch of the conjugate mean update follows this list.
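- A minimal sketch of the self-conjugate mean update (assuming a known noise variance and toy numbers):
```python
import numpy as np

# Minimal sketch: conjugate update for a Gaussian mean with known variance.
# Prior: mu ~ N(mu0, tau0^2); likelihood: y_i ~ N(mu, sigma^2).
# Precisions add; the posterior mean is a precision-weighted average.
mu0, tau0 = 0.0, 1.0             # assumed prior mean and std
sigma = 0.5                      # assumed known observation noise std
y = np.array([0.9, 1.1, 1.0])    # toy observations

n = len(y)
prec_post = 1 / tau0**2 + n / sigma**2
mu_post = (mu0 / tau0**2 + y.sum() / sigma**2) / prec_post
print(mu_post, (1 / prec_post) ** 0.5)   # posterior mean and std
```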
- # Exponential Family:
- The exponential family defines a _natural_ parametrisation of distributions that we can work with; a sketch using the Bernoulli as an example follows.
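- As a minimal sketch, here is the Bernoulli written in its natural (exponential-family) parametrisation `p(x|eta) = h(x) exp(eta*T(x) - A(eta))`:
```python
import numpy as np

# Minimal sketch: the Bernoulli in exponential-family form, with
#   natural parameter eta = log(p / (1 - p)) (the log-odds),
#   sufficient statistic T(x) = x, base measure h(x) = 1,
#   and log-partition A(eta) = log(1 + e^eta).
p = 0.3
eta = np.log(p / (1 - p))
A = np.log1p(np.exp(eta))
for x in (0, 1):
    print(np.exp(eta * x - A))   # recovers p(x): 0.7 and 0.3
```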
- # Linear Regression
- So, linear regression is a method of finding the relation between a set of data points.
- Suppose you have the model `Y = w X + e`, where `e` is noise. Linear regression is used to
- * Find the values of `w` in such a way that the model matches the data well
- * Predict values of `Y` based on the `w` found above.
- The steps for doing this involve Bayes' theorem:
- 1. Create the likelihood function p(Y|X,w).
- 2. Place a prior belief over the values of w -> p(w)
- 3. Obtain a posterior belief over the values of w, so you can find a closer approximation that matches the data -> p(w|Y,X)
- > p(w|Y,X) ∝ p(Y|X,w) p(w)
- Use conjugate priors to ensure that the posterior distribution has a tractable, same-family form.
- This system, however, isn't the best for prediction of values, especially as the dimensionality of w, X, and Y increases. A runnable sketch of the closed-form posterior follows.
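- A minimal sketch (assuming a zero-mean isotropic Gaussian prior with precision `alpha` and Gaussian noise with precision `beta`), where the posterior over `w` has a closed-form mean and covariance:
```python
import numpy as np

# Minimal sketch: Bayesian linear regression with prior p(w) = N(0, alpha^-1 I)
# and Gaussian noise of precision beta. The posterior p(w | Y, X) is Gaussian:
#   S_N = (alpha*I + beta * X^T X)^-1   (posterior covariance)
#   m_N = beta * S_N @ X^T @ y          (posterior mean)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # toy design matrix
w_true = np.array([1.5, -0.7])               # assumed true weights
y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy targets

alpha, beta = 1.0, 100.0                     # assumed precisions
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y
print(m_N)                                   # posterior mean, near w_true
```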
- ## Dual Linear Regression
- Dual linear regression rewrites the model in terms of the data points themselves (a dual representation), allowing better predictions in higher dimensionalities.
- This is done using Kernel Regression / kernel methods.
- ## Kernel Regression
- Basically the same as Dual Linear Regression / Linear Regression, but it includes **mapping the data from its original N dimensions into a feature space where the relationship is easier to model, using a function `phi(.)`**; the kernel `k(x, x') = phi(x)^T phi(x')` lets you work in that space without ever computing `phi` explicitly.
- The steps for kernel regression / dual linear regression are:
- 1. Formulate the posterior
- 2. Find the stationary point of the posterior
- 3. Rewrite the parameters as a linear combination of the data points (the dual representation, `w = phi(X)^T a`)
- 4. Predict using the resulting kernel regression formula
- The general kernel regression formula is:
- > y(x*) = k(x*, X) (K + λI)^-1 t
- Here, `x*` is the new point being predicted, `k(x*, X)` is the vector of kernel values between `x*` and every training point, and `K` is the Gram matrix of kernel values between _all_ pairs of training points. `λ` acts as the noise parameter and `t` is the vector of target values.
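- A minimal runnable sketch of this formula (assuming an RBF kernel and toy `sin` data):
```python
import numpy as np

# Minimal sketch of kernel regression, implementing
# y(x*) = k(x*, X) (K + lam*I)^-1 t from above, with an assumed RBF kernel.
def rbf(A, B, length=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    return np.exp(-d2 / (2 * length ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))                # training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)     # noisy targets

lam = 0.1                                           # noise parameter (assumed)
K = rbf(X, X)                                       # Gram matrix of all data
a = np.linalg.solve(K + lam * np.eye(len(X)), t)    # (K + lam*I)^-1 t

x_star = np.array([[0.5]])                          # new point to predict
y_star = rbf(x_star, X) @ a                         # weighted by k(x*, X)
print(y_star[0], np.sin(0.5))                       # prediction vs. truth
```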
- # Blurb on Unsupervised Learning
- So basically, when you have a function `y = f(x) + noise`, with normal linear regression you get the parameters of the function `f` from the pairs of values `(y, x)`.
- This is called supervised learning.
- Now for unsupervised learning, you're supposed to infer the parameters of `f` and the latent variables `x` just from looking at the observed values `y`.
- Normally this would involve marginalising out the values of both `f` and `x`, thereby obtaining a likelihood estimate for the formula / parameters.
- However, marginalising out both `f` and `x` is computationally intractable.
- So instead we use the method of `maximum likelihood type II`, which is essentially a compromise over the fact that you can't marginalise out both of these things.
- What we do instead is marginalise out the values with the higher dimensionality, which in our case are the latent variables `x`. After this we obtain a formula for p(y | mu, W, sigma^2).
- The next step in ML type II is to maximise this function / formula with respect to the parameters.
- This gives a `point estimate` of the parameters of the function, i.e. the most optimal values of the parameters given your data `y`.
- In simple words
- > you've been given only y values
- >
- > you know there are some values `x` that produced it after some function `f` was applied
- >
- > you take all the possible values of `x`, push them through the formula, and obtain a relation between `f`'s parameters and the `y` values.
- >
- > now you've got the probability function that relates these `y` values to `f`'s parameter values.
- >
- > to find a final solution, take the maximum likelihood of this probability function, which finds the point estimate of the parameters of `f`
- >
- > this means that you now have the best estimate of these parameters given the values for `y` and the assumed values for `x`.
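- A minimal runnable sketch of ML type II for a linear-Gaussian latent model (probabilistic PCA, assuming Gaussian latents and noise): marginalising out `x` gives `y ~ N(mu, W W^T + sigma^2 I)`, and maximising that marginal likelihood has a closed form.
```python
import numpy as np

# Minimal sketch: type II maximum likelihood for probabilistic PCA.
# Marginalising out the latents x in y = W x + mu + noise gives
# y ~ N(mu, C) with C = W W^T + sigma^2 I; we maximise this marginal
# likelihood. The closed-form solution (Tipping & Bishop) takes W from
# the top eigenvectors of the sample covariance and sigma^2 from the
# average of the discarded eigenvalues.
rng = np.random.default_rng(2)
W_true = rng.normal(size=(5, 2))                  # assumed true loadings
x = rng.normal(size=(500, 2))                     # latents (never observed)
Y = x @ W_true.T + 0.1 * rng.normal(size=(500, 5))

mu = Y.mean(axis=0)                               # ML estimate of the mean
S = np.cov(Y.T)                                   # sample covariance
vals, vecs = np.linalg.eigh(S)                    # eigenvalues, ascending
sigma2 = vals[:-2].mean()                         # noise from discarded dims
W_ml = vecs[:, -2:] * np.sqrt(vals[-2:] - sigma2) # point estimate of W
print(round(sigma2, 4))                           # close to the true 0.01
```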
- # Random Points
- * The **assumption over the prior in linear regression** is a **zero-mean isotropic Gaussian**, i.e. a normal distribution `p(w) = N(w | 0, alpha^-1 I)`, where `alpha` is a single precision parameter shared across all dimensions.
- * **Parameter distributions** refer to the method of placing Gaussian priors over parameters and inferring their final values from the data. This differs from individually tuning parameters to fit the given data when building a model, which would cause _overfitting_.
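- A minimal sketch (with an assumed `alpha`) of drawing weights from the zero-mean isotropic Gaussian prior above:
```python
import numpy as np

# Minimal sketch: drawing w from the zero-mean isotropic Gaussian prior
# p(w) = N(w | 0, alpha^-1 I) described above.
alpha = 2.0                                  # assumed precision parameter
rng = np.random.default_rng(3)
w = rng.normal(0.0, alpha ** -0.5, size=4)   # per-dim std = sqrt(1/alpha)
print(w)
```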