- Neural network:
- -Multilayer perceptron, parametric model
- -Series of linear transformations by parameters, biasing by parameters, and non-linearities
- -Parameters are set in a way that optimizes the model's performance on a task
- -Final layer is just a linear transformation, possibly followed by a reversible function (e.g. softmax)
- -Means the real task of a neural network is to find a representation of the data that can be most optimally used by a linear transformation to perform a task
- -Universal approximators
- -They learn probability distributions (assign the probability that data came from the data-generating distribution)
- -MNIST example (28x28 grayscale images of handwritten digits)
- -785-dimensional vector space
- -Subset of the space: vectors in which
- -The first 784 elements, when viewed as a 28x28 image, form a handwritten digit
- -The 785th element is an integer equal to the number displayed in the image
- -Data-generating distribution: probability distribution that assigns zero probability to all vectors outside this subset, and equal probability to all elements within the subset
- -Perfectly describes the probability of the event that a vector is within this subset
- -Sampling = generating a handwritten image along with its label
- -Model assigns the probability to 785-dimensional vectors of being generated from this subset (inputs = the first 784 elements, output = the 785th element)
- -They can be optimized by maximizing the likelihood of the data
- -Which can be done by running an optimization algorithm (e.g. gradient descent) on a loss function equal to the negative log-likelihood
- -Lower the divergence between the data-generating distribution and the model's parametrized distribution
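The steps above (model assigns likelihoods, gradient descent minimizes the negative log-likelihood) can be sketched with a toy numpy logistic-regression model; all data and hyperparameters here are assumptions for illustration, not the MNIST setup from the notes:

```python
import numpy as np

# Toy illustration: logistic regression trained by gradient descent
# on the negative log-likelihood (NLL) of synthetic, linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels from a linear rule

w = np.zeros(2)  # parameters theta
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))           # model's probability of y=1
    nll = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)              # gradient of the NLL
    w -= lr * grad                             # gradient descent step

acc = np.mean((p > 0.5) == y)
print(f"final NLL={nll:.3f}, accuracy={acc:.3f}")
```

Maximizing the likelihood and minimizing the NLL are the same optimization, which is why the loss decreases as the model's distribution approaches the data-generating one.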
- Catastrophic Forgetting:
- -Biological inspiration
- -One thing humans can do: learn more than one task sequentially without getting additional neural resources
- -When trained on one task after another, the network loses nearly all performance on the first task
- -Catastrophic forgetting
- -Also happens in babies, but only selectively (cats&dogs example)
- -Does not happen in developed brains
- -Creating architectures resistant to this: continual learning
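A minimal sketch of catastrophic forgetting, assuming a linear (logistic-regression) model and two deliberately conflicting toy tasks rather than the digit tasks in these notes:

```python
import numpy as np

# Sequential training on two conflicting toy tasks: performance on the
# first task collapses after training on the second (all data assumed).
rng = np.random.default_rng(1)

def make_task(flip):
    X = rng.normal(size=(200, 2))
    y = ((X[:, 0] > 0) != flip).astype(float)
    return X, y

def train(w, X, y, steps=300, lr=0.5):
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)   # gradient descent on the NLL
    return w

def accuracy(w, X, y):
    return np.mean(((X @ w) > 0) == y)

Xa, ya = make_task(False)   # task A: label = (x0 > 0)
Xb, yb = make_task(True)    # task B: label = (x0 <= 0), conflicts with A

w = train(np.zeros(2), Xa, ya)
acc_a_before = accuracy(w, Xa, ya)   # high after training on A
w = train(w, Xb, yb)
acc_a_after = accuracy(w, Xa, ya)    # collapses after training on B
print(acc_a_before, acc_a_after)
```

Nothing in plain maximum-likelihood training anchors the weights that mattered for the first task, which is the problem continual-learning methods try to fix.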
- Variables and data:
- theta: current parameters
- A: array A
- B: array B
- X: joint A and B
- Our data-generating distribution: equal mixture of the data-generating distributions for handwritten digits 0 and 1 and handwritten digits 2 and 3
- Events:
- W: the event that the current parameters theta are optimal. (Note: dependent on theta)
- Da: A drawn from the data-gen distribution
- Db: B drawn from the data-gen distribution
- D: X drawn from the data-gen distribution
- -Since X is just joint A and B, this means that P(D)=P(Da&Db)
- Assumptions:
- Flat prior for P(W)
- -when we don't have any data (just the task description), any parameter is as likely to be optimal as any other
- -in practice: random initialization of parameters
- Da and Db are conditionally independent given W
- P(Da&Db|W)=P(D|W)=P(Da|W)*P(Db|W)
- Analysis:
- Bayes theorem: P(W|D)=P(D|W)*P(W)/P(D)
- Expand P(D): P(W|D)=P(D|W)*P(W)/P(Da&Db)
- Conditional independence: P(W|D)=P(Da|W)*P(Db|W)*P(W)/P(Da&Db)
- Rearrange multiplication order: P(W|D)=P(Da|W)*P(W)*P(Db|W)/P(Da&Db)
- P(Da|W)*P(W)=P(Da&W): P(W|D)=P(Da&W)*P(Db|W)/P(Da&Db)
- P(Da&Db)=P(Db|Da)*P(Da): P(W|D)=P(Da&W)*P(Db|W)/(P(Db|Da)*P(Da))
- P(Da&W)/P(Da)=P(W|Da): P(W|D)=P(W|Da)*P(Db|W)/P(Db|Da)
- Take log: logP(W|D)=logP(W|Da) + logP(Db|W) - logP(Db|Da)
- (Rearrange): logP(W|D)= logP(Db|W) + logP(W|Da) - logP(Db|Da)
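The derivation can be sanity-checked numerically on a tiny discrete model (toy probabilities, assumed) where W takes two values, the prior is flat, and Da, Db are conditionally independent given W:

```python
import numpy as np

# Numeric check of the identity
#   log P(W|D) = log P(Db|W) + log P(W|Da) - log P(Db|Da)
p_w = np.array([0.5, 0.5])          # flat prior over two parameter settings
p_da_w = np.array([0.8, 0.3])       # P(Da | W=w), toy numbers
p_db_w = np.array([0.6, 0.1])       # P(Db | W=w), toy numbers

p_d_w = p_da_w * p_db_w             # conditional independence: P(D|W)
p_d = np.sum(p_d_w * p_w)           # P(D) = P(Da & Db)
p_da = np.sum(p_da_w * p_w)
p_w_d = p_d_w * p_w / p_d           # posterior P(W|D)
p_w_da = p_da_w * p_w / p_da        # posterior P(W|Da)
p_db_da = np.sum(p_db_w * p_w_da)   # P(Db|Da) = sum_w P(Db|w) P(w|Da)

lhs = np.log(p_w_d)
rhs = np.log(p_db_w) + np.log(p_w_da) - np.log(p_db_da)
print(np.allclose(lhs, rhs))  # the identity holds
```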
- Analyze the three terms:
- log-likelihood of Db. We maximize this when we train our neural network on data in B.
- Probability that the current weights are optimal for both tasks if we only have access to the data from the first task
- Fixed term. Not dependent on the current weights, so non-optimizable.
- In other words, to maximize P(W|D), we have to:
- * Maximize P(Db|W)
- * Maximize P(W|Da)
- P(W|Da) is intractable. We know it exists, but we do not have access to its density function.
- Strategy: let's approximate P(W|Da) and maximize it instead.
- Laplace approximation: approximate a parametrized probability density with a Gaussian with mean=local maximum, covariance=inverse Hessian wrt parameters of negative log of the density function at that point
- (Hessian: matrix of all second derivatives)
- (Easily derived from Taylor series of a transformed density function; see Bishop 4.4)
- p(W|Da) ~= q(W|Da) = N(w; w*, H^-1) = ((|H|^0.5)/((2*pi)^(N/2))) * e^(-0.5*(w-w*)'*H*(w-w*))
- (where N is the number of parameters, |H| is the determinant of the Hessian, w* is the value of the parameters at a local maximum of p(W|Da), and H is the Hessian of -log p(W|Da) at w*)
- Call the left (normalizing) term M:
- p(W|Da) ~= q(W|Da) = M * e^(-0.5*(w-w*)'*H*(w-w*))
- log p(W|Da) ~= log M - 0.5*(w-w*)'*H*(w-w*)
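A quick numeric sketch of the Laplace approximation in one dimension, using an assumed toy density whose exact normalization is known:

```python
import numpy as np
from math import gamma

# Toy 1D Laplace approximation (all numbers assumed): approximate the
# density p(t) proportional to t^4 * e^(-2t) on t > 0, i.e. Gamma(5, 2).
a, b = 4.0, 2.0

# -log p(t) = -(a*log t - b*t) + const
# Mode: d/dt (a*log t - b*t) = a/t - b = 0  =>  t* = a/b
t_star = a / b
# Second derivative of -log p at the mode: a / t^2
H = a / t_star**2

# Laplace approximation: Gaussian with mean t* and variance H^-1
def q(t):
    return np.sqrt(H / (2 * np.pi)) * np.exp(-0.5 * H * (t - t_star) ** 2)

# Exact normalized density for comparison: Gamma(a+1, b)
def p(t):
    return b ** (a + 1) * t ** a * np.exp(-b * t) / gamma(a + 1)

print(p(t_star), q(t_star))  # nearly equal at the mode
```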
- Bayes: p(W|Da)=P(Da|W)*P(W)/P(Da)
- log: log p(W|Da)=log P(Da|W) + log P(W) - log P(Da)
- log P(Da) is not dependent on the current parameters, and since we're using a flat prior for W, neither is log P(W). This means that
- d/dtheta log P(W|Da)=d/dtheta log P(Da|W). The two functions thus have local maxima at the same values of theta. Since log P(Da|W) is just the log-likelihood of Da under the model, the model's weights after being trained on A will be a local maximum of P(W|Da) as well.
- Since d/dtheta log P(W|Da)=d/dtheta log P(Da|W), we also have d^2/dtheta^2 (-log P(W|Da)) = d^2/dtheta^2 (-log P(Da|W)). Thus,
- H_i,j=d^2/(dw_i dw_j) of -log P(Da|W)
- We now know how to find both the local maximum and the Hessian of the negative log of p(W|Da), so we can approximate it with a Gaussian.
- Problem: calculating the Hessian requires building an NxN matrix, and N can run into the millions. Computationally infeasible at scale.
- Assumption: non-diagonal elements of the Hessian are near-zero.
- -Intuitively, we're assuming that how the loss varies with any given parameter w_i does not itself vary significantly with any other parameter w_j.
- -Not a very good assumption, but it's practically necessary.
- Laplace approximation formula: log p(W|Da) ~= log M - 0.5*(w-w*)'*H*(w-w*)
- (H(w-w*))_i=H_i,i * (w-w*)_i
- Thus, (w-w*)'*H*(w-w*) = sum from i=1 to N of ((w-w*)_i * H_i,i * (w-w*)_i)
- (w-w*)_i=(w_i-w*_i): (w-w*)'*H*(w-w*) = sum from i=1 to N of (H_i,i * (w_i-w*_i)^2)
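The collapse of the quadratic form under the diagonal assumption can be checked numerically (random toy values, assumed):

```python
import numpy as np

# With a diagonal Hessian, the full quadratic form equals the
# per-parameter sum used in the derivation.
rng = np.random.default_rng(2)
N = 5
h_diag = rng.uniform(0.1, 2.0, size=N)   # H_i,i (diagonal entries)
H = np.diag(h_diag)
w = rng.normal(size=N)
w_star = rng.normal(size=N)

full = (w - w_star) @ H @ (w - w_star)        # (w-w*)' H (w-w*)
summed = np.sum(h_diag * (w - w_star) ** 2)   # sum_i H_i,i (w_i-w*_i)^2
print(np.isclose(full, summed))
```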
- Thus, the Laplace approximation of log P(W|Da) becomes:
- log P(W|Da) ~= log M - (1/2) sum from i=1 to N of H_i,i * (w_i-w*_i)^2
- Original formula, with approximation:
- logp(W|D)~=logp(Db|W) + logM - (1/2) sum from i=1 to N of H_i,i*(w_i-w*_i)^2 - logp(Db|Da)
- logM and logp(Db|Da) are not dependent on theta.
- We can maximize the expression by minimizing:
- J(w)=-logp(Db|W) + (1/2) sum from i=1 to N of H_i,i*(w_i-w*_i)^2
- This is a loss function! First term = negative log-likelihood. Second term = a quadratic regularization term that regularizes parameters according to how important they were to the first task! This is elastic weight consolidation.
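A minimal end-to-end sketch of this loss, assuming a logistic-regression toy model, synthetic directly conflicting tasks, and an extra weighting constant lam (an assumption; the derivation above corresponds to lam = 1):

```python
import numpy as np

# EWC sketch: train on task A, compute the diagonal Hessian of task A's
# total NLL at w*, then train on a conflicting task B with the quadratic
# penalty. All data and hyperparameters are assumed for illustration.
rng = np.random.default_rng(4)
Xa = rng.normal(size=(300, 2)); ya = (Xa[:, 0] > 0).astype(float)
Xb = rng.normal(size=(300, 2)); yb = (Xb[:, 0] < 0).astype(float)  # conflicts

def grad_nll(w, X, y):                 # gradient of the total (summed) NLL
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y)

def acc(w, X, y):
    return np.mean(((X @ w) > 0) == y)

# Train on task A (plain maximum likelihood)
w = np.zeros(2)
for _ in range(500):
    w -= 0.001 * grad_nll(w, Xa, ya)
w_star = w.copy()

# Diagonal Hessian of task A's total NLL at w*: H_ii = sum_k x_ki^2 p_k(1-p_k)
p = 1 / (1 + np.exp(-Xa @ w_star))
H = np.sum(Xa**2 * (p * (1 - p))[:, None], axis=0)

# Train on task B twice: plain, and with the EWC penalty from J(w)
lam = 50.0
w_plain, w_ewc = w_star.copy(), w_star.copy()
for _ in range(2000):
    w_plain -= 0.001 * grad_nll(w_plain, Xb, yb)
    g = grad_nll(w_ewc, Xb, yb) + lam * H * (w_ewc - w_star)
    w_ewc -= 0.001 * g

print("task A accuracy, plain:", acc(w_plain, Xa, ya))
print("task A accuracy, EWC  :", acc(w_ewc, Xa, ya))
```

Because these toy tasks conflict directly, the penalty preserves task A at task B's expense; with merely different (non-conflicting) tasks it would anchor only the parameters important to A and leave the rest free to learn B.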