- Neural network:
- -Multilayer perceptron, parametric model
- -Series of linear transformations by parameters, biasing by parameters, and non-linearities
- -Parameters are set in a way that optimizes the model's performance on a task
- -Final layer is just a linear transformation, possibly followed by a reversible function (e.g. softmax)
- -Means the real task of a neural network is to find a representation of the data that can be most optimally used by a linear transformation to perform a task
- -Universal approximators
- -They learn probability distributions (assign the probability that data came from the data-generating distribution)
- -MNIST example (28x28 grayscale images of handwritten digits)
- -785-dimensional vector space
- -Subset of the space: vectors in which
- -The first 784 elements, when viewed as a 28x28 image, form a handwritten digit
- -The 785th element is an integer equal to the number displayed in the image
- -Data-generating distribution: probability distribution that assigns zero probability to all vectors outside this subset, and equal probability to all elements within the subset
- -Perfectly describes the probability of the event that a vector is within this subset
- -Sampling = generating a handwritten image along with its label
- -Model assigns the probability to 785-dimensional vectors of being generated from this subset (inputs = the first 784 elements, output = the 785th element)
- -They can be optimized by maximizing the likelihood of the data
- -Which can be done by running an optimization algorithm (e.g. gradient descent) on a loss function equal to the negative log-likelihood
- -Lower the divergence between the data-generating distribution and the model's parametrized distribution
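The steps above (model assigns likelihoods, gradient descent minimizes the negative log-likelihood) can be sketched with a toy numpy logistic-regression model; all data and hyperparameters here are assumptions for illustration, not the MNIST setup from the notes:

```python
import numpy as np

# Toy illustration: logistic regression trained by gradient descent
# on the negative log-likelihood (NLL) of synthetic, linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels from a linear rule

w = np.zeros(2)  # parameters theta
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))           # model's probability of y=1
    nll = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)              # gradient of the NLL
    w -= lr * grad                             # gradient descent step

acc = np.mean((p > 0.5) == y)
print(f"final NLL={nll:.3f}, accuracy={acc:.3f}")
```

Maximizing the likelihood and minimizing the NLL are the same optimization, which is why the loss decreases as the model's distribution approaches the data-generating one.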
- Catastrophic Forgetting:
- -Biological inspiration
- -One thing humans can do: learn more than one task sequentially without getting additional neural resources
- -When trained on one task after another, the network loses nearly all performance on the first task
- -Catastrophic forgetting
- -Also happens in babies, but only selectively (cats&dogs example)
- -Does not happen in developed brains
- -Creating architectures resistant to this: continual learning
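A minimal sketch of catastrophic forgetting, assuming a linear (logistic-regression) model and two deliberately conflicting toy tasks rather than the digit tasks in these notes:

```python
import numpy as np

# Sequential training on two conflicting toy tasks: performance on the
# first task collapses after training on the second (all data assumed).
rng = np.random.default_rng(1)

def make_task(flip):
    X = rng.normal(size=(200, 2))
    y = ((X[:, 0] > 0) != flip).astype(float)
    return X, y

def train(w, X, y, steps=300, lr=0.5):
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)   # gradient descent on the NLL
    return w

def accuracy(w, X, y):
    return np.mean(((X @ w) > 0) == y)

Xa, ya = make_task(False)   # task A: label = (x0 > 0)
Xb, yb = make_task(True)    # task B: label = (x0 <= 0), conflicts with A

w = train(np.zeros(2), Xa, ya)
acc_a_before = accuracy(w, Xa, ya)   # high after training on A
w = train(w, Xb, yb)
acc_a_after = accuracy(w, Xa, ya)    # collapses after training on B
print(acc_a_before, acc_a_after)
```

Nothing in plain maximum-likelihood training anchors the weights that mattered for the first task, which is the problem continual-learning methods try to fix.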
- Variables and data:
- theta: current parameters
- A: array A
- B: array B
- X: joint A and B
- Our data-generating distribution: equal mixture of the data-generating distributions for handwritten digits 0 and 1 and handwritten digits 2 and 3
- Events:
- W: the event that the current parameters theta are optimal. (Note: dependent on theta)
- Da: A drawn from the data-gen distribution
- Db: B drawn from the data-gen distribution
- D: X drawn from the data-gen distribution
- -Since X is just joint A and B, this means that P(D)=P(Da&Db)
- Assumptions:
- Flat prior for P(W)
- -when we don't have any data (just the task description), any parameter is as likely to be optimal as any other
- -in practice: random initialization of parameters
- Da and Db are conditionally independent given W
- P(Da&Db|W)=P(D|W)=P(Da|W)*P(Db|W)
- Analysis:
- Bayes theorem: P(W|D)=P(D|W)*P(W)/P(D)
- Expand P(D): P(W|D)=P(D|W)*P(W)/P(Da&Db)
- Conditional independence: P(W|D)=P(Da|W)*P(Db|W)*P(W)/P(Da&Db)
- Rearrange multiplication order: P(W|D)=P(Da|W)*P(W)*P(Db|W)/P(Da&Db)
- P(Da|W)*P(W)=P(Da&W): P(W|D)=P(Da&W)*P(Db|W)/P(Da&Db)
- P(Da&Db)=P(Db|Da)*P(Da): P(W|D)=P(Da&W)*P(Db|W)/(P(Db|Da)*P(Da))
- P(Da&W)/P(Da)=P(W|Da): P(W|D)=P(W|Da)*P(Db|W)/P(Db|Da)
- Take log: logP(W|D)=logP(W|Da) + logP(Db|W) - logP(Db|Da)
- (Rearrange): logP(W|D)= logP(Db|W) + logP(W|Da) - logP(Db|Da)
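The derivation can be sanity-checked numerically on a tiny discrete model (toy probabilities, assumed) where W takes two values, the prior is flat, and Da, Db are conditionally independent given W:

```python
import numpy as np

# Numeric check of the identity
#   log P(W|D) = log P(Db|W) + log P(W|Da) - log P(Db|Da)
p_w = np.array([0.5, 0.5])          # flat prior over two parameter settings
p_da_w = np.array([0.8, 0.3])       # P(Da | W=w), toy numbers
p_db_w = np.array([0.6, 0.1])       # P(Db | W=w), toy numbers

p_d_w = p_da_w * p_db_w             # conditional independence: P(D|W)
p_d = np.sum(p_d_w * p_w)           # P(D) = P(Da & Db)
p_da = np.sum(p_da_w * p_w)
p_w_d = p_d_w * p_w / p_d           # posterior P(W|D)
p_w_da = p_da_w * p_w / p_da        # posterior P(W|Da)
p_db_da = np.sum(p_db_w * p_w_da)   # P(Db|Da) = sum_w P(Db|w) P(w|Da)

lhs = np.log(p_w_d)
rhs = np.log(p_db_w) + np.log(p_w_da) - np.log(p_db_da)
print(np.allclose(lhs, rhs))  # the identity holds
```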
- Analyze the three terms:
- log-likelihood of Db. We maximize this when we train our neural network on data in B.
- Probability that the current weights are optimal for both tasks if we only have access to the data from the first task
- Fixed term. Not dependent on the current weights, so non-optimizable.
- In other words, to maximize P(W|D), we have to:
- * Maximize P(Db|W)
- * Maximize P(W|Da)
- P(W|Da) is intractable. We know it exists, but we do not have access to its density function.
- Strategy: let's approximate P(W|Da) and maximize it instead.
- Laplace approximation: approximate a parametrized probability density with a Gaussian with mean=local maximum, covariance=inverse Hessian wrt parameters of negative log of the density function at that point
- (Hessian: matrix of all second derivatives)
- (Easily derived from Taylor series of a transformed density function; see Bishop 4.4)
- p(W|Da) ~= q(W|Da) = N(w; w*, H^-1) = ((|H|^0.5)/((2*pi)^(N/2))) * e^(-0.5*(w-w*)'*H*(w-w*))
- (where N is the number of parameters, |H| is the determinant of the Hessian, w* is the value of the parameters at a local maximum of p(W|Da), and H is the Hessian of -log p(W|Da) at w*)
- Call the left (normalizing) term M:
- p(W|Da) ~= q(W|Da) = M * e^(-0.5*(w-w*)'*H*(w-w*))
- log p(W|Da) ~= log M - 0.5*(w-w*)'*H*(w-w*)
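A quick numeric sketch of the Laplace approximation in one dimension, using an assumed toy density whose exact normalization is known:

```python
import numpy as np
from math import gamma

# Toy 1D Laplace approximation (all numbers assumed): approximate the
# density p(t) proportional to t^4 * e^(-2t) on t > 0, i.e. Gamma(5, 2).
a, b = 4.0, 2.0

# -log p(t) = -(a*log t - b*t) + const
# Mode: d/dt (a*log t - b*t) = a/t - b = 0  =>  t* = a/b
t_star = a / b
# Second derivative of -log p at the mode: a / t^2
H = a / t_star**2

# Laplace approximation: Gaussian with mean t* and variance H^-1
def q(t):
    return np.sqrt(H / (2 * np.pi)) * np.exp(-0.5 * H * (t - t_star) ** 2)

# Exact normalized density for comparison: Gamma(a+1, b)
def p(t):
    return b ** (a + 1) * t ** a * np.exp(-b * t) / gamma(a + 1)

print(p(t_star), q(t_star))  # nearly equal at the mode
```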
- Bayes: p(W|Da)=P(Da|W)*P(W)/P(Da)
- log: log p(W|Da)=log P(Da|W) + log P(W) - log P(Da)
- log P(Da) is not dependent on the current parameters, and since we're using a flat prior for W, neither is log P(W). This means that
- d/dtheta log P(W|Da)=d/dtheta log P(Da|W). The two functions thus have local maxima at the same values of theta. Since log P(Da|W) is just the log-likelihood of Da under the model, the model's weights after being trained on A will be a local maximum of P(W|Da) as well.
- Since d/dtheta log P(W|Da)=d/dtheta log P(Da|W), we also have d^2/dtheta^2 (-log P(W|Da)) = d^2/dtheta^2 (-log P(Da|W)). Thus,
- H_i,j=d^2/(dw_i dw_j) of -log P(Da|W)
- We now know how to find both the local maximum and the Hessian of the negative log of p(W|Da), so we can approximate it with a Gaussian.
- Problem: calculating the Hessian requires building an NxN matrix, and N can run into the millions. Computationally infeasible at scale.
- Assumption: non-diagonal elements of the Hessian are near-zero.
- -Intuitively, we're assuming that how the loss varies with any given parameter w_i does not itself vary significantly with any other parameter w_j.
- -Not a very good assumption, but it's practically necessary.
- Laplace approximation formula: log p(W|Da) ~= log M - 0.5*(w-w*)'*H*(w-w*)
- (H(w-w*))_i=H_i,i * (w-w*)_i
- Thus, (w-w*)'*H*(w-w*) = sum from i=1 to N of ((w-w*)_i * H_i,i * (w-w*)_i)
- (w-w*)_i=(w_i-w*_i): (w-w*)'*H*(w-w*) = sum from i=1 to N of (H_i,i * (w_i-w*_i)^2)
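The collapse of the quadratic form under the diagonal assumption can be checked numerically (random toy values, assumed):

```python
import numpy as np

# With a diagonal Hessian, the full quadratic form equals the
# per-parameter sum used in the derivation.
rng = np.random.default_rng(2)
N = 5
h_diag = rng.uniform(0.1, 2.0, size=N)   # H_i,i (diagonal entries)
H = np.diag(h_diag)
w = rng.normal(size=N)
w_star = rng.normal(size=N)

full = (w - w_star) @ H @ (w - w_star)        # (w-w*)' H (w-w*)
summed = np.sum(h_diag * (w - w_star) ** 2)   # sum_i H_i,i (w_i-w*_i)^2
print(np.isclose(full, summed))
```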
- Thus, the Laplace approximation of log P(W|Da) becomes:
- log P(W|Da) ~= log M - (1/2) sum from i=1 to N of H_i,i * (w_i-w*_i)^2
- Original formula, with approximation:
- logp(W|D)~=logp(Db|W) + logM - (1/2) sum from i=1 to N of H_i,i*(w_i-w*_i)^2 - logp(Db|Da)
- logM and logp(Db|Da) are not dependent on theta.
- We can maximize the expression by minimizing:
- J(w)=-logp(Db|W) + (1/2) sum from i=1 to N of H_i,i*(w_i-w*_i)^2
- This is a loss function! First term = negative log-likelihood. Second term = a quadratic regularization term that regularizes parameters according to how important they were to the first task! This is elastic weight consolidation.
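A minimal end-to-end sketch of this loss, assuming a logistic-regression toy model, synthetic directly conflicting tasks, and an extra weighting constant lam (an assumption; the derivation above corresponds to lam = 1):

```python
import numpy as np

# EWC sketch: train on task A, compute the diagonal Hessian of task A's
# total NLL at w*, then train on a conflicting task B with the quadratic
# penalty. All data and hyperparameters are assumed for illustration.
rng = np.random.default_rng(4)
Xa = rng.normal(size=(300, 2)); ya = (Xa[:, 0] > 0).astype(float)
Xb = rng.normal(size=(300, 2)); yb = (Xb[:, 0] < 0).astype(float)  # conflicts

def grad_nll(w, X, y):                 # gradient of the total (summed) NLL
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y)

def acc(w, X, y):
    return np.mean(((X @ w) > 0) == y)

# Train on task A (plain maximum likelihood)
w = np.zeros(2)
for _ in range(500):
    w -= 0.001 * grad_nll(w, Xa, ya)
w_star = w.copy()

# Diagonal Hessian of task A's total NLL at w*: H_ii = sum_k x_ki^2 p_k(1-p_k)
p = 1 / (1 + np.exp(-Xa @ w_star))
H = np.sum(Xa**2 * (p * (1 - p))[:, None], axis=0)

# Train on task B twice: plain, and with the EWC penalty from J(w)
lam = 50.0
w_plain, w_ewc = w_star.copy(), w_star.copy()
for _ in range(2000):
    w_plain -= 0.001 * grad_nll(w_plain, Xb, yb)
    g = grad_nll(w_ewc, Xb, yb) + lam * H * (w_ewc - w_star)
    w_ewc -= 0.001 * g

print("task A accuracy, plain:", acc(w_plain, Xa, ya))
print("task A accuracy, EWC  :", acc(w_ewc, Xa, ya))
```

Because these toy tasks conflict directly, the penalty preserves task A at task B's expense; with merely different (non-conflicting) tasks it would anchor only the parameters important to A and leave the rest free to learn B.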