ewc presentation
Dec 3rd, 2019
Neural network:
-Multilayer perceptron, parametric model
-Series of linear transformations by parameters, biasing by parameters, and non-linearities
-Parameters are set in a way that optimizes the model's performance on a task
-Final layer is just a linear transformation too (possibly inside a reversible function such as softmax)
-Means the real task of a neural network is to find a representation of the data that a linear transformation can use most optimally to perform the task
-Universal approximators
-They learn probability distributions (assign the probability that data came from the data-generating distribution)
-MNIST example (28x28 grayscale images of handwritten digits)
-785-dimensional vector space
-Subset of the space: vectors in which
-The first 784 elements, when viewed as a 28x28 image, form a handwritten digit
-The 785th element is an integer equal to the digit displayed in the image
-Data-generating distribution: probability distribution that assigns zero probability to all vectors outside this subset, and equal probability to all elements within the subset
-Perfectly describes the probability of the event that a vector is within this subset
-Sampling = generating a handwritten image along with its label
-Model assigns to 785-dimensional vectors the probability of being generated from this subset (input = the first 784 elements, output = the final element)
-They can be optimized by maximizing the likelihood of the data
-Which can be done by minimizing the negative log-likelihood as a loss function with an optimization algorithm (gradient descent)
-This lowers the divergence between the data-generating distribution and the model's parametrized distribution
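
To make the last three points concrete, here is a minimal sketch (PyTorch, with random tensors standing in for real MNIST data) of an MLP trained by minimizing the negative log-likelihood:

# Minimal sketch: an MLP trained by minimizing the negative log-likelihood.
# The data here is random noise shaped like MNIST, purely for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(            # series of linear maps + non-linearities
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),           # final layer: just a linear transformation
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(64, 784)           # 64 fake flattened "28x28 images"
y = torch.randint(0, 10, (64,))   # 64 fake labels

for step in range(100):
    opt.zero_grad()
    # cross_entropy is the negative log-likelihood of the labels under the
    # model's softmax distribution; minimizing it maximizes the likelihood
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()                    # gradient descent step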


Catastrophic Forgetting:
-Biological inspiration
-One thing humans can do: learn more than one task sequentially without getting additional neural resources
-When trained on one task after another, the network loses nearly all performance on the first task
-Catastrophic forgetting
-Also happens in babies, but only selectively (cats & dogs example)
-Does not happen in developed brains
-Creating architectures resistant to this: continual learning
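
A crude illustration of the effect (PyTorch; the two "tasks" are just memorized random data with disjoint label sets, which is enough to show the collapse):

# Train on task A, then on task B, and watch task A accuracy collapse.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

xa, ya = torch.rand(256, 784), torch.randint(0, 2, (256,))  # task A: labels 0/1
xb, yb = torch.rand(256, 784), torch.randint(2, 4, (256,))  # task B: labels 2/3

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(1) == y).float().mean().item()

def train(x, y, steps=500):
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

train(xa, ya)
print("task A accuracy after training on A:", accuracy(xa, ya))  # near 1.0
train(xb, yb)
print("task A accuracy after training on B:", accuracy(xa, ya))  # near 0.0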


Variables and data:
w: current parameters (elements written w_i below)
A: dataset A (the data for the first task)
B: dataset B (the data for the second task)
X: A and B joined
Our data-generating distribution: an equal mixture of the data-generating distributions for handwritten digits 0 and 1 (task A) and handwritten digits 2 and 3 (task B)
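
A sketch of building A, B, and X, assuming torchvision's MNIST dataset is available (names here are illustrative):

# Split MNIST into task A (digits 0 and 1) and task B (digits 2 and 3).
from torchvision import datasets

mnist = datasets.MNIST(".", download=True)
labels = mnist.targets

idx_a = (labels == 0) | (labels == 1)   # A: handwritten 0s and 1s
idx_b = (labels == 2) | (labels == 3)   # B: handwritten 2s and 3s

A = list(zip(mnist.data[idx_a], labels[idx_a]))
B = list(zip(mnist.data[idx_b], labels[idx_b]))
X = A + B                               # X: A and B joined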

Events:
W: the current parameters w are optimal. (Note: this event depends on w)
Da: A drawn from the data-generating distribution
Db: B drawn from the data-generating distribution
D: X drawn from the data-generating distribution
-Since X is just A and B joined, P(D)=P(Da&Db)

Assumptions:
Flat prior for P(W)
-when we don't have any data (just the task description), any parameter setting is as likely to be optimal as any other
-in practice: random initialization of parameters
Da and Db are conditionally independent given W
P(Da&Db|W)=P(D|W)=P(Da|W)*P(Db|W)

Analysis:
Bayes' theorem: P(W|D)=P(D|W)*P(W)/P(D)
Expand P(D): P(W|D)=P(D|W)*P(W)/P(Da&Db)
Conditional independence: P(W|D)=P(Da|W)*P(Db|W)*P(W)/P(Da&Db)
Rearrange multiplication order: P(W|D)=P(Da|W)*P(W)*P(Db|W)/P(Da&Db)
P(Da|W)*P(W)=P(Da&W): P(W|D)=P(Da&W)*P(Db|W)/P(Da&Db)
P(Da&Db)=P(Db|Da)*P(Da): P(W|D)=P(Da&W)*P(Db|W)/(P(Db|Da)*P(Da))
P(Da&W)/P(Da)=P(W|Da): P(W|D)=P(W|Da)*P(Db|W)/P(Db|Da)
Take log: logP(W|D)=logP(W|Da) + logP(Db|W) - logP(Db|Da)
(Rearrange): logP(W|D)=logP(Db|W) + logP(W|Da) - logP(Db|Da)
Analyze the three terms:
-logP(Db|W): log-likelihood of Db. We maximize this when we train our neural network on the data in B.
-logP(W|Da): probability that the current weights are optimal for both tasks when we only have access to the data from the first task.
-logP(Db|Da): fixed term. Not dependent on the current weights, so non-optimizable.

In other words, to maximize P(W|D), we have to:
* Maximize P(Db|W)
* Maximize P(W|Da)
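
A numeric sanity check of the identity P(W|D)=P(W|Da)*P(Db|W)/P(Db|Da), on a toy discrete model with a flat prior and conditional independence (the probability tables are made up):

# Verify P(W|D) = P(W|Da) * P(Db|W) / P(Db|Da) numerically.
p_w = {0: 0.5, 1: 0.5}       # flat prior over "current parameters optimal?"
p_da_w = {0: 0.2, 1: 0.9}    # P(Da | W=w)
p_db_w = {0: 0.3, 1: 0.8}    # P(Db | W=w)

# Joint over (W, Da, Db), built from the conditional independence assumption
def joint(w, da, db):
    pa = p_da_w[w] if da else 1 - p_da_w[w]
    pb = p_db_w[w] if db else 1 - p_db_w[w]
    return p_w[w] * pa * pb

p_d = sum(joint(w, 1, 1) for w in (0, 1))                     # P(Da&Db)
p_da = sum(joint(w, 1, db) for w in (0, 1) for db in (0, 1))  # P(Da)
p_w1_d = joint(1, 1, 1) / p_d                                 # P(W=1|D)
p_w1_da = sum(joint(1, 1, db) for db in (0, 1)) / p_da        # P(W=1|Da)
p_db_da = p_d / p_da                                          # P(Db|Da)

print(p_w1_d, p_w1_da * p_db_w[1] / p_db_da)  # equal up to floating point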


P(W|Da) is intractable. We know it exists, but we do not have access to its density function.
Strategy: let's approximate P(W|Da) and maximize the approximation instead.
Laplace approximation: approximate a probability density with a Gaussian whose mean is a local maximum of the density, and whose covariance is the inverse of the Hessian (wrt the parameters) of the negative log of the density at that point
(Hessian: matrix of all second partial derivatives)
(Easily derived from the Taylor series of the log of the density function; see Bishop 4.4)
p(W|Da) ~= q(w) = N(w; w*, H^-1) = ((|H|^0.5)/((2*pi)^(N/2))) * e^(-0.5*(w-w*)'*H*(w-w*))
(where N is the number of parameters, |H| is the determinant of the Hessian, w* is the value of w at a local maximum of p(W|Da), and H is the Hessian of -logp(W|Da) at w*)
Call the left term M = (|H|^0.5)/((2*pi)^(N/2)). Then:
p(W|Da) ~= q(w) = M * e^(-0.5*(w-w*)'*H*(w-w*))
logp(W|Da) ~= logM - 0.5*(w-w*)'*H*(w-w*)
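
A 1-D sketch of the Laplace approximation in numpy (the target density is made up; the mode is found by grid search, the second derivative by finite differences):

# Laplace-approximate a density as N(x; x*, H^-1), where x* is the mode
# and H is the second derivative of the negative log-density at x*.
import numpy as np

def neg_log_p(x):
    # made-up unnormalized negative log-density with a single mode
    return 0.25 * x**4 + 0.5 * x**2 - x

grid = np.linspace(-5, 5, 200001)
x_star = grid[np.argmin(neg_log_p(grid))]   # mode of the density

eps = 1e-4                                  # central finite differences for H
H = (neg_log_p(x_star + eps) - 2 * neg_log_p(x_star) + neg_log_p(x_star - eps)) / eps**2

def q(x):                                   # the Gaussian approximation
    return np.sqrt(H / (2 * np.pi)) * np.exp(-0.5 * H * (x - x_star)**2)

print("mode:", x_star, "H:", H, "q at mode:", q(x_star))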

Bayes: p(W|Da)=P(Da|W)*P(W)/P(Da)
log: logp(W|Da)=logP(Da|W) + logP(W) - logP(Da)
logP(Da) is not dependent on the current parameters, and since we're using a flat prior for W, logP(W) isn't either. This means that
d/dw logP(W|Da)=d/dw logP(Da|W). The two functions thus have local maxima at the same values of w. Since logP(Da|W) is just the log-likelihood of Da under the model, the model's weights after being trained on A will be a local maximum of P(W|Da) as well.
Since d/dw logP(W|Da)=d/dw logP(Da|W), we also have d/dw -logP(W|Da)=d/dw -logP(Da|W). Thus,
H_i,j=d^2/(dw_i dw_j) -logp(Da|W)

We now know how to find both the local maximum and the Hessian of the negative log of p(W|Da), so we can approximate it with a Gaussian.
Problem: calculating the Hessian requires building an NxN matrix, and N can be in the millions. Computationally infeasible at scale.
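
To see the cost, here is the exact Hessian of a toy loss via PyTorch's autograd (this only works because the "model" below has 3 parameters; at N in the millions the NxN matrix alone would not fit in memory):

# Exact Hessian of a tiny least-squares loss; the result is N x N.
import torch
from torch.autograd.functional import hessian

x = torch.rand(32, 3)                  # toy inputs
y = torch.rand(32)                     # toy targets

def loss(w):                           # linear model with N = 3 parameters
    return ((x @ w - y) ** 2).mean()

H = hessian(loss, torch.rand(3))       # full 3x3 matrix of second derivatives
print(H.shape)                         # torch.Size([3, 3]); at N=1e6 that's 1e12 entries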

Assumption: non-diagonal elements of the Hessian are near-zero.
-Intuitively, we're assuming that how the loss varies with any given parameter w_i does not itself vary significantly with any other parameter w_j.
-Not a very good assumption, but it's practically necessary.
Laplace approximation formula: logp(W|Da) ~= logM - 0.5*(w-w*)'*H*(w-w*)
With a diagonal H: (H*(w-w*))_i = H_i,i * (w-w*)_i
Thus, (w-w*)'*H*(w-w*) = sum from i=1 to N of ((w-w*)_i * H_i,i * (w-w*)_i)
Since (w-w*)_i = (w_i-w*_i), this equals: sum from i=1 to N of (H_i,i * (w_i-w*_i)^2)
Thus, the Laplace approximation of logP(W|Da) becomes:
logP(W|Da) ~= logM - (1/2) * sum from i=1 to N of H_i,i * (w_i-w*_i)^2
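
A quick numpy check that the quadratic form really reduces to the per-parameter sum when H is diagonal (random made-up values):

# For diagonal H: (w-w*)' H (w-w*) == sum_i H_ii * (w_i - w*_i)^2
import numpy as np

n = 5
w = np.random.rand(n)
w_star = np.random.rand(n)
H = np.diag(np.random.rand(n))    # diagonal Hessian

full = (w - w_star) @ H @ (w - w_star)
summed = np.sum(np.diag(H) * (w - w_star) ** 2)
print(np.allclose(full, summed))  # True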

Original formula, with the approximation plugged in:
logp(W|D) ~= logp(Db|W) + logM - (1/2) * sum from i=1 to N of H_i,i*(w_i-w*_i)^2 - logp(Db|Da)
logM and logp(Db|Da) are not dependent on the current parameters.
We can maximize the expression by minimizing:
J(w) = -logp(Db|W) + (1/2) * sum from i=1 to N of H_i,i*(w_i-w*_i)^2
This is a loss function! First term = negative log-likelihood on task B. Second term = quadratic regularization term that regularizes parameters according to how important they were to the first task! Elastic weight consolidation.
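
A minimal PyTorch sketch of J(w). It assumes you already have w_star (a snapshot of the parameters after training on A) and h_diag (per-parameter diagonal Hessian estimates); how h_diag is estimated is left open here, since the derivation above only requires it to approximate the diagonal of the Hessian of -logp(Da|W):

# EWC loss: negative log-likelihood on task B plus the quadratic penalty.
import torch
import torch.nn.functional as F

def ewc_loss(model, logits, targets, w_star, h_diag):
    nll = F.cross_entropy(logits, targets)  # -logp(Db|W)
    penalty = 0.0                           # sum_i H_ii * (w_i - w*_i)^2
    for name, p in model.named_parameters():
        penalty = penalty + (h_diag[name] * (p - w_star[name]) ** 2).sum()
    return nll + 0.5 * penalty

# w_star would be snapshotted right after training on task A, e.g.:
# w_star = {n: p.detach().clone() for n, p in model.named_parameters()}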