- 1. General Form of EM
- 1. Concave functions
- 2. Satisfy Jensen's inequality: f(E[t]) >= E[f(t)] for concave f
- 3. Kullback-Leibler (KL) divergence: measures the difference between two probability distributions (a numpy sketch follows this sub-list)
- 1. KL divergence: KL(q || p) = E_q[log q(x) - log p(x)]
- 2. Compares how different the two densities are at each point of the x-axis in log scale, then takes the expectation under q
- 3. Not symmetric: KL(q || p) != KL(p || q)
- 4. Equals zero when a distribution is compared to itself
- 5. Always non-negative
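A minimal numpy sketch of KL divergence for discrete distributions, illustrating the properties above (the distributions `q` and `p` are made-up examples, not from the notes):

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) for discrete distributions given as probability vectors."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    # Expectation under q of the log-ratio; eps guards against log(0).
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

q = np.array([0.5, 0.3, 0.2])  # example distribution (assumed)
p = np.array([0.4, 0.4, 0.2])

print(kl_divergence(q, p))  # > 0: the distributions differ
print(kl_divergence(p, q))  # generally != KL(q || p): not symmetric
print(kl_divergence(q, q))  # 0: comparing a distribution to itself
```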
- 4. EM
- 1. Build a lower bound on the log likelihood that depends on the parameters theta (and a variational distribution q)
- 2. Optimize the lower bound until convergence
- 1. maximize the lower bound with respect to q (E step)
- 2. fix q and maximize the lower bound with respect to theta (M step)
- 3. E step
- 1. The gap between the log likelihood and the lower bound equals the KL divergence between q and the posterior over the latent variables (summed over data points), so the E step sets q to that posterior (see the decomposition written out after this block)
- 4. M step
- 1. Fix q and maximize E_q[log p(X, T | theta)] + const with respect to theta
- 5. Guaranteed to converge to at least a local maximum or a saddle point
- 1. The EM algorithm never makes things worse
- 2. On each iteration EM does not decrease the objective
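For reference, the standard decomposition behind the E and M steps (the notes gesture at it but do not write it out):

```latex
\log p(X \mid \theta)
  = \underbrace{\mathbb{E}_{q(T)}\!\left[\log \frac{p(X, T \mid \theta)}{q(T)}\right]}_{\text{lower bound } \mathcal{L}(q,\,\theta)}
  + \underbrace{\mathrm{KL}\!\left(q(T) \,\|\, p(T \mid X, \theta)\right)}_{\text{gap } \ge\, 0}
```

The E step closes the gap by setting q(T) to the posterior p(T | X, theta); the M step maximizes E_q[log p(X, T | theta)], since the entropy term of the lower bound does not depend on theta.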
- 5. Example for a discrete mixture (a sketch for a two-coin Bernoulli mixture follows below)
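The notes do not spell the example out; here is a minimal sketch of EM for a two-component Bernoulli (biased-coin) mixture on synthetic data, where every name and number is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 100 sequences of 20 flips,
# each produced by one of two biased coins; the coin identity is latent.
true_biases = np.array([0.3, 0.8])
which_coin = rng.integers(0, 2, size=100)
X = rng.binomial(1, true_biases[which_coin][:, None], size=(100, 20))

pi = np.array([0.5, 0.5])     # mixture weights
theta = np.array([0.4, 0.6])  # coin biases, arbitrary initialization
n = X.shape[1]
heads = X.sum(axis=1)         # sufficient statistic per sequence

for _ in range(50):
    # E step: posterior q(t) over which coin produced each sequence,
    # computed in log space for numerical stability.
    log_post = (heads[:, None] * np.log(theta)
                + (n - heads)[:, None] * np.log(1 - theta)
                + np.log(pi))
    log_post -= log_post.max(axis=1, keepdims=True)
    q = np.exp(log_post)
    q /= q.sum(axis=1, keepdims=True)
    # M step: closed-form maximization of E_q[log p(X, T | theta)].
    pi = q.mean(axis=0)
    theta = (q * heads[:, None]).sum(axis=0) / (n * q.sum(axis=0))

print(pi, theta)  # should approach the true weights and biases
```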
- 6. Summary
- 1. A method for training latent variable models; often more efficient than general-purpose optimization algorithms like stochastic gradient descent
- 2. Handles missing data
- 3. Replaces a hard problem (maximizing the marginal log likelihood) with a sequence of simpler tasks (maximizing the lower bound)
- 4. Guaranteed to converge (though possibly only to a local maximum)
- 5. Helps to handle complicated parameter constraints
- 6. Some extensions
- 2. Applications of EM
- 1. E step: set q to the posterior distribution of the latent variable t
- 2. M step: maximize the expectation of the joint log likelihood while keeping q fixed
- 3. K-means
- 1. Randomly initialize parameters theta
- 2. Repeat until convergence: assign each point to its closest centroid, then update the centroids
- 3. K-means is a special case of EM
- 1. where q is restricted to the set of delta functions (hard 0/1 assignments)
- 2. with fixed covariance matrices (identical spherical Gaussians of a fixed width)
- 4. K-means is faster but less flexible than a GMM (a numpy sketch follows below)
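A minimal numpy sketch of k-means written to mirror the EM structure above; the hard assignments play the role of q, and the implicit covariances are fixed and spherical (the data and function name are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: EM with hard (delta-function) assignments and
    fixed spherical covariances. Empty clusters are not handled here."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Degenerate E step: hard-assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on made-up 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```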
- 4. Probabilistic PCA
- 1. Dimensionality reduction
- 1. useful when two variables are highly correlated
- 2. e.g. project two dimensions down to one while keeping as much information as possible
- 2. PCA finds the best linear transformation for such a projection (e.g. from 2-D to 1-D)
- 3. In the presence of missing values, use probabilistic PCA
- 4. The required integral is intractable
- 5. The full exact EM algorithm cannot optimize intractable objectives, but they can be approximated to build an approximate EM algorithm
- 6. Advantages
- 1. handles missing data
- 2. gives a straightforward iterative scheme for large dimensionalities (see the sketch after this list)
- 3. can be extended to a mixture of PPCA models
- 4. supports hyperparameter tuning
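A minimal numpy sketch of the EM iterations for probabilistic PCA, following the standard Tipping & Bishop updates; it assumes fully observed data (the missing-data variant adds bookkeeping in the E step), and all names are illustrative:

```python
import numpy as np

def ppca_em(X, q, n_iters=200, seed=0):
    """EM for probabilistic PCA: X is (N, D) data, q is the latent
    dimensionality. A sketch for complete data only."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                      # centered data
    W = rng.normal(size=(D, q))      # loading matrix
    sigma2 = 1.0                     # isotropic noise variance
    for _ in range(n_iters):
        # E step: posterior moments of the latent codes z_n.
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                   # (N, q) posterior means
        Ezz = N * sigma2 * Minv + Ez.T @ Ez  # sum_n E[z_n z_n^T]
        # M step: closed-form updates for W and sigma^2.
        W = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc**2)
                  - 2 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (N * D)
    return W, mu, sigma2

# Example usage on made-up data.
X = np.random.default_rng(1).normal(size=(500, 5))
W, mu, sigma2 = ppca_em(X, q=2)
print(W.shape, sigma2)
```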