Hi yeye:

Hope things are well. Over the years, you've mentioned Theorem 3 from Shannon's seminal information theory paper several times. Informally, the theorem says that once probabilistic messages are long enough, they can be split into two classes: a set of arbitrarily small total probability, and a remainder whose per-symbol information content is close to the entropy of the source (given all the probabilistic parameters). The interpretation of the two sets is that the small set contains the "interesting" messages while the large set contains the "uninteresting" ones. It's clear enough that all relevant biological sequences, i.e. genomes, fall into the small set.

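To make that concrete for myself, I played with a tiny numpy sketch (my own toy setup: a Bernoulli(0.1) source standing in for the probabilistic messages, with the length and number of samples picked arbitrarily). The per-symbol information content of almost every sampled message lands right next to the entropy, even though those "typical" messages are a vanishing fraction of all 2^n possible sequences:

    import numpy as np

    rng = np.random.default_rng(0)

    p = 0.1                                 # Bernoulli source: P(symbol = 1) = p
    H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # entropy, bits per symbol

    n = 100_000                             # message length
    m = 5_000                               # number of messages sampled from the source

    # For a Bernoulli source, a message's probability depends only on its number
    # of ones, so sample that count directly instead of materializing the messages.
    ones = rng.binomial(n, p, size=m)
    info = -(ones * np.log2(p) + (n - ones) * np.log2(1 - p)) / n   # -(1/n) log2 P(msg)

    eps = 0.01
    print(f"entropy H = {H:.4f} bits/symbol")
    print("fraction of sampled messages with |info - H| <", eps, ":",
          np.mean(np.abs(info - H) < eps))
    # Because the messages are drawn from the source itself, this fraction is the
    # probability mass of the near-entropy ("typical") set, and it tends to 1 as n grows.
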
What I got from talking to you was that this serves as a warning. We aren't able to comprehend genetic data in its entirety without the use of statistics, due to the size and complexity of the data; yet we have to be very careful with statistics because our observations are extremely biased, at least with respect to the whole set of possible sequences.

I realized today that there's an analogous idea in what I work on. In particular, I think about dependent high-dimensional data. Two common concepts are: (1) assuming low rank structure and (2) regularization of covariance matrices.

When assuming low rank structure, we are trying to denoise the data by only allowing the eigenvectors with the largest corresponding eigenvalues to describe the data. An additional benefit is that it becomes easier to work with the resulting low rank basis for the data.

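In code, the kind of thing I mean looks like this (a toy sketch with a planted low rank signal and dimensions I made up; keep the top-k eigenvectors of the sample covariance and throw the rest away):

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy data: n samples in p dimensions with a planted rank-k signal plus noise.
    n, p, k = 500, 50, 5
    loadings = rng.normal(size=(p, k))
    X = rng.normal(size=(n, k)) @ loadings.T + 0.1 * rng.normal(size=(n, p))

    S = np.cov(X, rowvar=False)             # p x p sample covariance

    # Keep only the eigenvectors with the k largest eigenvalues.
    evals, evecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    V_k = evecs[:, -k:]                     # the low rank basis
    S_k = (V_k * evals[-k:]) @ V_k.T        # rank-k approximation of S

    print("rank of S:  ", np.linalg.matrix_rank(S))
    print("rank of S_k:", np.linalg.matrix_rank(S_k))
    print("relative error of the rank-k approximation:",
          np.linalg.norm(S - S_k) / np.linalg.norm(S))

The other benefit I mentioned is just that V_k is p x k instead of p x p, so everything downstream gets cheaper.
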
For regularization, if our data has p variables, a naive estimate of the covariance matrix requires O(p^2) parameters, which is dangerous in terms of overfitting when the number of samples isn't much larger than p. Regularization reduces the number of parameters we have to fit --- common examples are assuming the covariance matrix is banded, or inducing sparsity.

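Written out naively (my own throwaway functions and made-up numbers, not how you'd do it in a real analysis), those two regularizers look like this:

    import numpy as np

    def band_covariance(S, bandwidth):
        # Zero out entries far from the diagonal; only sensible when the variables
        # have a natural ordering (e.g. positions along a sequence).
        p = S.shape[0]
        i, j = np.indices((p, p))
        return np.where(np.abs(i - j) <= bandwidth, S, 0.0)

    def soft_threshold_covariance(S, lam):
        # Induce sparsity by soft-thresholding the off-diagonal entries.
        T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
        np.fill_diagonal(T, np.diag(S))     # leave the variances untouched
        return T

    rng = np.random.default_rng(2)
    S = np.cov(rng.normal(size=(40, 10)), rowvar=False)
    print("nonzero entries in S:       ", np.count_nonzero(S))
    print("after banding (bandwidth 2):", np.count_nonzero(band_covariance(S, 2)))
    print("after soft-thresholding:    ", np.count_nonzero(soft_threshold_covariance(S, 0.15)))
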
In the situation of low rank structure, the resulting estimated covariance matrix is singular because it's not full rank. But if we observe a random square matrix whose elements are drawn from a fairly diffuse distribution on the real numbers (say, each element is an independent standard normal), that matrix will be invertible with probability 1! (Proof sketch: the determinant of an n x n matrix is a polynomial in its n^2 elements, and the zero set of any nonzero polynomial is measure zero in R^(n^2).)

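A throwaway numerical check of that (definitely not a proof, and with floating point "exactly singular" is slippery, but it makes the point): the smallest singular value, which measures the distance to the nearest singular matrix, never lands on zero.

    import numpy as np

    rng = np.random.default_rng(3)

    n, trials = 20, 10_000
    smallest_sv = min(
        np.linalg.svd(rng.standard_normal((n, n)), compute_uv=False)[-1]
        for _ in range(trials)
    )
    print("minimum over", trials, "random 20x20 Gaussian matrices of the")
    print("smallest singular value:", smallest_sv)
    # It can get small, but it never hits zero exactly: the singular matrices are
    # the zero set of the determinant polynomial, which has measure zero in R^(n^2).
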
Is it weird that we're assuming that our data comes from this set with measure 0? I don't think so. The motivation for assuming low rank structure is that we know real data is not full rank. In a sample of human genomes, we know that individuals will be dependent because they have common ancestors many generations ago.

Another viewpoint is the practical benefit of regularization. In the high-dimensional setting, where there are more variables than samples, the sample covariance matrix is necessarily singular; regularization often serves to make it invertible. If the sample covariance matrix is singular anyway, then I'm not too worried about also making a low rank assumption.

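Concretely (again with sizes I made up, and a plain ridge/shrinkage penalty standing in for "regularization"):

    import numpy as np

    rng = np.random.default_rng(4)

    n, p = 50, 200                          # high-dimensional: more variables than samples
    X = rng.normal(size=(n, p))
    S = np.cov(X, rowvar=False)             # p x p, rank at most n - 1, hence singular

    print("rank of S:", np.linalg.matrix_rank(S), "out of", p)

    lam = 0.1
    S_reg = S + lam * np.eye(p)             # shrink toward (a multiple of) the identity
    print("smallest eigenvalue of S_reg:", np.linalg.eigvalsh(S_reg)[0])
    # The smallest eigenvalue is at least lam > 0, so S_reg is invertible and can
    # stand in wherever an inverse covariance (precision matrix) is needed.
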
All this is to say that I think this is analogous to the situation presented by Shannon. I think that it's not too big of a deal in my case because it's clear that the data generating process is inherently biased. Perhaps then we should be thinking about the probabilistic messages conditional on the "realistic" messages.

Just something I've been musing about. Talk to you soon.

Love,
Wei