- Hi yeye:
- Hope things are well. Over the years, you've mentioned Theorem 3 from
- Shannon's seminal information theory paper several times. Informally,
- the theorem says that for probabilistic messages of at least a certain
- length, we can split the possible messages into two classes: one whose
- total probability is arbitrarily small, and one whose members each
- have a per-symbol log-probability close to the entropy of the source
- (given all the probabilistic parameters). The interpretation of the
- two sets is that the small set is the set of "interesting" messages
- while the large, typical set is the set of "uninteresting" messages.
- It's clear enough that all relevant biological sequences, i.e.
- genomes, fall into the small set.
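- Here's a toy illustration of that split (my own sketch, with made-up
- parameters, not Shannon's notation): for i.i.d. Bernoulli(p) bits of
- length n, count the sequences whose per-symbol surprisal is within
- eps of the entropy H, and check that they are a vanishing fraction of
- all 2^n sequences yet carry almost all of the probability mass.

```python
from math import comb, log2

p, n, eps = 0.2, 1000, 0.1  # illustrative source parameters
H = -(p * log2(p) + (1 - p) * log2(1 - p))  # entropy, bits per symbol

typical_count = 0
typical_prob = 0.0
for k in range(n + 1):  # k = number of ones in the sequence
    logprob = k * log2(p) + (n - k) * log2(1 - p)
    if abs(-logprob / n - H) < eps:  # per-symbol surprisal near H
        typical_count += comb(n, k)
        typical_prob += comb(n, k) * 2 ** logprob

print(typical_count / 2 ** n)  # a vanishingly small fraction of sequences
print(typical_prob)            # yet nearly all of the probability mass
```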
- What I got from talking to you was that this serves as a warning. We
- aren't able to comprehend the genetic data in its entirety without
- the use of statistics due to the size and complexity of the data; yet
- we have to be very careful with statistics because our observations
- are extremely biased, at least with respect to the whole set of
- possible sequences.
- I realized today that there's an analogous idea in what I work on. In
- particular, I think about dependent high-dimensional data. Two common
- concepts are: (1) assuming low rank structure and (2) regularization
- of covariance matrices.
- When assuming low rank structure, we are trying to denoise the data by
- allowing only the eigenvectors with the largest corresponding
- eigenvalues to describe the data. An additional benefit is that it
- becomes easier to work with the low rank basis for the data.
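- A minimal numpy sketch of that idea (the dimensions and the assumed
- rank k are all made up for illustration): simulate data with a rank-3
- signal plus noise, rebuild the covariance from only the top
- eigenpairs, and project the samples onto the low rank basis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 50, 3  # samples, dimension, assumed rank (illustrative)

# data with a true rank-k signal plus small isotropic noise
B = rng.normal(size=(p, k))
X = rng.normal(size=(n, k)) @ B.T + 0.1 * rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)     # full p x p sample covariance
vals, vecs = np.linalg.eigh(S)  # eigenvalues in ascending order
V = vecs[:, -k:]                # eigenvectors of the k largest eigenvalues
S_low = V @ np.diag(vals[-k:]) @ V.T  # rank-k approximation of S

# the low rank basis: each sample now lives in k coordinates, not p
Z = (X - X.mean(axis=0)) @ V    # n x k
```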
- For regularization, if the data is n-dimensional, naively estimating
- a covariance matrix requires fitting O(n^2) parameters, which is
- dangerous in terms of overfitting. Regularization reduces the number
- of parameters we have to fit --- common approaches are assuming the
- covariance matrix is banded and inducing sparsity.
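- To make the parameter counting concrete, here's a sketch of banding
- (the helper function and bandwidth are my own illustrative choices):
- keeping only entries within b of the diagonal leaves O(n*b) free
- entries instead of O(n^2).

```python
import numpy as np

def band_covariance(S, b):
    """Banding regularizer: zero every entry more than b off the diagonal."""
    i, j = np.indices(S.shape)
    return np.where(np.abs(i - j) <= b, S, 0.0)

p, b = 8, 1
S = np.arange(1.0, p * p + 1).reshape(p, p)
S = (S + S.T) / 2                 # symmetrize for the example
Sb = band_covariance(S, b)

kept = np.count_nonzero(Sb)       # entries surviving the band
print(kept, p * p)                # 3p - 2 = 22 entries kept, versus p^2 = 64
```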
- In the situation of low rank structure, the resulting estimated
- covariance matrix is singular because it's not full rank. But if we
- observe a random square matrix where the elements are drawn from the
- real numbers assuming a fairly diffuse distribution (say, each element
- is a standard normal), that matrix will be invertible with probability
- 1! (proof sketch: the determinant of an n x n matrix is a nonzero
- polynomial in its n^2 entries, and the zero set of a nonzero
- polynomial is measure zero in R^(n^2)).
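- A quick empirical check of that almost-sure claim (the matrix size
- and number of draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, draws = 10, 1000
# count how many i.i.d. standard normal matrices come out rank deficient
singular = sum(
    np.linalg.matrix_rank(rng.normal(size=(n, n))) < n
    for _ in range(draws)
)
print(singular)  # 0: every Gaussian draw was full rank, hence invertible
```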
- Is it weird that we're assuming that our data comes from this set with
- measure 0? I don't think so. The motivation of assuming low rank
- structure is that we know real data is not full rank. In a sample of
- human genomes, we know that individuals will be dependent because they
- have common ancestors many generations ago.
- Another viewpoint is the practical benefit of regularization. In
- the high-dimensional setting, sample covariance matrices will be
- singular. Regularization often serves to make the matrix invertible.
- If the sample covariance matrix is singular anyway, then I'm not too
- worried about making the low rank assumption.
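- A small sketch of that point (the dimensions and the ridge penalty
- lam are made up): with fewer samples than dimensions the sample
- covariance is rank deficient, and adding lam * I restores
- invertibility.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 100               # fewer samples than dimensions
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)  # p x p, but rank at most n - 1 after centering

print(np.linalg.matrix_rank(S))  # 19 < p, so S is singular

lam = 0.1
S_reg = S + lam * np.eye(p)  # ridge-style shrinkage toward the identity
S_inv = np.linalg.inv(S_reg) # succeeds: eigenvalues of S_reg are >= lam
```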
- All this is to say that I think this is analogous to the situation
- presented by Shannon. I think that it's not too big of a deal in my
- case because it's clear that the data generating process is inherently
- biased. Perhaps then we should be thinking about the probabilistic
- messages conditional on the "realistic" messages.
- Just something I've been musing about. Talk to you soon.
- Love,
- Wei