- Hi yeye:
- Hope things are well. Over the years, you've mentioned Theorem 3 from
- Shannon's seminal information theory paper several times. Informally,
- the theorem says that for probabilistic messages of at least a certain
- length, we can split the possible messages into two classes: one whose
- total probability is arbitrarily small, and one whose members each
- have a per-symbol log-probability close to the entropy of the source
- (given all the probabilistic parameters). The interpretation of the
- two sets is that the small set is the set of "interesting" messages
- while the large, typical set is the set of "uninteresting" messages.
- It's clear enough that all relevant biological sequences, i.e.
- genomes, fall into the small set.
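- Here's a toy illustration of that split (my own sketch, with made-up
- parameters, not Shannon's notation): for i.i.d. Bernoulli(p) bits of
- length n, count the sequences whose per-symbol surprisal is within
- eps of the entropy H, and check that they are a vanishing fraction of
- all 2^n sequences yet carry almost all of the probability mass.

```python
from math import comb, log2

p, n, eps = 0.2, 1000, 0.1  # illustrative source parameters
H = -(p * log2(p) + (1 - p) * log2(1 - p))  # entropy, bits per symbol

typical_count = 0
typical_prob = 0.0
for k in range(n + 1):  # k = number of ones in the sequence
    logprob = k * log2(p) + (n - k) * log2(1 - p)
    if abs(-logprob / n - H) < eps:  # per-symbol surprisal near H
        typical_count += comb(n, k)
        typical_prob += comb(n, k) * 2 ** logprob

print(typical_count / 2 ** n)  # a vanishingly small fraction of sequences
print(typical_prob)            # yet nearly all of the probability mass
```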
- What I got from talking to you was that this serves as a warning. We
- aren't able to comprehend the genetic data in its entirety without
- the use of statistics due to the size and complexity of the data; yet
- we have to be very careful with statistics because our observations
- are extremely biased, at least with respect to the whole set of
- possible sequences.
- I realized today that there's an analogous idea in what I work on. In
- particular, I think about dependent high-dimensional data. Two common
- concepts are: (1) assuming low rank structure and (2) regularization
- of covariance matrices.
- When assuming low rank structure, we are trying to denoise the data by
- allowing only the eigenvectors with the largest corresponding
- eigenvalues to describe the data. An additional benefit is that it
- becomes easier to work with the low rank basis for the data.
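- A minimal numpy sketch of that idea (the dimensions and the assumed
- rank k are all made up for illustration): simulate data with a rank-3
- signal plus noise, rebuild the covariance from only the top
- eigenpairs, and project the samples onto the low rank basis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 50, 3  # samples, dimension, assumed rank (illustrative)

# data with a true rank-k signal plus small isotropic noise
B = rng.normal(size=(p, k))
X = rng.normal(size=(n, k)) @ B.T + 0.1 * rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)     # full p x p sample covariance
vals, vecs = np.linalg.eigh(S)  # eigenvalues in ascending order
V = vecs[:, -k:]                # eigenvectors of the k largest eigenvalues
S_low = V @ np.diag(vals[-k:]) @ V.T  # rank-k approximation of S

# the low rank basis: each sample now lives in k coordinates, not p
Z = (X - X.mean(axis=0)) @ V    # n x k
```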
- For regularization, if the data is n-dimensional, naively estimating
- a covariance matrix requires fitting O(n^2) parameters, which is
- dangerous in terms of overfitting. Regularization reduces the number
- of parameters we have to fit --- common approaches are assuming the
- covariance matrix is banded and inducing sparsity.
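- To make the parameter counting concrete, here's a sketch of banding
- (the helper function and bandwidth are my own illustrative choices):
- keeping only entries within b of the diagonal leaves O(n*b) free
- entries instead of O(n^2).

```python
import numpy as np

def band_covariance(S, b):
    """Banding regularizer: zero every entry more than b off the diagonal."""
    i, j = np.indices(S.shape)
    return np.where(np.abs(i - j) <= b, S, 0.0)

p, b = 8, 1
S = np.arange(1.0, p * p + 1).reshape(p, p)
S = (S + S.T) / 2                 # symmetrize for the example
Sb = band_covariance(S, b)

kept = np.count_nonzero(Sb)       # entries surviving the band
print(kept, p * p)                # 3p - 2 = 22 entries kept, versus p^2 = 64
```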
- In the situation of low rank structure, the resulting estimated
- covariance matrix is singular because it's not full rank. But if we
- observe a random square matrix where the elements are drawn from the
- real numbers assuming a fairly diffuse distribution (say, each element
- is a standard normal), that matrix will be invertible with probability
- 1! (proof sketch: the determinant of an n x n matrix is a nonzero
- polynomial in its n^2 entries, and the zero set of a nonzero
- polynomial is measure zero in R^(n^2)).
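- A quick empirical check of that almost-sure claim (the matrix size
- and number of draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, draws = 10, 1000
# count how many i.i.d. standard normal matrices come out rank deficient
singular = sum(
    np.linalg.matrix_rank(rng.normal(size=(n, n))) < n
    for _ in range(draws)
)
print(singular)  # 0: every Gaussian draw was full rank, hence invertible
```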
- Is it weird that we're assuming that our data comes from this set with
- measure 0? I don't think so. The motivation of assuming low rank
- structure is that we know real data is not full rank. In a sample of
- human genomes, we know that individuals will be dependent because they
- have common ancestors many generations ago.
- Another viewpoint is the practical benefit of regularization. In
- the high-dimensional setting, sample covariance matrices will be
- singular. Regularization often serves to make the matrix invertible.
- If the sample covariance matrix is singular anyway, then I'm not too
- worried about making the low rank assumption.
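- A small sketch of that point (the dimensions and the ridge penalty
- lam are made up): with fewer samples than dimensions the sample
- covariance is rank deficient, and adding lam * I restores
- invertibility.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 100               # fewer samples than dimensions
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)  # p x p, but rank at most n - 1 after centering

print(np.linalg.matrix_rank(S))  # 19 < p, so S is singular

lam = 0.1
S_reg = S + lam * np.eye(p)  # ridge-style shrinkage toward the identity
S_inv = np.linalg.inv(S_reg) # succeeds: eigenvalues of S_reg are >= lam
```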
- All this is to say that I think this is analogous to the situation
- presented by Shannon. I think that it's not too big of a deal in my
- case because it's clear that the data generating process is inherently
- biased. Perhaps then we should be thinking about the probabilistic
- messages conditional on the "realistic" messages.
- Just something I've been musing about. Talk to you soon.
- Love,
- Wei