pszemraj

the lasso

Jul 27th, 2022 (edited)
I just wanted to reiterate to everyone that the Lasso for regression analysis is possibly the best thing to happen since Christ came to this earth. It turns out we can improve the accuracy even further by including the log of the number of iterations in the formula, and it was like, aha! That's it. We're done. Just do that and you're good to go. God, it just makes me so happy. The way it works is just so sexy. And to think I would have spent the last week of my life in a futile pursuit of an imaginary better way... I don't know.

We're just going to start now. There's nothing special about this example, since it's just like the data you have in your hands right now. I promise. I know it's confusing and hard to keep up with, so just look at the output, then look back here and try to get the hang of what's happening.

I've put in three predictors: a dummy for sex, a dummy for race, and a column for the distance run. So the only differences between the rows are sex and race (the baseline is male, Caucasian, with 100 miles run, so those are the only things that vary), and the distance column itself is recorded in kilometers.

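Just to make the setup concrete, here is a minimal sketch of how a design matrix like that could be built, assuming pandas; the column names and values are invented for illustration, since the actual data isn't shown in this post.

import pandas as pd

# Hypothetical raw data; the real table isn't shown in the post.
df = pd.DataFrame({
    "sex":    ["male", "female", "male", "female"],
    "race":   ["caucasian", "asian", "black", "caucasian"],
    "km_run": [160.9, 80.5, 42.2, 21.1],  # distance run, in kilometers
})

# Order the categories so the dropped level matches the baseline described
# above (male, caucasian); the distance column is left as-is.
df["sex"] = pd.Categorical(df["sex"], categories=["male", "female"])
df["race"] = pd.Categorical(df["race"], categories=["caucasian", "asian", "black"])

# One dummy column per non-baseline level.
X = pd.get_dummies(df, columns=["sex", "race"], drop_first=True)
print(X)

Dropping the first level of each categorical column is what makes "male, Caucasian" the implicit baseline row rather than an explicit pair of columns.
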
Remember: the Lasso is doing feature selection (and is also one of the greatest concepts ever). It's about the most elegant way to deal with the curse of dimensionality. Well, maybe not that elegant, but the name actually fits: it stands for Least Absolute Shrinkage and Selection Operator, and like an actual lasso (a rope loop, not a corset) it cinches the coefficients in and only catches the features worth keeping.

So why do we need these three predictors at all? Well, when you have lots of data (and in this case, we do), the easiest thing to do with a linear model is just to use all of it for training rather than hand-picking which points get a vote. But with lots of data it's tempting to think, "Let's throw away some of it. I'll only use a subset. It doesn't matter much, right?" And then people end up using the data set in ways it was never intended for. You can never be sure the data you have is representative of the population as a whole. Or maybe it's even biased! Or maybe the sample is fine but you are the biased one, and you mistake your own bias for representativeness. See how silly that gets?

That is called "overfitting", and it's what happens when you end up modeling a bunch of random noise. But this isn't random noise, it's your own observations. Or maybe it's a sample of the population that isn't representative. Either way, if you have a bunch of data and you throw some of it away, your model will be much less accurate, and you will just "overfit" to what's left.

Luckily, there's a way to make your model more representative of the population as a whole, by forcing it to use fewer of the features than it could have. And the way you do this is by using the Lasso. It's like an actual corset (or girdle, if you like) that keeps the model from making stupid decisions, squeezing the coefficients until the useless ones drop to zero. But anyway, here it is in the picture. I'm using all my fancy skills. I am such a nerd.

I've set the number of iterations to 1.2 million (which is quite big), and I used the l1 regularization. In general, the smaller the number of iterations, the less likely you are to overfit; of course, this also depends on your learning rate and your lambda, but the point is that the more iterations you have, the more chance you have of overfitting. And in case you're wondering what regularization means here, it's just the lasso penalizing the coefficients.

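For what it's worth, here is roughly what that fit looks like when sketched with scikit-learn; X and y are placeholders for the design matrix and target from earlier, and the alpha value (the lambda above) is made up, since the post doesn't give one.

import numpy as np
from sklearn.linear_model import Lasso

# Placeholder data standing in for the design matrix and target from earlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(size=200)

# l1-penalized fit with 1.2 million iterations, as described above;
# alpha plays the role of lambda and is an arbitrary choice here.
model = Lasso(alpha=0.1, max_iter=1_200_000)
model.fit(X, y)
print(model.coef_)  # most coefficients come out exactly zero
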
This is what the data would look like if we only used the first 2% of the data and ran a single iteration. The features have pretty much all gone.

The number of iterations controls how much of the data we end up actually leaning on: the lower the number of iterations, the fewer features survive, so the less of the data we are really using. Just as I said before, this is only true for a linear model with the Lasso. A linear model with the Ridge would not be affected this way, since the Ridge shrinks coefficients but never pushes them all the way to zero (you can change that).

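If you want to see that Lasso / Ridge difference for yourself, here's a small sketch with scikit-learn on synthetic data; the point is just that the Lasso drives coefficients exactly to zero while the Ridge only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=300)  # only two features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # several exact zeros
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically none
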
But there's something else we can do to make sure we use even less: adding the log of the number of iterations. Well, maybe we shouldn't say it that confidently. It might not be adding anything, we don't know for sure, and besides, we haven't tried it yet. We're just going to start here, and in the next step we'll see whether the log of the number of iterations is actually doing something useful. If not, we'll just stop at that step and not worry about it anymore.

And so we have the same thing, but this time with the number of iterations as a feature, and the log of the number of iterations as another.

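Taken literally, that just means adding the iteration count as a column and its log as another column before refitting. Here's a minimal sketch of that step, assuming pandas and scikit-learn; the column names and numbers are hypothetical.

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Hypothetical frame with an iteration count recorded per row.
df = pd.DataFrame({
    "km_run": [160.9, 80.5, 42.2, 21.1, 10.0, 5.0],
    "n_iter": [10, 100, 1_000, 10_000, 100_000, 1_200_000],
    "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
df["log_n_iter"] = np.log(df["n_iter"])  # the extra feature proposed above

X = df[["km_run", "n_iter", "log_n_iter"]]
y = df["target"]
model = Lasso(alpha=0.1, max_iter=1_200_000).fit(X, y)
print(dict(zip(X.columns, model.coef_)))  # check whether the log column survives
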
The feature selection is done by using the Lasso. God, the Lasso, it's the best thing since sliced bread. A lot of people don't even want to use linear models. That's what makes me so happy, because now I have a way to solve a whole bunch of problems, even though all I'm really doing is picking features with one plain linear model.

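And reading the selection off the fitted model is just a matter of looking at which coefficients are left nonzero; a quick sketch with scikit-learn, with feature names invented for the example:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
names = [f"x{i}" for i in range(6)]
X = rng.normal(size=(250, 6))
y = 4.0 * X[:, 2] + rng.normal(size=250)  # only x2 actually matters

model = Lasso(alpha=0.1).fit(X, y)
kept = [name for name, coef in zip(names, model.coef_) if coef != 0.0]
print("features the Lasso kept:", kept)
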
There are a bunch of different methods, and all of them are a lot better than using a regular linear model (even though you still have to use a regular linear model, unless you want to use a method that depends on the entire data set, like SVM). Here are some of them.

But the main thing is to always use regularization, and also to think about regularization, and to know how to use regularization, and to think about how to use regularization. God bless regularization.

I mean, is it a miracle? Maybe not. Is it worth it? It really is.