ML-005 10-5 subtitles

You've seen how regularization can help prevent overfitting, but how does it affect the bias and variance of a learning algorithm? In this video, I'd like to go deeper into the issue of bias and variance, and talk about how it interacts with, and is affected by, the regularization of your learning algorithm.

Suppose we fit a high-order polynomial, but to prevent overfitting we are going to use regularization, as shown here. So we have this regularization term to try to keep the values of the parameters small, and, as usual, the regularization sum runs from j equals 1 to m rather than from j equals 0 to m.

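For reference, this is a sketch of the cost function being described, written in the lecture's own notation (m training examples, the penalty sum starting at j = 1 so that theta 0 is left unpenalized):

```latex
% Regularized squared-error cost: data-fit term plus an L2 penalty on theta_1..theta_m
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^{2}
          \;+\; \frac{\lambda}{2m}\sum_{j=1}^{m}\theta_j^{2}
```
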
Let's consider three cases. The first is the case of a very large value of the regularization parameter lambda, such as lambda equal to 10,000 or some huge value. In this case, all of these parameters theta 1, theta 2, theta 3 and so on will be heavily penalized, so we end up with most of these parameter values being close to 0, and the hypothesis h(x) will be just equal, or approximately equal, to theta 0. So we end up with a hypothesis that more or less looks like a flat, constant straight line. This hypothesis has high bias and it underfits the data set: the horizontal straight line is just not a very good model for this data.

At the other extreme is a very small value of lambda, such as lambda equal to 0. In that case, given that we're fitting a high-order polynomial basically without regularization, or with very minimal regularization, we end up in our usual high-variance, overfitting setting: if lambda is equal to zero, we are just fitting without regularization, and the hypothesis overfits. It is only for some intermediate value of lambda, neither too large nor too small, that we end up with parameters theta that give us a reasonable fit to this data.

So how can we automatically choose a good value for the regularization parameter lambda?

Just to reiterate, here is our model and here is our learning algorithm's objective. For the setting where we're using regularization, let me define J_train(theta) to be something different: the optimization objective, but without the regularization term. Previously, in an earlier video when we were not using regularization, I defined J_train(theta) to be the same as J(theta), the cost function. But when we are using regularization, with this extra lambda term, we're going to define J_train, my training set error, to be just my sum of squared errors on the training set, or my average squared error on the training set, without taking into account the regularization term.

And similarly, I'm then also going to define the cross-validation set error and the test set error, as before, to be the average sum of squared errors on the cross-validation and test sets. So just to summarize, my definitions of J_train, J_cv and J_test are just the average squared error, or one half of the average squared error, on my training, cross-validation and test sets, without the extra regularization term.

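As a concrete sketch of these unregularized error measures, here is one way to write them in Python with NumPy (the function name and the array arguments are my own illustration, not something the lecture specifies):

```python
import numpy as np

def avg_squared_error(theta, X, y):
    """Average squared error (1 / (2m)) * sum((X @ theta - y)^2),
    with no regularization term: used for J_train, J_cv and J_test alike."""
    m = len(y)
    residuals = X @ theta - y      # h_theta(x) - y for every example
    return (residuals @ residuals) / (2 * m)

# J_train, J_cv and J_test are this same quantity evaluated on the
# training, cross-validation and test sets respectively, e.g.:
# j_train = avg_squared_error(theta, X_train, y_train)
# j_cv    = avg_squared_error(theta, X_cv,    y_cv)
# j_test  = avg_squared_error(theta, X_test,  y_test)
```
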
So, here is how we can automatically choose the regularization parameter lambda. What I usually do is have some range of values of lambda that I want to try. So I might consider not using regularization at all, and then here are a few values I might try: lambda equals 0.01, 0.02, 0.04, and so on. I usually step these up in multiples of two until some larger value; stepping in multiples of two, I actually end up with 10.24. That is not 10 exactly, but it is close enough, and the extra 0.24 won't affect the result that much. So this gives me maybe twelve different models that I'm trying to select amongst, corresponding to 12 different values of the regularization parameter lambda. And of course, you can also go to values less than 0.01 or values larger than 10, but I've just truncated the range here for convenience.

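A minimal sketch of that grid of candidate values: 0, then 0.01 doubled repeatedly until 10.24, giving the twelve values mentioned in the lecture:

```python
# 0 plus 0.01 * 2^k for k = 0..10 gives 0, 0.01, 0.02, 0.04, ..., 10.24
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]
print(lambdas)   # 12 candidate regularization parameters
```
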
Given each of these 12 models, what we can do is the following: take the first model, with lambda equals 0, and minimize my cost function J(theta). This gives me some parameter vector theta and, similar to the earlier video, let me denote it theta superscript (1). Then I can take my second model, with lambda set to 0.01, and minimize my cost function, now using lambda equals 0.01 of course, to get some different parameter vector theta, which I'll denote theta(2). Similarly, I end up with theta(3) for my third model, and so on, until for my final model, with lambda set to 10, or rather 10.24, I end up with theta(12).

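Sketching that loop in Python: for each candidate lambda, minimize the regularized cost to obtain one parameter vector. For brevity I use the closed-form normal equation for regularized linear regression here; the lecture itself only says to minimize the cost function, and any optimizer (gradient descent, fminunc, and so on) would do. X_train and y_train are hypothetical arrays holding the polynomial features, with a leading column of ones, and the targets:

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Minimize the regularized squared-error cost via the normal equation.
    The (0, 0) entry of the penalty matrix is zeroed so theta_0 is not penalized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# One fitted parameter vector theta(1) ... theta(12), one per value of lambda.
thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]
```
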
Next, I can take all of these hypotheses, all of these parameters, and use my cross-validation set to evaluate them. So I can look at my first model, my second model, and so on, fit with these different values of the regularization parameter, and evaluate them on my cross-validation set; basically, measure the average squared error of each of these parameter vectors theta on my cross-validation set. I would then pick whichever one of these 12 models gives me the lowest error on the cross-validation set. And let's say, for the sake of this example, that I end up picking theta(5), the fifth model, because that has the lowest cross-validation error.

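Continuing the sketch: evaluate every fitted theta on the cross-validation set using the unregularized error from above, and keep the one with the lowest J_cv (in the lecture's example that happens to be the fifth one):

```python
cv_errors = [avg_squared_error(theta, X_cv, y_cv) for theta in thetas]
best_idx = int(np.argmin(cv_errors))                 # e.g. index 4 -> theta(5)
best_lambda, best_theta = lambdas[best_idx], thetas[best_idx]
```
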
Having done that, finally, if I want to report a test set error, what I would do is take the parameter vector theta(5) that I've selected and look at how well it does on my test set. Once again, this theta was, in effect, fit to my cross-validation set, which is why I am setting aside a separate test set that I am going to use to get a better estimate of how well my parameter vector theta will generalize to previously unseen examples. So that's model selection applied to selecting the regularization parameter lambda.

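The generalization estimate reported at the end is just that same unregularized error, computed once on the held-out test set with the selected parameters:

```python
test_error = avg_squared_error(best_theta, X_test, y_test)
print(f"lambda = {best_lambda}, test error = {test_error:.4f}")
```
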
The last thing I'd like to do in this video is get a better understanding of how the cross-validation error and the training error vary as we vary the regularization parameter lambda. So, just as a reminder, that was our original cost function J(theta); but for this purpose we're going to define the training error without the regularization term, and the cross-validation error without the regularization term. And what I'd like to do is plot this J_train and plot this J_cv, meaning: how well my hypothesis does on the training set, and how well it does on the cross-validation set, as I vary my regularization parameter lambda.

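A rough way to produce the kind of plot being described, reusing the quantities from the sketches above (matplotlib is my choice here, not something the lecture prescribes):

```python
import matplotlib.pyplot as plt

# Training error of each fitted model, measured without the regularization term.
train_errors = [avg_squared_error(theta, X_train, y_train) for theta in thetas]

plt.plot(lambdas, train_errors, marker="o", label="J_train (unregularized)")
plt.plot(lambdas, cv_errors, marker="o", label="J_cv (unregularized)")
plt.xlabel("regularization parameter lambda")
plt.ylabel("average squared error")
plt.legend()
plt.show()
```
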
So, as we saw earlier, if lambda is small, then we're not using much regularization and we run a larger risk of overfitting; whereas if lambda is large, that is, if we are on the right part of this horizontal axis, then with a large value of lambda we run a higher risk of having a bias problem.

So if you plot J_train and J_cv, what you find is that for small values of lambda you can fit the training set relatively well, because you're not regularizing much. For small values of lambda, the regularization term basically goes away and you're just minimizing pretty much the squared error, so when lambda is small you end up with a small value for J_train. Whereas if lambda is large, then you have a high-bias problem and you might not fit your training set so well, so you end up with a value up there. So J_train(theta) will tend to increase as lambda increases, because a large value of lambda corresponds to high bias, where you might not even fit your training set well; whereas a small value of lambda corresponds to being able to fit very high-degree polynomials to your data freely, let's say.

As for the cross-validation error, we end up with a figure like this. Over here on the right, if we have a large value of lambda, we may end up underfitting, so this is the high-bias regime, and the cross-validation error, J_cv(theta), will be high, because with high bias we won't be fitting well and we won't be doing well on the cross-validation set. Whereas here on the left is the high-variance regime, where if we have too small a value of lambda, then we may be overfitting the data, and by overfitting the data the cross-validation error will also be high. So this is what the cross-validation error and the training error may look like as we vary the regularization parameter lambda. And, once again, it will often be some intermediate value of lambda, one that is just right, that works best in terms of giving a small cross-validation error or a small test set error.

The curves I've drawn here are somewhat cartoonish and somewhat idealized; on a real data set, the curves you get may end up looking a little bit messier and a little bit noisier than this. For some data sets you will really see these sorts of trends, and by looking at the plot of the hold-out cross-validation error, you can, either manually or automatically, try to select the point that minimizes the cross-validation error, and select the value of lambda corresponding to low cross-validation error.

When I'm trying to pick the regularization parameter lambda for a learning algorithm, I often find that plotting a figure like the one shown here helps me understand better what's going on, and helps me verify that I am indeed picking a good value for the regularization parameter lambda. So hopefully that gives you more insight into regularization and its effects on the bias and variance of a learning algorithm.

By now you've seen bias and variance from a lot of different perspectives. What I'd like to do in the next video is take a lot of the insights that we've gone through and build on them to put together a diagnostic called learning curves, which is a tool that I often use to try to diagnose whether a learning algorithm may be suffering from a bias problem, a variance problem, or a little bit of both.