ML-005 10-5 subtitles

You've seen how regularization can help prevent overfitting, but how does it affect the bias and variance of a learning algorithm? In this video, I'd like to go deeper into the issue of bias and variance, and talk about how it interacts with, and is affected by, the regularization of your learning algorithm.

Suppose we fit a high-order polynomial, but to prevent overfitting we are going to use regularization, as shown here. So we have this regularization term to try to keep the values of the parameters small, and, as usual, the regularization sum runs from j equals 1 to m rather than from j equals 0 to m.

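For reference, this is a sketch of the cost function being described, written in the lecture's own notation (m training examples, the penalty sum starting at j = 1 so that theta 0 is left unpenalized):

```latex
% Regularized squared-error cost: data-fit term plus an L2 penalty on theta_1..theta_m
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^{2}
          \;+\; \frac{\lambda}{2m}\sum_{j=1}^{m}\theta_j^{2}
```
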
Let's consider three cases. The first is the case of a very large value of the regularization parameter lambda, such as lambda equal to 10,000 or some huge value. In this case, all of these parameters theta 1, theta 2, theta 3 and so on will be heavily penalized, so we end up with most of these parameter values being close to 0, and the hypothesis h(x) will be just equal, or approximately equal, to theta 0. So we end up with a hypothesis that more or less looks like a flat, constant straight line. This hypothesis has high bias and it underfits the data set: the horizontal straight line is just not a very good model for this data.

At the other extreme is a very small value of lambda, such as lambda equal to 0. In that case, given that we're fitting a high-order polynomial basically without regularization, or with very minimal regularization, we end up in our usual high-variance, overfitting setting: if lambda is equal to zero, we are just fitting without regularization, and the hypothesis overfits. It is only for some intermediate value of lambda, neither too large nor too small, that we end up with parameters theta that give us a reasonable fit to this data.

So how can we automatically choose a good value for the regularization parameter lambda?

Just to reiterate, here is our model and here is our learning algorithm's objective. For the setting where we're using regularization, let me define J_train(theta) to be something different: the optimization objective, but without the regularization term. Previously, in an earlier video when we were not using regularization, I defined J_train(theta) to be the same as J(theta), the cost function. But when we are using regularization, with this extra lambda term, we're going to define J_train, my training set error, to be just my sum of squared errors on the training set, or my average squared error on the training set, without taking into account the regularization term.

And similarly, I'm then also going to define the cross-validation set error and the test set error, as before, to be the average sum of squared errors on the cross-validation and test sets. So just to summarize, my definitions of J_train, J_cv and J_test are just the average squared error, or one half of the average squared error, on my training, cross-validation and test sets, without the extra regularization term.

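As a concrete sketch of these unregularized error measures, here is one way to write them in Python with NumPy (the function name and the array arguments are my own illustration, not something the lecture specifies):

```python
import numpy as np

def avg_squared_error(theta, X, y):
    """Average squared error (1 / (2m)) * sum((X @ theta - y)^2),
    with no regularization term: used for J_train, J_cv and J_test alike."""
    m = len(y)
    residuals = X @ theta - y      # h_theta(x) - y for every example
    return (residuals @ residuals) / (2 * m)

# J_train, J_cv and J_test are this same quantity evaluated on the
# training, cross-validation and test sets respectively, e.g.:
# j_train = avg_squared_error(theta, X_train, y_train)
# j_cv    = avg_squared_error(theta, X_cv,    y_cv)
# j_test  = avg_squared_error(theta, X_test,  y_test)
```
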
So, here is how we can automatically choose the regularization parameter lambda. What I usually do is have some range of values of lambda that I want to try. So I might consider not using regularization at all, and then here are a few values I might try: lambda equals 0.01, 0.02, 0.04, and so on. I usually step these up in multiples of two until some larger value; stepping in multiples of two, I actually end up with 10.24. That is not 10 exactly, but it is close enough, and the extra 0.24 won't affect the result that much. So this gives me maybe twelve different models that I'm trying to select amongst, corresponding to 12 different values of the regularization parameter lambda. And of course, you can also go to values less than 0.01 or values larger than 10, but I've just truncated the range here for convenience.

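A minimal sketch of that grid of candidate values: 0, then 0.01 doubled repeatedly until 10.24, giving the twelve values mentioned in the lecture:

```python
# 0 plus 0.01 * 2^k for k = 0..10 gives 0, 0.01, 0.02, 0.04, ..., 10.24
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]
print(lambdas)   # 12 candidate regularization parameters
```
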
Given each of these 12 models, what we can do is the following: take the first model, with lambda equals 0, and minimize my cost function J(theta). This gives me some parameter vector theta and, similar to the earlier video, let me denote it theta superscript (1). Then I can take my second model, with lambda set to 0.01, and minimize my cost function, now using lambda equals 0.01 of course, to get some different parameter vector theta, which I'll denote theta(2). Similarly, I end up with theta(3) for my third model, and so on, until for my final model, with lambda set to 10, or rather 10.24, I end up with theta(12).

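Sketching that loop in Python: for each candidate lambda, minimize the regularized cost to obtain one parameter vector. For brevity I use the closed-form normal equation for regularized linear regression here; the lecture itself only says to minimize the cost function, and any optimizer (gradient descent, fminunc, and so on) would do. X_train and y_train are hypothetical arrays holding the polynomial features, with a leading column of ones, and the targets:

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Minimize the regularized squared-error cost via the normal equation.
    The (0, 0) entry of the penalty matrix is zeroed so theta_0 is not penalized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# One fitted parameter vector theta(1) ... theta(12), one per value of lambda.
thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]
```
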
Next, I can take all of these hypotheses, all of these parameters, and use my cross-validation set to evaluate them. So I can look at my first model, my second model, and so on, fit with these different values of the regularization parameter, and evaluate them on my cross-validation set; basically, measure the average squared error of each of these parameter vectors theta on my cross-validation set. I would then pick whichever one of these 12 models gives me the lowest error on the cross-validation set. And let's say, for the sake of this example, that I end up picking theta(5), the fifth model, because that has the lowest cross-validation error.

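Continuing the sketch: evaluate every fitted theta on the cross-validation set using the unregularized error from above, and keep the one with the lowest J_cv (in the lecture's example that happens to be the fifth one):

```python
cv_errors = [avg_squared_error(theta, X_cv, y_cv) for theta in thetas]
best_idx = int(np.argmin(cv_errors))                 # e.g. index 4 -> theta(5)
best_lambda, best_theta = lambdas[best_idx], thetas[best_idx]
```
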
Having done that, finally, if I want to report a test set error, what I would do is take the parameter vector theta(5) that I've selected and look at how well it does on my test set. Once again, this theta was, in effect, fit to my cross-validation set, which is why I am setting aside a separate test set that I am going to use to get a better estimate of how well my parameter vector theta will generalize to previously unseen examples. So that's model selection applied to selecting the regularization parameter lambda.

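The generalization estimate reported at the end is just that same unregularized error, computed once on the held-out test set with the selected parameters:

```python
test_error = avg_squared_error(best_theta, X_test, y_test)
print(f"lambda = {best_lambda}, test error = {test_error:.4f}")
```
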
The last thing I'd like to do in this video is get a better understanding of how the cross-validation error and the training error vary as we vary the regularization parameter lambda. So, just as a reminder, that was our original cost function J(theta); but for this purpose we're going to define the training error without the regularization term, and the cross-validation error without the regularization term. And what I'd like to do is plot this J_train and plot this J_cv, meaning: how well my hypothesis does on the training set, and how well it does on the cross-validation set, as I vary my regularization parameter lambda.

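A rough way to produce the kind of plot being described, reusing the quantities from the sketches above (matplotlib is my choice here, not something the lecture prescribes):

```python
import matplotlib.pyplot as plt

# Training error of each fitted model, measured without the regularization term.
train_errors = [avg_squared_error(theta, X_train, y_train) for theta in thetas]

plt.plot(lambdas, train_errors, marker="o", label="J_train (unregularized)")
plt.plot(lambdas, cv_errors, marker="o", label="J_cv (unregularized)")
plt.xlabel("regularization parameter lambda")
plt.ylabel("average squared error")
plt.legend()
plt.show()
```
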
So, as we saw earlier, if lambda is small, then we're not using much regularization and we run a larger risk of overfitting; whereas if lambda is large, that is, if we are on the right part of this horizontal axis, then with a large value of lambda we run a higher risk of having a bias problem.

So if you plot J_train and J_cv, what you find is that for small values of lambda you can fit the training set relatively well, because you're not regularizing much. For small values of lambda, the regularization term basically goes away and you're just minimizing pretty much the squared error, so when lambda is small you end up with a small value for J_train. Whereas if lambda is large, then you have a high-bias problem and you might not fit your training set so well, so you end up with a value up there. So J_train(theta) will tend to increase as lambda increases, because a large value of lambda corresponds to high bias, where you might not even fit your training set well; whereas a small value of lambda corresponds to being able to fit very high-degree polynomials to your data freely, let's say.

As for the cross-validation error, we end up with a figure like this. Over here on the right, if we have a large value of lambda, we may end up underfitting, so this is the high-bias regime, and the cross-validation error, J_cv(theta), will be high, because with high bias we won't be fitting well and we won't be doing well on the cross-validation set. Whereas here on the left is the high-variance regime, where if we have too small a value of lambda, then we may be overfitting the data, and by overfitting the data the cross-validation error will also be high. So this is what the cross-validation error and the training error may look like as we vary the regularization parameter lambda. And, once again, it will often be some intermediate value of lambda, one that is just right, that works best in terms of giving a small cross-validation error or a small test set error.

The curves I've drawn here are somewhat cartoonish and somewhat idealized; on a real data set, the curves you get may end up looking a little bit messier and a little bit noisier than this. For some data sets you will really see these sorts of trends, and by looking at the plot of the hold-out cross-validation error, you can, either manually or automatically, try to select the point that minimizes the cross-validation error, and select the value of lambda corresponding to low cross-validation error.

When I'm trying to pick the regularization parameter lambda for a learning algorithm, I often find that plotting a figure like the one shown here helps me understand better what's going on, and helps me verify that I am indeed picking a good value for the regularization parameter lambda. So hopefully that gives you more insight into regularization and its effects on the bias and variance of a learning algorithm.

By now you've seen bias and variance from a lot of different perspectives. What I'd like to do in the next video is take a lot of the insights that we've gone through and build on them to put together a diagnostic called learning curves, which is a tool that I often use to try to diagnose whether a learning algorithm may be suffering from a bias problem, a variance problem, or a little bit of both.