This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is wrap up our discussion of learning theory, and I'm going to start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be about advice for applying machine learning algorithms to problems like the class project, or other problems you may go work on after you graduate from this class.

But let's start by talking about Bayesian statistics and regularization. So you remember from last week, we started to talk about learning theory, and we learned about bias and variance. And in the previous lecture we spent most of the time talking about algorithms for model selection and for feature selection; we talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. For example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and may be less likely to overfit.

What I want to do today is talk about a different way to prevent overfitting: a method called regularization, which lets you keep all the parameters. So here's the idea, and I'm going to illustrate it with, say, linear regression. You take the linear regression model, the very first model we learned about, and we said that we would choose the parameters via maximum likelihood. Right?
And that meant that you would choose the parameters theta that maximize the probability of the data we observe:

theta_ML = arg max_theta ∏_i p(y(i) | x(i); theta).

And so, to give this sort of procedure a name, this is one example of a frequentist procedure, and the frequentist school you can think of as maybe one school of statistics. The philosophical view behind writing this down is that we envision there is some true parameter theta out there that generated the x's and the y's. There's some true parameter theta that governs housing prices, y as a function of x; we don't know what the value of theta is, and we'd like to come up with some procedure for estimating it. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta. And the way we formulated this, theta was not a random variable. Right? That's what we said: theta is just some true value out there, it's not random or anything, we just don't know what it is, and we have a procedure called maximum likelihood for estimating its value. So this is one example of what's called a frequentist procedure.

The alternative to the frequentist school of statistics is the Bayesian school, in which we're going to say that we don't know what theta is, and so we will put a prior on theta. Okay? So people in the Bayesian school would say: well, we don't know the value of theta, so let's represent our uncertainty over theta with a prior. For example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared times the identity:

theta ~ N(0, tau² I).

And so, if I use S to denote my training set, the prior p(theta) represents my beliefs about what the parameters are in the absence of any data.
So not having seen any data, the prior represents what I think theta is most likely to be. And given the training set S, in the Bayesian procedure, we would calculate the posterior probability of the parameters given my training set. Let's write this on the next board. So my posterior on my parameters given my training set, by Bayes' rule, will be proportional to the likelihood times the prior:

p(theta | S) ∝ (∏_i p(y(i) | x(i), theta)) · p(theta).

Let's call this the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a prediction on the price of a new house with input x, I would say that the distribution over the possible housing prices for this new house I'm trying to estimate, given the features x of the house and the training set S I had previously, is given by an integral over my parameters theta of the probability of y given x comma theta, times the posterior distribution of theta given the training set:

p(y | x, S) = ∫ p(y | x, theta) p(theta | S) dtheta.

And in particular, if you want your prediction to be the expected value of y given the input x and the training set, you would integrate y against that density:

E[y | x, S] = ∫ y · p(y | x, S) dy.

You would take the expectation of y with respect to your posterior distribution. Okay?

And you notice that when I was writing this down with the Bayesian formulation, I started to write p(y | x, theta), because this formula now is the probability of y conditioned on the values of the random variables x and theta. So I'm no longer writing semicolon theta, I'm writing comma theta, because I'm now treating theta as a random variable. So all of this is somewhat abstract.
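As a minimal sketch of what that predictive integral means (my own illustration, not code from the lecture), here is a toy one-dimensional model, y = theta·x plus Gaussian noise with a Gaussian prior on theta, where the posterior and the integral are approximated on a grid; the grid range, noise model, and scales are all assumptions:

```python
import numpy as np

def predictive_mean(x_new, X, Y, tau=1.0, sigma=1.0):
    # Grid over a 1-D theta; fine for illustration only.
    thetas = np.linspace(-10.0, 10.0, 2001)
    # log p(theta): Gaussian prior N(0, tau^2), up to a constant.
    log_post = -thetas**2 / (2.0 * tau**2)
    # log p(S | theta): assumed noise model y = theta * x + N(0, sigma^2).
    for x, y in zip(X, Y):
        log_post += -((y - thetas * x) ** 2) / (2.0 * sigma**2)
    post = np.exp(log_post - log_post.max())  # stabilize before exponentiating
    post /= post.sum()                        # normalized posterior on the grid
    # E[y | x_new, S] = integral of (theta * x_new) * p(theta | S) dtheta.
    return np.sum(thetas * x_new * post)
```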
Are there questions about this? No? Okay. Let's try to make this more concrete. It turns out that for many problems, both of these steps in the computation are difficult, because if theta is an n-plus-one-dimensional parameter vector, then this is an integral over R^(n+1), and numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral, if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms, say Bayesian logistic regression, this is hard to do.

And so what's commonly done is, instead of actually computing the full posterior distribution p(theta | S), we take the quantity on the right-hand side and just maximize it. So let me write this down. Commonly, instead of computing the full posterior distribution, we will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most probable value of theta under your posterior distribution:

theta_MAP = arg max_theta (∏_i p(y(i) | x(i), theta)) · p(theta).

And then when you need to make a prediction, you would just predict using your usual hypothesis, with this MAP value of theta in place of the parameter vector you'd otherwise choose. Okay? And notice, the only difference between this and standard maximum likelihood estimation is that instead of choosing the maximum likelihood value of theta, you're maximizing the likelihood, which is what you have for maximum likelihood estimation, times this other quantity, which is the prior. Right?
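For linear regression with the Gaussian prior above, the MAP estimate actually has a closed form. A minimal sketch, assuming Gaussian noise with variance sigma² (the noise model is an assumption I'm adding to make the formula concrete):

```python
import numpy as np

def map_linear_regression(X, y, tau=1.0, sigma=1.0):
    # MAP for linear regression with prior theta ~ N(0, tau^2 I) and
    # assumed noise y = theta^T x + N(0, sigma^2). The maximization has
    # the closed form of ridge regression with lambda = sigma^2 / tau^2:
    #   theta_MAP = (X^T X + lambda I)^(-1) X^T y
    n = X.shape[1]
    lam = (sigma / tau) ** 2
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```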
And let's see, the intuition is that if your prior is theta being Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? There's a Gaussian centered around the point zero, and so most of the mass is close to zero. And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection, eliminating a feature from consideration is the same as setting the corresponding value of theta equal to zero. All right? So if you set theta-five equal to zero, that's the same as eliminating feature five from your hypothesis. And so this is a prior that drives most of the parameter values to values close to zero, and you can think of this as doing something reminiscent of feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures: if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial (well, I think that one had too many bumps in it, but never mind), if you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice (it's sort of hard to see, but you can take my word for it), as tau becomes smaller and smaller, the curves you fit to your data also become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters.
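To illustrate that picture, here is a hypothetical demo (the data points, polynomial degree, and tau values are made up) reusing the map_linear_regression sketch from above: a ninth-order polynomial fit to five points, where shrinking tau shrinks the norm of the coefficients and hence the size of the oscillations:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)
y = np.array([0.1, 0.9, 0.3, 0.8, 0.2])  # made-up data
X = np.vander(x, 10)                     # columns x^9, x^8, ..., x^0

for tau in [10.0, 1.0, 0.1]:
    theta = map_linear_regression(X, y, tau=tau)
    print(f"tau={tau}: ||theta|| = {np.linalg.norm(theta):.3f}")  # norm falls with tau
```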
Okay? Let's see, one last piece of intuition that I'll just toss out there, and you'll get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess. Whereas maximum likelihood for, say, linear regression turns out to be minimizing

Σ_i (y(i) − theta^T x(i))²,

it turns out that if you add this prior term, the optimization objective you end up with is

Σ_i (y(i) − theta^T x(i))² + lambda ‖theta‖²,

where you add an extra term that penalizes your parameters theta for being large. And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect (again, it's kind of hard to see, but just take my word for it) that shrinking the parameters keeps the functions you fit smoother and less likely to overfit. Okay? Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check questions about all this.

Student: The smoothing behavior, is it because [inaudible] actually get different [inaudible]?

Instructor (Andrew Ng): Let's see. Yeah. It depends. Well, most priors with most of the mass close to zero will give this effect, I guess. And just by convention, the Gaussian prior is the most common for models like logistic regression and linear regression, generalized linear models. There are a few other priors I sometimes use, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.
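As a sketch of what this regularized objective looks like for Bayesian logistic regression (again my own illustration: the batch gradient-ascent loop, learning rate, iteration count, and hold-out split are all assumptions, and features are assumed roughly unit-scale), with lambda, equivalently tau, chosen by cross-validation as comes up in the question a bit further below:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_logistic_regression(X, y, lam=1.0, lr=0.01, iters=2000):
    # Gradient ascent on  sum_i log p(y_i | x_i, theta) - lam * ||theta||^2,
    # i.e. the log-likelihood plus the log of a Gaussian prior (up to constants).
    # Labels y are assumed to be in {0, 1}.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (y - sigmoid(X @ theta)) - 2.0 * lam * theta
        theta += lr * grad
    return theta

def choose_lambda(X, y, lambdas, train_frac=0.7):
    # Simple hold-out cross-validation over candidate lambdas.
    m = int(train_frac * len(y))
    best = (np.inf, None)
    for lam in lambdas:
        theta = map_logistic_regression(X[:m], y[:m], lam=lam)
        err = np.mean((sigmoid(X[m:] @ theta) >= 0.5) != y[m:])
        best = min(best, (err, lam))
    return best[1]
```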
And so it turns out that for problems like text classification, where text classification has like 30,000 features or 50,000 features, it seems like an algorithm like logistic regression would be very prone to overfitting. Right? So imagine trying to build a spam classifier where maybe you have 100 training examples but you have 30,000 features or 50,000 features; that seems clearly to be prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with a Gaussian prior, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor (Andrew Ng): Yeah, right. And to pick either tau squared or lambda (I think the relation is lambda equals one over tau squared), you could use cross-validation, yeah. All right? Okay, cool.

So all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. When you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what are called batch learning algorithms, where you're given a training set, you get to run your learning algorithm on it, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning. So here's how the problem goes. I'm first going to give you x1. Let's say it's a classification problem, so I'm first going to give you x1 and then ask you: can you make a prediction on x1? Is the label one or zero? And you've not seen any data yet, so you make a guess. We'll call your guess ŷ1.
And after you've made your prediction, I will then reveal to you the true label y1. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent, right, if you guess randomly. And then I show you x2, and I ask you: can you make a prediction on x2? And so you now maybe make a slightly more educated guess, and call that ŷ2. And after you've made your guess, I reveal the true label to you. And then I show you x3, and then you make your guess, and learning proceeds like this. So this models settings where you have to keep learning even as you're making predictions. Okay? Say you're running a website and you have users coming in; as the first user comes in, you need to start making predictions already about what the user likes or dislikes, and as you keep making predictions you get to see more and more training examples.

So in online learning, what you care about is the total online error, which, if you get a sequence of m examples all together, is

Σ_{i=1}^m 1{ŷ(i) ≠ y(i)}.

Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that many of the learning algorithms you've learned about can be applied to this setting. One thing you could do, when you're asked to make a prediction, say ŷ3, is: well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on the examples leading up to that point. So just run the learning algorithm on all the examples you've seen prior to being asked to make a prediction on a certain example, and then use the resulting hypothesis to make a prediction on the next example.
And it turns out that there are also algorithms, especially the algorithms that we saw that use stochastic gradient descent, that can be adapted very nicely to this setting. So as a concrete example, if you remember the perceptron algorithm, say, you would initialize the parameters theta to be equal to zero, and then after seeing the i-th training example, you'd update the parameters using the standard perceptron learning rule; you've seen this rule a lot of times now. And the same thing if you were using logistic regression: after seeing each training example, you just run essentially one step of stochastic gradient descent on just the example you saw. Okay?

And so the reason I've put this into the learning theory section of this class is that it turns out you can sometimes prove fairly amazing results on your total online error using algorithms like these. I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features x(i) are maybe infinite-dimensional feature vectors, like we saw for support vector machines (and sometimes infinite-dimensional feature vectors may use kernel representations), even when the data is extremely high-dimensional and it seems like you'd be prone to overfitting, you can prove that so long as the positive and negative examples are separated by a margin in this infinite-dimensional space, the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after seeing only a finite number of examples, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay?
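A minimal sketch of the online perceptron loop (using the common mistake-driven variant with labels in {−1, +1}; the lecture notes' version may differ in conventions, e.g. labels in {0, 1}):

```python
import numpy as np

def online_perceptron(stream):
    # stream yields (x, y) pairs one at a time, with labels y in {-1, +1}.
    theta = None
    mistakes = 0                              # total online error
    for x, y in stream:
        if theta is None:
            theta = np.zeros_like(x, dtype=float)
        y_hat = 1 if theta @ x >= 0 else -1   # predict before seeing y
        if y_hat != y:
            mistakes += 1
            theta = theta + y * x             # perceptron update on a mistake
    return theta, mistakes
```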
So let's see. The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that, I mean it won't appear on the midterm and you won't be asked about this specifically in the problem sets. But I know some of you were curious after the previous lecture about how you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay?

But regardless of the theoretical results, the online learning setting is something that comes up reasonably often, and these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today, which will actually be the majority of today's lecture (can I switch to the PowerPoint slides, please?), is talk about advice for applying different machine learning algorithms.
And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully, because it turns out you can take the same machine learning tool, say logistic regression, and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person won't really get it to work, even though it was exactly the same algorithm. Right? And so what I want to do, in the rest of the time I have today, is try to convey to you some of the methods for making sure you're someone who really knows how to get these learning algorithms to work well on problems.

So just some caveats on what I'm going to talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, conceptually most difficult material in this class to understand. All right? So this is not mathematical, but this is not easy. And I want to say this caveat: some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything I say. And some of what I'll say is also not good advice for doing machine learning research; I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work. If you work at a company and you want to deliver a product, or you're building a system and you just want your machine learning system to work. Okay? Some of what I'm about to say today isn't great advice if your goal is to invent a new machine learning algorithm; this is advice for how to make a machine learning algorithm work and deploy a working system.

So, three key areas I'm going to talk about.
One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis. And third, I want to talk about advice for how to get started on a machine learning problem.

And one theme that'll come up later: you've heard about premature optimization in writing software, right? This is when someone over-designs from the start, when someone is writing a piece of code and they choose a subroutine to optimize heavily, and maybe they write that subroutine in assembly or something. And many of us have been guilty of premature optimization, where we're trying to get a piece of code to run faster, and we pick a piece of code, implement it in assembly, and really tune it to run quickly, and it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature optimization. And in undergraduate programming classes, we warn people all the time not to do premature optimization, and people still do it all the time. Right? And it turns out a very similar thing happens in building machine learning systems: many people are often guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system, and that turns out not to be the important piece. Okay? So I'll talk about that later as well.

So let's first talk about debugging learning algorithms. As a motivating example, let's say you want to build an anti-spam system, and let's say you've carefully chosen a small set of 100 words to use as features. All right? So instead of using 50,000 words, you've chosen a small set of 100 features to use for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and you get 20 percent test error, which is unacceptably high. Right? So this is Bayesian logistic regression, and so it's just like maximum likelihood but, you know, with that additional lambda ‖theta‖² term.
And we're maximizing rather than minimizing as well, so there's a minus lambda ‖theta‖² instead of a plus lambda ‖theta‖². So the question is: you implemented your Bayesian logistic regression algorithm, and you tested it on your test set and got unacceptably high error, so what do you do next? Right?

So, you know, one thing you could do is think about the ways you could improve this algorithm. And this is probably what most people will do, instead of saying: well, let's sit down and think about what could've gone wrong, and then try to improve the algorithm. Well, obviously having more training data could only help, so one thing you can do is try to get more training examples. Maybe you suspect that even 100 features was too many, so you might try to get a smaller set of features. What's more common is you might suspect your features aren't good enough, so you might spend some time looking at the email headers and seeing if you can figure out better features for finding spam emails or whatever; just sit around and come up with better features, such as ones based on the email headers. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer to see if that works; and clearly that can't hurt, right, just run gradient descent longer. Or maybe you remember hearing in class that Newton's method converges better, so let's try that instead. You may want to tune the value of lambda, because you're not sure that was right, or maybe you even want to try an SVM, because maybe you think an SVM might work better than logistic regression. So I've only listed eight things here, but you can imagine that if you were actually sitting down building a machine learning system, the options available to you are endless. You can think of hundreds of ways to improve a learning system. And some of these things, like getting more training examples (surely that's going to help), seem like a good use of your time. Right?
And it turns out that this approach of picking ways to improve the learning algorithm, picking one, and going for it might work in the sense that it may eventually get you to a working system, but often it's very time-consuming. And I think it's often largely a matter of luck whether you end up fixing what the problem actually is. In particular, these eight improvements all fix very different problems, and some of them will be fixing problems that you don't have. And if you can rule out six of the eight, say (if by somehow looking at the problem more deeply, you can figure out which one of these eight things is actually the right thing to do), you can save yourself a lot of time. So let's see how we can go about doing that.

The people in industry and in research that I see who are really good would not go and try to change a learning algorithm randomly. There are lots of things that might obviously improve your learning algorithm, but the problem is there are so many of them that it's hard to know what to do. So all the really good people run various diagnostics to figure out where the problem is. Okay?

So for our motivating story: we said Bayesian logistic regression test error was 20 percent, which let's say is unacceptably high. And let's suppose you suspect the problem is either overfitting, so it's high bias, or that maybe you have too few features to classify spam. Oh, excuse me; I think I wrote that wrong. Let's forget the labels on the slide for a second. Suppose you suspect the problem is either high bias or high variance, and some of the text on this slide doesn't make sense. You want to know if you're overfitting, which would be high variance, or if you have too few features to classify spam, which would be high bias. I had those two reversed, sorry. Okay? So how can you figure out whether the problem is one of high bias or high variance? Right?
So it turns out there's a simple diagnostic you can look at that will tell you whether the problem is high bias or high variance. If you remember the cartoon we'd seen previously for high variance problems: when you have high variance, the training error will be much lower than the test error. All right? When you have a high variance problem, that's when you're fitting your training set very well. That's when you're fitting, you know, a tenth-order polynomial to 11 data points. All right? That's when you're just fitting the dataset very well, and so your training error will be much lower than your test error. And in contrast, if you have high bias, that's when your training error will also be high. Right? That's when your data is quadratic, say, but you're fitting a linear function to it, and so you aren't even fitting your training set well.

So just in cartoons, I guess, this is what a typical learning curve for high variance looks like. On the horizontal axis, I'm plotting the training set size m, and on the vertical axis, I'm plotting the error. And so, if you have a high variance problem, you'll notice that as the training set size m increases, your test set error will keep on decreasing. And so this sort of suggests that, well, if you can increase the training set size even further, maybe if you extrapolate the green curve out, that test set error will decrease even further. All right? Another useful thing to plot here, where, let's say, the red horizontal line is the desired performance you're trying to reach, is actually the training error. Right? And it turns out that your training error will actually grow as a function of the training set size, because the larger your training set, the harder it is to fit your training set perfectly. Right? So this is just a cartoon, don't take it too seriously, but in general, your training error will actually grow as a function of your training set size.
Because with small training sets, if you have one data point, it's really easy to fit that perfectly, but if you have 10,000 data points, it's much harder to fit them all perfectly. All right? And so another diagnostic for high variance, and the one that I tend to use more, is to just look at training versus test error. And if there's a large gap between them, then this suggests that getting more training data may help you close that gap. Okay? So this is what the cartoon looks like in the case of high variance.

This is what the cartoon looks like for high bias. Right? If you look at the learning curve, you see that the curve for test error has flattened out already. And so this is a sign that if you get more training examples, if you extrapolate this curve further to the right, it's maybe not likely to go down much further. And this is a property of high bias: getting more training data won't necessarily help.

But again, to me the more useful diagnostic is to plot the training error as well as your held-out test set error. If you find that even your training error is high, then that's a sign that getting more training data is not going to help. Right? In fact, think about it: training error grows as a function of your training set size. And so if your training error is already above your level of desired performance, then getting even more training data is not going to reduce your training error down to the desired level of performance. Right? Because your training error only gets worse as you get more and more training examples. So if you extrapolate further to the right, it's not like this blue line will come back down to the level of desired performance; it will stay up there. Okay?
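A minimal sketch of computing these learning curves (my own illustration; fit and error are placeholder callables for whatever learning algorithm and error metric you're diagnosing):

```python
def learning_curve(X_train, y_train, X_test, y_test, fit, error, sizes):
    # For each training set size m, train on the first m examples and
    # record training and test error. A persistently large gap between
    # the two curves suggests high variance; a training error that is
    # itself above the target performance suggests high bias.
    train_err, test_err = [], []
    for m in sizes:
        theta = fit(X_train[:m], y_train[:m])
        train_err.append(error(theta, X_train[:m], y_train[:m]))
        test_err.append(error(theta, X_test, y_test))
    return train_err, test_err
```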
So for me personally, when looking at a curve like the green curve for test error, I actually tend to find it very difficult to tell whether the curve is still going down or whether it has flattened out. Sometimes you can tell, but very often it's somewhat ambiguous. So for me personally, the diagnostic I use most often to tell whether I have a bias problem or a variance problem is to look at training and test error and see if they're very close together or relatively far apart. Okay?

And so, going back to the list of fixes, look at the first fix: getting more training examples is a way to fix high variance. Right? If you have a high variance problem, getting more training examples will help. Trying a smaller set of features: that also fixes high variance. All right? Trying a larger set of features, or adding email features: these are solutions that fix high bias. Right? High bias being that your hypothesis was too simple, you didn't have enough features. Okay?

And so quite often you see people working on machine learning problems, and they'll remember that getting more training examples helps. And so they'll build a learning system, build an anti-spam system, and it doesn't work, and then they go off and spend lots of time and money and effort collecting more training data, because they'll say, oh well, getting more data's obviously got to help. But if they had a high bias problem in the first place, and not a high variance problem, it's entirely possible to spend three months or six months collecting more and more training data, not realizing that it couldn't possibly help. Right? And so this actually happens a lot in Silicon Valley and in companies. There will often be people building various machine learning systems, and you often see people spending six months working on fixing a learning algorithm when you could've told them six months ago that, you know, that couldn't possibly have helped.
But because they didn't know what the problem was, they could easily spend six months trying to invent new features or something. You see this surprisingly often, and it's somewhat depressing. You could've gone to them and said, I could've told you six months ago that this was not going to help. And the six months is not a joke; you actually see this. In contrast, if you actually figure out whether the problem is one of high bias or high variance, then you can rule out two of these solutions and save yourself many months of fruitless effort. Okay? I actually want to talk about these four fixes at the bottom as well, but before I move on, let me just check if there are questions about what I've talked about so far. No? Okay, great.

So bias versus variance is one thing that comes up often; it's one common diagnostic. For other machine learning problems, it's often up to your own ingenuity to figure out your own diagnostics for what's wrong. All right? If a machine learning algorithm isn't working, very often it's up to you to construct your own tests: do you look at the difference between training and test errors, or do you look at something else? It's often up to your own ingenuity to construct your own diagnostics to figure out what's going on.

What I want to do next is go through another example. This one is slightly more contrived, but it illustrates another of the most common issues that come up in applying learning algorithms. So in this example, let's say you implement Bayesian logistic regression and you get 2 percent error on spam mail and 2 percent error on non-spam mail. So it's rejecting 98 percent of your spam, which is fine; 2 percent of all spam gets through, which is fine; but it's also rejecting 2 percent of your good e-mail, 2 percent of the e-mail from your friends, and that's unacceptably high, let's say.
And let's say that a support vector machine using a linear kernel gets 10 percent error on spam and 0.01 percent error on non-spam, which is closer to the performance you want. And let's say, for the sake of this example, that you're trying to build an anti-spam system, and that you really want to deploy logistic regression to your customers, because of computational efficiency, or because you need to retrain overnight every day and logistic regression just runs more easily and more quickly, or something. Okay? So let's say you want to deploy logistic regression, but it's just not working out well. The question is: what do you do next?

So it turns out that the one other common question that comes up here is whether the algorithm is converging. You might suspect that the problem with logistic regression is that it's just not converging; maybe you need to run more iterations. And it turns out that if you look at the optimization objective, say logistic regression is optimizing J of theta, and you plot that objective as a function of the number of iterations, the curve sort of looks like it's going up, but it also sort of looks like there are asymptotes. When you look at these curves, it's often very hard to tell if the curve has already flattened out. All right? You look at these curves a lot, and you can ask: has the algorithm converged? When you look at J of theta like this, it's often hard to tell. You can run it ten times as long, and it'll often still look like maybe it's going up very slowly, or something. Right? So you'd like a better diagnostic for whether logistic regression has converged than looking at this curve.
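One simple thing you can do is record the objective at every iteration and look at the whole trace. Here is a minimal sketch for logistic regression trained by gradient ascent; this is my own illustration, with an arbitrary step size and synthetic data standing in for a real problem.

    import numpy as np

    def log_likelihood(theta, X, y):
        # Logistic regression log-likelihood, for y in {0, 1}.
        z = X @ theta
        return np.sum(y * z - np.logaddexp(0.0, z))

    def fit(X, y, lr=0.01, iters=500):
        theta = np.zeros(X.shape[1])
        history = []
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(X @ theta)))
            theta += lr * (X.T @ (y - p))        # gradient ascent step
            history.append(log_likelihood(theta, X, y))
        return theta, history

    rng = np.random.RandomState(0)
    X = rng.randn(500, 5)
    y = (X[:, 0] + 0.3 * rng.randn(500) > 0).astype(float)
    theta, history = fit(X, y)
    # Plotting `history` against the iteration number gives exactly the kind
    # of curve described above, and it is often genuinely hard to tell by eye
    # whether it has flattened out.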
The other thing you might suspect is a problem is whether you're optimizing the right function. So what you care about in spam, say, is a weighted accuracy function: A of theta is a sum over your examples of some weights times whether you got each example right. And the weight may be higher for non-spam than for spam mail, because you care about getting your predictions correct on good e-mail much more than on spam, say. So let's say A of theta is the optimization objective that you really care about, whereas what Bayesian logistic regression optimizes is a quantity like that: this sort of maximum likelihood thing, together with the two-norm penalty term that we saw previously. And you might be wondering: is this the right optimization objective to be maximizing? Or do I maybe need to change the value of lambda, to change this parameter? Or should I really be switching to the support vector machine's optimization objective? Okay? Does that make sense?

So the second diagnostic I'm going to talk about is for figuring out: is the optimization algorithm converging, or is the problem with the optimization objective I chose in the first place? Okay?

So here's the diagnostic you can use. To reiterate the story: let's say an SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian logistic regression for your problem. Let me let theta subscript SVM be the parameters learned by the SVM, and let theta subscript BLR be the parameters learned by Bayesian logistic regression. The optimization objective you care about is this weighted accuracy criterion that I talked about just now, and the support vector machine outperforms Bayesian logistic regression: the weighted accuracy on the support vector machine's parameters is better than the weighted accuracy on Bayesian logistic regression's. Further, Bayesian logistic regression tries to optimize an optimization objective like that, which I denoted J of theta.
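Written out, the two quantities being contrasted look roughly like this. This is my reconstruction from the verbal description; the weights w^(i) and the exact form of the penalty are assumptions consistent with the lecture's earlier notation.

    a(\theta) = \sum_i w^{(i)} \, 1\{ h_\theta(x^{(i)}) = y^{(i)} \}

    J(\theta) = \sum_i \log p(y^{(i)} \mid x^{(i)}; \theta) - \lambda \|\theta\|_2^2

The observed fact is that a(\theta_{\mathrm{SVM}}) > a(\theta_{\mathrm{BLR}}), and the diagnostic asks how J(\theta_{\mathrm{SVM}}) compares to J(\theta_{\mathrm{BLR}}).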
And so the diagnostic I choose to use is to check whether J of theta SVM is bigger than or less than J of theta BLR. Okay? Let me explain this on the next slide. We know one fact: the weighted accuracy of the support vector machine is bigger than the weighted accuracy of Bayesian logistic regression. So in order to figure out whether Bayesian logistic regression is failing to converge, or whether I'm just optimizing the wrong objective function, the diagnostic I'm going to use is to check whether this inequality holds. Okay?

So let me explain this. Case 1 is just those two equations copied over: let's say that J of theta SVM is indeed greater than J of theta BLR. But we know that Bayesian logistic regression was trying to maximize J of theta; that's the definition of Bayesian logistic regression. So this means that the value of theta output by Bayesian logistic regression actually fails to maximize J, because the support vector machine returned a value of theta that does a better job of maximizing J. And so this tells me that Bayesian logistic regression didn't actually maximize J correctly, and the problem is with the optimization algorithm: the optimization algorithm hasn't converged.

The other case is as follows: J of theta SVM is less than or equal to J of theta BLR. Okay? In this case, what does that mean? It means that Bayesian logistic regression actually attains the higher value for the optimization objective J than the support vector machine does. The support vector machine, which does worse on your optimization objective, actually does better on the weighted accuracy measure.
So what this means is that something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion. Therefore, this tells you that J of theta is maybe the wrong optimization objective to be maximizing: maximizing J of theta just wasn't a good objective to choose if you care about the weighted accuracy. Okay? Can you raise your hand if this made sense? Cool, good.

So that tells us whether the problem is with the optimization algorithm or with the optimization objective. And going back to this slide, to the eight fixes we had: you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value for lambda, in that lambda times the norm of theta squared term in your objective, fixes the optimization objective. And changing to an SVM is another way of trying to fix the optimization objective. Okay?

And once again, you see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm. That's another very common pattern. The problem is in the formula for your J of theta, and you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, then trying more and more crazy optimization algorithms, whereas optimizing J of theta better wasn't going to fix the problem at all. Okay?
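As a minimal sketch of this diagnostic in code (my own, not from the lecture; J and weighted_accuracy are hypothetical callables you would implement for your own models):

    def diagnose_objective(J, weighted_accuracy, theta_svm, theta_blr):
        """Decide where the problem lies, given that the SVM wins on accuracy."""
        assert weighted_accuracy(theta_svm) > weighted_accuracy(theta_blr)
        if J(theta_svm) > J(theta_blr):
            # Something else maximizes J better than our optimizer managed to:
            # the optimization ALGORITHM hasn't converged.
            return "fix the optimization algorithm (more iterations, Newton's method)"
        else:
            # We maximize J at least as well, yet do worse on what we care about:
            # J is the wrong OBJECTIVE.
            return "fix the optimization objective (change lambda, or change J itself)"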
So that's another example of how these sorts of diagnostics help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective. Okay? Let me think about how much time I have. Hmm, let's see. Well okay, we have time; let's do this. Let me show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is the most complex of the three I'm going to do today, and it draws on reinforcement learning, which is something I'm not going to talk about until close to the end of the course. So I'll go over this fairly quickly, and then after we've talked about reinforcement learning in class, I'll probably come back and redo this exact same example, because you'll understand it more deeply. Okay?

So some of you know that my students and I fly autonomous helicopters. How do you get a machine learning algorithm to design the controller for a helicopter? This is what we do. The first step is to build a simulator for the helicopter; there's a screenshot of our simulator. It's like a joystick simulator; you can fly a helicopter in simulation. Then you choose a cost function (in reinforcement learning it's usually called a reward function, but here I'll call it a cost function): say J of theta is the expected squared error in your helicopter's position. Okay? So J of theta is maybe the expected squared error, or just the squared error. And then you run a reinforcement learning algorithm (you'll learn about RL algorithms in a few weeks) in your simulator, to try to minimize this cost function, to minimize the squared error in how well you're controlling your helicopter's position. Okay?
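Concretely, the cost being minimized is something like the following. This is my reconstruction of "expected squared error in position"; x_t is the helicopter's position under the controller with parameters theta, and x_t^* is the desired position.

    J(\theta) = \mathbb{E}\left[ \sum_t \| x_t - x_t^* \|^2 \right]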
The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you use those to fly your helicopter. So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. What do you do next? In particular, corresponding to the three steps above, there are three natural things you can try. (Oh, the bottom of the slide got chopped off.) You can try to improve the simulator: maybe you think your simulator isn't accurate enough, and you need to capture the aerodynamic effects more accurately, to capture the airflow and turbulence effects around the helicopter more accurately. You can try to modify the cost function: maybe squared error isn't cutting it, and what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging, or something. Okay? So these are the diagnostics that my students and I actually use to figure out what's going on.

Actually, why don't you think about this for a second, think about what you'd do, and then I'll go on and tell you what we do.

All right, so let me tell you how we do this, and you can see whether it's the same as what you came up with. And if you have a better idea than I do, let me know and I'll let you try it on my helicopter. So here's the reasoning. Let's say the controller output by our reinforcement learning algorithm does poorly. Well, suppose the contrary: suppose the following three things hold true. First, suppose the helicopter simulator is accurate, so let's assume we have an accurate model of our helicopter.
And second, let's suppose that the reinforcement learning algorithm correctly controls the helicopter in simulation (we tend to run the learning algorithm in simulation so that the learning algorithm can crash the helicopter and it's fine, right?), so let's assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And third, let's suppose that minimizing J of theta does indeed correspond to correct autonomous flight. If all of these things held true, then the parameters theta RL should actually fly well on my real helicopter. Right? And so the fact that the learned control parameters theta RL do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which one. Okay?

So these are the diagnostics we use. The first one is that we look at the controller and see if it even flies well in simulation, in the simulator of the helicopter that we did the learning on. If the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator. Right? My simulator predicts that this controller will fly well, but it doesn't actually fly well in real life, so the problem could well be in the simulator, and we should spend our effort improving the accuracy of our simulator.

Otherwise, let me write theta subscript human for the human control policy. Let's go ahead and ask a human to fly the helicopter (it could be in the simulator, it could be in real life), and let's measure the mean squared error of the human pilot's flight, and see whether the human pilot does better or worse than the learned controller in terms of optimizing this objective function J of theta. Okay?
So if even a very good human pilot attains a better value on my optimization objective, on my cost function, than my learning algorithm does, then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but the human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm is not managing to minimize J of theta, and that tells me the problem is in the reinforcement learning algorithm.

And finally, if the human actually attains a larger value for J of theta (the human actually has a larger mean squared error for the helicopter's position than my reinforcement learning algorithm does), but I like the way the human flies much better than the way my reinforcement learning algorithm flies, then clearly the problem is in the cost function. Right? Because the human does worse on my cost function but flies much better than my learning algorithm. (Oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide.) That means my learning algorithm does a better job of minimizing the cost function, but it doesn't fly as well as a human pilot. So that tells you that minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay?

And so for these reinforcement learning problems, often reinforcement learning algorithms just work, but when they don't work, these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm.
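Putting the three checks together as a small sketch in code (again mine, not the lecture's; flies_well_in_sim, J_rl, J_human, and human_flies_better are hypothetical measurements you would make on your own system):

    def helicopter_diagnostic(flies_well_in_sim, J_rl, J_human, human_flies_better):
        """Decide which of the three assumptions is broken; lower J is better."""
        if flies_well_in_sim:
            # Works in simulation but not on the real helicopter.
            return "improve the simulator"
        if J_human < J_rl:
            # RL was supposed to minimize J, but a human attains lower cost.
            return "fix the reinforcement learning algorithm"
        if human_flies_better:
            # RL minimizes J better than the human yet flies worse, so
            # minimizing J does not correspond to good flight.
            return "change the cost function J"
        return "diagnostics inconclusive; invent another test"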
And again, if you don't know which of these three problems it is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. It turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on it still. So it's entirely possible to go out, spend six years, write a PhD thesis, and build a much better helicopter simulator; but if you're fixing the wrong problem, it's not going to help.

So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. What I've described are maybe some of the most common diagnostics, ones that I've used and that I've seen be useful for many problems, but very often you'll need to come up with your own for your own specific learning problem.

And I just want to point out that even when a learning algorithm is working well, it's often a good idea to run diagnostics like the ones I've talked about, to make sure you really understand what's going on. This is useful for a couple of reasons. One is that diagnostics like these will often help you understand your application problem better. Some of you will graduate from Stanford and go on to some amazingly high-paying job applying machine learning algorithms to an application problem of significant economic interest, and you'll be working on one specific important machine learning application for many months, or even years. One of the most valuable things for you personally will be to gain an intuitive understanding of what works and what doesn't work on your problem. Right now in industry, in Silicon Valley and around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or years on end.
And when you're doing that, solving a really important problem using learning algorithms, one of the most valuable things is just your own personal intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, are one way for you to get a better and better understanding of these problems.

It turns out, by the way, that there are some Silicon Valley companies that outsource their machine learning. So there will be a company in Silicon Valley that hires a firm in New York to run all their learning algorithms for them. I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed over to an outsourcing agency, you don't maintain that expertise yourself. If there's a problem you really care about, it's your own understanding of the problem, built up over months, that will be really valuable; and if that knowledge is outsourced, you don't get to keep it. I personally think that's a terrible idea; but then, I'm not a businessman, I just see people do this a lot.

Let's see. Another reason for running diagnostics like these comes up in writing research papers. Diagnostics and error analyses, which I'll talk about in a minute, often help convey insight about the problem and help justify your research claims. For example, rather than writing a research paper that says, oh well, here's an algorithm that works, I built this helicopter and it flies, or whatever, it's often much more interesting to say: here's an algorithm that works, and it works because of a specific component X; and moreover, here's the diagnostic that justifies the claim that X was the thing that fixed the problem and made it work. Okay?

So that leads me into a discussion of error analysis, which is another piece of good machine learning practice: a way of understanding what your sources of error are.
So, what I call error analysis. And let me check for questions about this first. Yeah?

Student: What ended up being wrong with the helicopter?

Instructor (Andrew Ng): Oh, I don't know. Let's see. We've flown so many times. The thing that is most difficult with a helicopter is actually building a very... I don't know; it changes all the time. Quite often, it's actually the simulator. Building an accurate simulator of a helicopter is very hard. Yeah. Okay.

So, error analysis is a way of figuring out what is working in your algorithm and what isn't working. We're going to talk about two specific examples. There are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example, though not dissimilar in many ways from actual machine learning systems you see. Let's say you want to recognize people from images; this is a picture of one of my friends. So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. Then you run a face detection algorithm, a machine learning algorithm to detect people's faces. And then, since you want to recognize the identity of the person (that's your application), you segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth, and so on. (I know; she might not want to be my friend after she sees this.) And then, having found all these features, based on what the nose looks like, what the eyes look like, and so on, you feed all the features into a logistic regression algorithm, and your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay?

So this is where error analysis comes in. You have a long, complicated pipeline combining many machine learning components; many of these components are themselves learned.
And so it's often very useful to figure out how much of your error can be attributed to each of these components. What we do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that is this: in the figure on the bottom right, let's say the overall accuracy of the system is 85 percent. Then I want to know where my 15 percent of error comes from. So I'll go to my test set and, instead of running my own background removal, give the algorithm the correct background-versus-foreground segmentation. If I do that (let's color that blue, to denote that I'm supplying ground-truth data on the test set), let's say our accuracy increases to 85.1 percent. Okay? Next I'll give my algorithm the ground-truth face detection output; that is, on my test set I'll just tell the algorithm where the face is. If I do that, let's say my algorithm's accuracy increases to 91 percent, and so on. Then I'll go through each of these components in turn and give each one the ground-truth label; for instance, the nose segmentation algorithm is trying to figure out where the nose is, so I just go in and tell it where the nose is, and it doesn't have to figure that out. As I do this, one component after another, I end up supplying all the correct output labels and end up with 100 percent accuracy. And now you can look at this table (I'm sorry it's cut off on the bottom; the last row says logistic regression, 100 percent) and see how much giving the ground-truth labels for each of these components could help boost your final performance.
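A minimal sketch of this plug-in-the-ground-truth loop (mine, not from the lecture; run_pipeline and the component names are hypothetical, and the accuracies in the final comment are the lecture's example numbers):

    # Components in pipeline order; for each prefix we substitute ground truth.
    COMPONENTS = ["background_removal", "face_detection", "eye_segmentation",
                  "nose_segmentation", "mouth_segmentation", "logistic_regression"]

    def error_analysis(test_set, run_pipeline):
        """run_pipeline(test_set, ground_truth_upto=k) is assumed to run the
        pipeline but replace the outputs of the first k components with
        ground-truth annotations, returning test-set accuracy."""
        baseline = run_pipeline(test_set, ground_truth_upto=0)
        print(f"overall system: {baseline:.1%}")
        for k, name in enumerate(COMPONENTS, start=1):
            acc = run_pipeline(test_set, ground_truth_upto=k)
            print(f"+ ground-truth {name}: {acc:.1%}")
        # e.g. 85.0% -> 85.1% (background) -> 91.0% (face detection) -> ... -> 100%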
In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. Right? So this tells me that if only I could get better face detection, maybe I could boost my accuracy by about 6 percent. Whereas in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent. And so this sort of diagnostic tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction: even if you got the ground truth, that gives you at most 0.1 percent additional accuracy, whereas if you do better face detection, there's a much larger potential for gains. Okay? So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on, and choosing the right piece is critical. This sort of diagnostic tells you which piece may actually be worth your time to work on.

There's another type of analysis that's sort of the opposite of what I just talked about. The error analysis I just described tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline, some really bad performance, and your current performance.

For this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm: features for spelling correction, sender host features, e-mail header features, e-mail text parser features, JavaScript parser features, features for embedded images, and so on. So now that you've built the system, you want to figure out how much each of these components actually contributed.
Maybe you want to write a research paper and claim that a particular piece was what made the big difference; can you actually document that claim and justify it? So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for the improvement from 94 to 99.9 percent. So in ablative analysis, instead of adding components one at a time, we remove components one at a time and see how performance degrades. You start with your overall system, which has 99.9 percent accuracy. Then you remove spelling correction and see how much performance drops; then you remove the sender host features and see how much performance drops; and so on.

In this contrived example, you see that the biggest drop occurred when you removed the text parser features, and so you can make a credible case that the text parser features were what really made the biggest difference here. Okay? And you can also tell, for instance, that when the sender host features were removed, performance dropped only from 99.9 to 98.9 percent. So in case you want to get rid of the sender host features to speed up computation, that would be a good candidate for elimination. Okay?

Student: Are there any guarantees that if you shuffle around the order in which you drop those features, you'll get the same...

Instructor (Andrew Ng): Let me address the question: what if you shuffle the order in which you remove things? The answer is no, there's no guarantee you'd get a similar result. In practice, for both types of analyses, error analysis and ablative analysis, sometimes there's a fairly natural ordering in which you add things or remove things, and sometimes there isn't, and quite often you just choose one ordering and go with it. And don't think of these analyses as fixed formulas; feel free to invent your own as well.
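An analogous sketch for the ablative version, removing feature groups cumulatively in one chosen order (again mine; train_and_score and the feature-group names are hypothetical stand-ins, and as just discussed, a different removal order may give different numbers):

    FEATURE_GROUPS = ["spelling_correction", "sender_host", "email_header",
                      "text_parser", "javascript_parser", "embedded_images"]

    def ablative_analysis(train_and_score):
        """train_and_score(removed) is assumed to retrain the classifier with
        the listed feature groups removed and return test accuracy."""
        removed = []
        acc = train_and_score(list(removed))
        print(f"full system: {acc:.1%}")          # e.g. 99.9%
        for name in FEATURE_GROUPS:
            removed.append(name)                  # remove cumulatively, in order
            acc = train_and_score(list(removed))
            print(f"without {', '.join(removed)}: {acc:.1%}")
        # The biggest single drop points at the component that mattered most.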
One of the things that's done quite often is to take the overall system, remove one component and then put it back, then remove a different one and put it back, until this has been done for every component. Okay.

So the very last thing I want to talk about is some general advice for how to get started on a learning problem. Here's a cartoon description of two broad ways to get started. The first one is to carefully design your system: you spend a long time designing exactly the right features, collecting the right data set, and designing the right algorithmic structure; then you implement it and hope it works. All right? The benefit of this sort of approach is that you get maybe nicer, more scalable algorithms, and maybe you come up with new, elegant learning algorithms. If your goal is to contribute to basic research in machine learning, to invent new machine learning algorithms, this process of slowing down and thinking deeply about the problem is sort of the right way to go: think deeply about a problem and invent new solutions.

The second sort of approach is what I call build-and-fix: you implement something quick and dirty, and then you run error analyses and diagnostics to figure out what's wrong, and you fix those errors. The benefit of this second approach is that it will often get your application working much more quickly. And especially for those of you who end up working in a company, very often it's not the best product that wins; it's the first product to market that wins. Especially in industry, there's really something to be said for building a system quickly and getting it deployed quickly. And this second approach, building a quick-and-dirty, I'm going to say, hack, and then fixing its problems, will actually get you to a system that works well much more quickly.
1381 01:13:30,969 --> 01:13:32,649 And the reason is 1382 01:13:32,649 --> 01:13:36,149 very often it's really not clear what parts of a system are easier to think of to 1383 01:13:36,149 --> 01:13:37,590 build and therefore what 1384 01:13:37,590 --> 01:13:40,179 you need to spends lot of time focusing on. 1385 01:13:40,179 --> 01:13:43,420 So there's that example I talked about just now. Right? 1386 01:13:43,420 --> 01:13:46,929 For identifying 1387 01:13:46,929 --> 01:13:48,710 people, say. 1388 01:13:48,710 --> 01:13:53,199 And with a big complicated learning system like this, a big complicated pipeline like this, 1389 01:13:53,199 --> 01:13:55,590 it's really not obvious at the outset 1390 01:13:55,590 --> 01:13:59,130 which of these components you should spend lots of time working on. Right? And if 1391 01:13:59,130 --> 01:14:00,960 you didn't know that 1392 01:14:00,960 --> 01:14:03,800 preprocessing wasn't the right component, you could easily have 1393 01:14:03,800 --> 01:14:07,269 spent three months working on better background subtraction, not knowing that it's 1394 01:14:07,269 --> 01:14:09,880 just not gonna ultimately matter. 1395 01:14:09,880 --> 01:14:10,769 And so 1396 01:14:10,769 --> 01:14:13,690 the only way to find out what really works was inputting something quickly and 1397 01:14:13,690 --> 01:14:15,350 you find out what parts " 1398 01:14:15,350 --> 01:14:16,889 and find out 1399 01:14:16,889 --> 01:14:17,889 what parts 1400 01:14:17,889 --> 01:14:21,359 are really the hard parts to implement, or what parts are hard parts that could make a 1401 01:14:21,359 --> 01:14:23,079 difference in performance. 1402 01:14:23,079 --> 01:14:26,579 In fact, say that if your goal is to build a 1403 01:14:26,579 --> 01:14:29,309 people recognition system, a system like this is actually far too 1404 01:14:29,309 --> 01:14:31,639 complicated as your initial system. 1405 01:14:31,639 --> 01:14:35,560 Maybe after you're prototyped a few systems, and you converged a system like this. But if this 1406 01:14:35,560 --> 01:14:42,560 is your first system you're designing, this is much too complicated. Also, this is a 1407 01:14:43,570 --> 01:14:48,059 very concrete piece of advice, and this applies to your projects as well. 1408 01:14:48,059 --> 01:14:51,229 If your goal is to build a working application, 1409 01:14:51,229 --> 01:14:55,260 Step 1 is actually probably not to design a system like this. Step 1 is where you would plot your 1410 01:14:55,260 --> 01:14:57,279 data. 1411 01:14:57,279 --> 01:15:01,219 And very often, and if you just take the data you're trying to predict and just plot your 1412 01:15:01,219 --> 01:15:05,729 data, plot X, plot Y, plot your data everywhere you can think of, 1413 01:15:05,729 --> 01:15:10,309 you know, half the time you look at it and go, Gee, how come all those numbers are negative? I thought they 1414 01:15:10,309 --> 01:15:13,899 should be positive. Something's wrong with this dataset. And it's about 1415 01:15:13,899 --> 01:15:18,389 half the time you find something obviously wrong with your data or something very surprising. 1416 01:15:18,389 --> 01:15:21,570 And this is something you find out just by plotting your data, and that you 1417 01:15:21,570 --> 01:15:28,179 won't find out be implementing these big complicated learning algorithms on it. 
Plotting the data sounds so simple, but it's one of those pieces of advice that lots of us give and hardly anyone follows, so take that for what it's worth.

Let me just reiterate that what I just said may be bad advice if your goal is to come up with new machine learning algorithms. For me personally, the learning algorithm I use most often is probably logistic regression, because I have code lying around. So give me a learning problem and I probably won't try anything more complicated than logistic regression on it first. It's only after trying something really simple, and figuring out what's easy and what's hard, that you know where to focus your efforts. But again, if your goal is to invent new machine learning algorithms, then you don't want to hack something up, add another hack to fix it, and then hack it even more to fix that. If your goal is to do novel machine learning research, it pays to think more deeply about the problem, and not to follow this advice too literally.

Shoot, you know what? All right, sorry, I'm running late, but I just have two more slides, so I'm going to go through these quickly. This is what I think of as premature statistical optimization: just like premature optimization of code, quite often people will prematurely optimize one component of a big, complicated machine learning system. Okay?

This is a sort of cartoon that highly influenced my own thinking. It's based on a paper written by Christos Papadimitriou, and it shows how the developmental progress of research often happens. Let's say you want to build a mail delivery robot, so I've drawn a circle that says "mail delivery robot." It seems like a useful thing to have, right? You free people up from having to deliver mail. Now, to deliver mail, you obviously need a robot that can wander around indoor environments, and you need a robot that can manipulate objects and pick up envelopes. So you need to build those two components in order to get a mail delivery robot.
So I've drawn those two components, with little arrows to denote that, say, obstacle avoidance is needed, or would help, to build your mail delivery robot. Well, for obstacle avoidance, clearly you need a robot that can navigate, and you need to detect objects so you can avoid the obstacles. Say we're going to use computer vision to detect the objects. We know that lighting sometimes changes, depending on whether it's morning or noon or evening, and lighting causes the apparent color of things to change; so you need an object detection system that's invariant to the specific colors of an object, because the lighting changes. Well, color, as RGB values, is represented by three-dimensional vectors, and so you need to learn when two colors might be the same thing, when the visual appearances of two colors might come from the same object under a lighting change. To understand that properly, we can go out and study the differential geometry of 3D manifolds, because that helps us build a sound theory on which to develop our 3D similarity learning algorithms. And to really understand the fundamental aspects of this problem, we have to study the complexity of non-Riemannian geometries. And on and on it goes, until eventually you're proving convergence bounds for sampled non-monotonic logic. I don't even know what that is, because I just made it up.

Whereas in reality, chances are that link isn't real. Maybe color invariance just barely helped object recognition (I'm making this up), and maybe differential geometry was hardly going to help 3D similarity learning, so that link fails too. Okay? Each of these circles can represent a person, or a research community, or a thought in your head. And there's a very real chance that all these papers get written on the differential geometry of 3D manifolds because some guy once told someone else that it would help 3D similarity learning.
It's like: a friend of mine told me that color invariance would help in object recognition, so I'm working on color invariance. And now I'm going to tell a friend of mine that his thing will help my problem, and he'll tell a friend of his that his thing will help with his problem. Pretty soon, you're working on convergence bounds for sampled non-monotonic logic, when in reality none of it will ever see the light of day in your mail delivery robot. Okay?

I'm not criticizing the role of theory. There are very powerful theories, like the theory of VC dimension, which sits far, far to the right of this diagram; VC dimension is about as theoretical as it gets, and it has clearly had a huge impact on many applications and dramatically advanced machine learning. Another example is the theory of NP-hardness, which again is about as theoretical as it gets, and which has had a huge impact on all of computer science. But when you're off working on highly theoretical things, to me personally it's important to keep in mind: are you working on something like VC dimension, which is high impact, or on something like convergence bounds for sampled non-monotonic logic, which you're only hoping has some peripheral relevance to some application? Okay?

For me personally, and this is a personal choice, I tend to trust a piece of theory only if I can personally see a link from the theory I'm working on all the way back to an application. If I don't personally see a direct link from what I'm doing to an application, that's fine; I can still choose to work on the theory, but I wouldn't necessarily trust that it will relate to an application if I can't see that link all the way back.

Just to summarize.
One lesson to take away from today is that time spent coming up with diagnostics for learning algorithms is often time well spent; it's often up to your own ingenuity to come up with great diagnostics. When I personally work on a machine learning algorithm, it's not uncommon for me to spend between a third and half of my time just writing diagnostics and trying to figure out what's going right and what's going wrong. Sometimes it's tempting not to, because you want to be implementing learning algorithms and making progress; you don't want to spend all this time implementing tests on your learning algorithms, and it doesn't feel like you're getting anything done. But when I implement learning algorithms, at least a third, and quite often half, of my time really is spent implementing those tests, so that I can figure out what to work on, and I think it's one of the best uses of your time. We also talked about error analyses and ablative analyses, and lastly about the different approaches to getting started and the risks of premature statistical optimization. Okay. Sorry I ran over. I'll be here for a few more minutes for your questions.
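To make the diagnostics point concrete, here is one example of the kind of simple diagnostic meant above: a learning curve comparing training and hold-out error, which helps separate bias from variance. The data is synthetic purely so the sketch runs end to end, and scikit-learn and matplotlib are assumed; it's an illustration, not code from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data so the sketch is runnable as-is.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

sizes, train_err, test_err = [], [], []
for m in range(50, len(X_tr) + 1, 50):
    clf = LogisticRegression().fit(X_tr[:m], y_tr[:m])
    sizes.append(m)
    train_err.append(1 - clf.score(X_tr[:m], y_tr[:m]))
    test_err.append(1 - clf.score(X_te, y_te))

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, test_err, label="hold-out error")
plt.xlabel("training set size m")
plt.ylabel("error")
plt.legend()
plt.show()
# A large, persistent gap between the two curves suggests variance (more data
# or regularization may help); both curves high and close together suggests
# bias (richer features or a bigger model may help).
```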