This presentation is delivered by the Stanford Center for Professional Development.

Okay. So welcome back. What I want to do today is start a new chapter. In particular, I want to talk about learning theory. So in the previous, I guess, eight lectures so far, you've learned about a lot of learning algorithms, and I hope you now understand a little about some of the best and most powerful tools of machine learning. And all of you are now sort of well qualified to go into industry and apply these powerful learning algorithms, really the most powerful learning algorithms we know, to all sorts of problems, and in fact, I hope you start to do that on your projects right away as well.

You might remember, I think it was in the very first lecture, that I made an analogy to learning to be a carpenter: if you imagine you're going to carpentry school to learn to be a carpenter, then only a small part of what you need to do is to acquire a set of tools.
If you learn to be a carpenter, you don't just walk in and pick up a toolbox and [inaudible]. Knowing, when you need to cut a piece of wood, whether to use a rip saw, or a jigsaw, or a keyhole saw, whatever, really mastering the tools, is also an essential part of becoming a good carpenter. And what I want to do in the next few lectures is actually give you a sense of mastery of the machine learning tools all of you have. Okay?

And so in particular, in the next few lectures what I want to do is talk more deeply about the properties of different machine learning algorithms, so that you can get a sense of when it's most appropriate to use each one. And it turns out that one of the most common scenarios in machine learning is that someday you'll be doing research, or working at a company, and you'll apply one of the learning algorithms you learned about. You may apply logistic regression, or support vector machines, or Naive Bayes or something, and for whatever bizarre reason, it won't work as well as you were hoping, or it won't quite do what you were hoping it to.
To me, what really separates the people that really understand and really get machine learning, compared to people that maybe read the textbook and work through the math, will be what you do next. It will be in your decisions: when you apply a support vector machine and it doesn't quite do what you wanted, do you really understand enough about support vector machines to know what to do next and how to modify the algorithm? And to me that's often what really separates the great people in machine learning from the people that just read the textbook and work through the math. Okay?

So today's lecture will mainly be on learning theory, and we'll start to talk about some of the theoretical results of machine learning. The next lecture, later this week, will be on algorithms for addressing, or fixing, some of the problems that the learning theory will point out to us and help us understand. And then two lectures from now, that lecture will be almost entirely focused on practical advice for how to apply learning algorithms. Okay?
So, do you have any questions about this before I start? Okay.

So the very first thing we're gonna talk about is something that you've probably already seen on the first homework, and something that I alluded to previously, which is the bias-variance trade-off. So take ordinary least squares, the first learning algorithm we learned about. If you fit a straight line through this data, this is not a very good model. Right. And if this happens, we say it has underfit the data, or we say that this is a learning algorithm with very high bias, because it is failing to fit the evident quadratic structure in the data. And for the purposes of this discussion, you can informally think of the bias of the learning algorithm as representing the fact that even if you had an infinite amount of training data, even if you had tons of training data, this algorithm would still fail to fit the quadratic function, the quadratic structure in the data. And so we think of this as a learning algorithm with high bias.

Then there's the opposite problem. So that's the same dataset.
If you fit a fourth-order polynomial to this dataset, then you'll be able to interpolate the five data points exactly, but clearly this is also not a great model for the structure that you and I probably see in the data. And we say that this algorithm, excuse me, is overfitting the data, or alternatively that this algorithm has high variance. Okay? And the intuition behind overfitting and high variance is that the algorithm is fitting spurious patterns in the data, or is fitting idiosyncratic properties of this specific dataset, be it a dataset of housing prices or whatever. And quite often, there will be some happy medium of fitting a quadratic function that maybe won't interpolate your data points perfectly, but also captures more structure in your data than a simple model which underfits.
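As an aside, the under- versus overfitting picture above can be reproduced numerically. The sketch below fits polynomials of degree one, two, and four by least squares and compares their training errors; the five data points are made up for illustration, since the lecture's dataset exists only on the board.

```python
import numpy as np

# Hypothetical 5-point dataset with roughly quadratic structure
# (these numbers are invented; roughly y = x^2 + 1 plus small noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 5.1, 9.8, 17.2])

def fit_and_training_error(degree):
    """Least-squares polynomial fit; returns the sum of squared residuals."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))

err_line = fit_and_training_error(1)     # high bias: a line misses the curvature
err_quad = fit_and_training_error(2)     # the "happy medium"
err_quartic = fit_and_training_error(4)  # interpolates all 5 points exactly

print(err_line, err_quad, err_quartic)
```

Note that the quartic's training error is essentially zero, yet that is exactly the overfitting case: zero training error says nothing about how the model behaves between or beyond the five points.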
And I'd say that you can have exactly the same picture for classification problems as well. So let's say this is my training set, right, of positive and negative examples, and so you can fit logistic regression with very high order polynomial features, so h of x equals the sigmoid function applied to a tenth-order polynomial. And if you do that, maybe you get a decision boundary like this. Right. That does indeed perfectly separate the positive and negative classes, but this is another example of overfitting. And in contrast, if you fit logistic regression to this data with just the linear features, with none of the quadratic features, then maybe you get a decision boundary like that, which can also underfit. Okay.
So what I want to do now is understand this problem of overfitting versus underfitting, of high bias versus high variance, more explicitly. I will do that by posing a more formal model of machine learning and then trying to prove when each of these two twin problems comes up. And as the motivating example for our initial foray into learning theory, I want to talk about learning a linear classifier, in which h of x is equal to g of theta transpose x. Okay? So a linear classifier. And for this class I'm going to use, for g of z, the indicator that z is greater than or equal to zero.

With apologies in advance for changing the notation yet again: for the support vector machine lectures we used y equals minus one or plus one. For the learning theory lectures, it turns out it'll be a bit cleaner if I switch back to y equals zero or one again, so I'm gonna switch back to my original notation.
And so you can think of this as being similar to logistic regression, except that now we're going to force the logistic regression algorithm to output labels that are either zero or one. Okay? So you can think of this as a classifier that outputs the labels zero or one instead of probabilities.

And so as usual, let's say we're given a training set of m examples. That's just my notation for writing a set of m examples ranging over i equals one through m. And I'm going to assume that the training examples (x(i), y(i)) are drawn IID, independently and identically distributed, from some distribution, script D. Okay? And if you're running a classification problem on houses, like features of the house comma whether the house will be sold in the next six months, then this is just the joint distribution over features of houses and whether or not they'll be sold. Okay? So I'm gonna assume that training examples are drawn IID from some probability distribution, script D.
Well, same thing for spam: if you're trying to build a spam classifier, then this would be the distribution over what emails look like, comma, whether they are spam or not.

And in particular, to understand, or to simplify, to understand the phenomena of bias and variance, I'm actually going to use a simplified model of machine learning. Logistic regression fits its parameters to a model like this by maximizing the log likelihood, but in order to understand learning algorithms more deeply, I'm just going to assume a simplified model of machine learning. Let me just write that down.

So I'm going to define training error as, so this is the training error of a hypothesis h subscript theta. I write this epsilon hat of h subscript theta. If I want to make the dependence on a training set explicit, I'll write this with a subscript S there, where S is the training set. And I'll define this as, let's see, epsilon hat of h equals one over m, times the sum from i equals one to m, of the indicator that h of x(i) is not equal to y(i). Okay. I hope the notation is clear. This is a sum of indicator functions for whether your hypothesis misclassifies the i-th example.
And so when you divide by m, this is just, in your training set, the fraction of training examples your hypothesis misclassifies. So that's defined as the training error. And training error is also called empirical risk.

The simplified model of machine learning I'm gonna talk about is called empirical risk minimization. And in particular, I'm going to assume that the way my learning algorithm works is it will choose parameters theta that minimize my training error. Okay? And it will be this learning algorithm that we'll prove properties about. And it turns out that you can think of this as the most basic learning algorithm, the algorithm that minimizes your training error. It turns out that logistic regression and support vector machines can be formally viewed as approximations to it, because it turns out that if you actually want to do this, this is a non-convex optimization problem. It's actually NP-hard to solve this optimization problem.
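As a minimal sketch of empirical risk minimization for the classifier h(x) = 1{theta transpose x >= 0}, the code below computes training error and then brute-forces the minimization over a small grid of parameter vectors. The dataset and the grid are made up for illustration, and grid search is only feasible here because the example is tiny; in general, as noted, minimizing 0-1 training error is a hard non-convex problem.

```python
import numpy as np

def training_error(theta, X, y):
    """Empirical risk: fraction of training examples that the linear
    classifier h(x) = 1{theta^T x >= 0} gets wrong."""
    predictions = (X @ theta >= 0).astype(int)
    return float(np.mean(predictions != y))

# Tiny invented 1-D dataset with an intercept feature [1, x]:
# positives for x >= 2, negatives below.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])

# Brute-force ERM over a small grid of candidate parameter vectors.
candidates = [np.array([b, 1.0]) for b in np.linspace(-4, 4, 81)]
best = min(candidates, key=lambda th: training_error(th, X, y))
print(training_error(best, X, y))
```

Here the grid contains a separating classifier, so the minimized training error comes out to zero; that is precisely why, in what follows, we need to ask what low training error implies about generalization.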
And logistic regression and support vector machines can both be viewed as approximations to this non-convex optimization problem, obtained by finding a convex approximation to it. So think of this as similar to what algorithms like logistic regression are doing.

So let me take that definition of empirical risk minimization and actually just rewrite it in a different, equivalent way. For the results I want to prove today, it turns out that it will be useful to think of our learning algorithm as not choosing a set of parameters, but as choosing a function. So let me say what I mean by that. Let me define the hypothesis class, script H, as the class of all hypotheses, in other words, as the class of all linear classifiers, that your learning algorithm is choosing from. Okay? So h subscript theta is a specific linear classifier, and each of these is a function mapping from the input domain X to the class labels zero and one.
Each of these is a function, and as you vary the parameters theta, you get different functions. And so let me define the hypothesis class script H to be the class of all functions that, say, logistic regression can choose from. Okay. So this is the class of all linear classifiers. And so I'm going to define, or maybe redefine, empirical risk minimization as, instead of writing it as choosing a set of parameters, I want to think of it as choosing a function h hat in the hypothesis class script H that minimizes my training error. Okay?

So actually, can you raise your hand if it makes sense to you why this is equivalent to the previous formulation? Okay, cool. Thanks. So the reason it's useful to think of algorithms as choosing a function from the class instead is that, in the more general case, this set, script H, can be some other class of functions. Maybe it's the class of all functions represented by a neural network, or some other class of functions the learning algorithm wants to choose from.
And this definition of empirical risk minimization will still apply. Okay?

So what we'd like to do is understand whether empirical risk minimization is a reasonable algorithm. Alex?

Student: [Inaudible] a function that's defined by g of theta transpose x, or is it now more general?

Instructor (Andrew Ng): I see, right. So let's see, I guess the question is: is h theta still defined by g of theta transpose x, or is this more general? So, there are really two answers to that. One is, this framework is general. So for the purpose of this lecture, it may be useful to you to keep in mind the example where script H is the class of all linear classifiers, such as those used by, like, the perceptron algorithm or logistic regression. Everything on this board, however, is actually more general. H can be any set of functions mapping from the input domain X to the set of class labels zero and one, and then you can perform empirical risk minimization over any hypothesis class.
For the purpose of today's lecture, I am going to restrict myself to talking about binary classification, but it turns out everything I say generalizes to regression and other problems as well. Does that answer your question?

Student: Yes.

Instructor (Andrew Ng): Cool. All right. So I wanna understand whether empirical risk minimization is a reasonable algorithm. In particular, what are the things we can prove about it? So clearly we don't actually care about training error; we don't really care about making accurate predictions on the training set, or at least that's not the ultimate goal. The ultimate goal is generalization: how well it makes predictions on examples that we haven't seen before. How well it predicts prices, or sale or no-sale outcomes, of houses you haven't seen before. So what we really care about is generalization error, which I write as epsilon of h. And this is defined as the probability that if I sample a new example, x comma y, from that distribution script D, my hypothesis mislabels that example.
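To make the distinction between training error and generalization error concrete, here is a small Monte Carlo sketch under an assumed distribution D (a noisy threshold rule; all the specifics are invented for illustration). We pick the empirical-risk-minimizing threshold on a small training set, then estimate epsilon of h hat on a large fresh sample from D.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n (x, y) pairs from an invented distribution D:
    x uniform on [0, 4], label y = 1{x >= 2}, flipped with probability 0.1."""
    x = rng.uniform(0, 4, size=n)
    y = (x >= 2).astype(int)
    flip = rng.uniform(size=n) < 0.1
    return x, np.where(flip, 1 - y, y)

def error(threshold, x, y):
    """Fraction of examples that h(x) = 1{x >= threshold} mislabels."""
    return float(np.mean((x >= threshold).astype(int) != y))

x_train, y_train = sample(20)
# ERM over the class of 1-D threshold classifiers.
thresholds = np.linspace(0, 4, 401)
h_hat = min(thresholds, key=lambda t: error(t, x_train, y_train))

train_err = error(h_hat, x_train, y_train)   # epsilon hat: measured on S
x_test, y_test = sample(100_000)
gen_err = error(h_hat, x_test, y_test)       # Monte Carlo estimate of epsilon
print(train_err, gen_err)
```

Because of the 10 percent label noise, the generalization error of any threshold classifier under this D is at least about 0.1, whatever the training error happens to be; that gap between epsilon hat and epsilon is exactly what the theory below is about.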
And in terms of notational convention, usually if I place a hat on top of something, it usually means, not always, but it usually means, that it is an attempt to estimate the thing without the hat. So for example, think of epsilon hat, the training error, as an attempt to approximate the generalization error. Okay, so the notational convention is that things with hats on top are usually things we're using to estimate other quantities. And h hat is the hypothesis output by the learning algorithm, an attempt to estimate the best function from X to Y.

So let's actually prove some things about when empirical risk minimization will do well, in the sense of giving us low generalization error, which is what we really care about.

In order to prove our first learning theory result, I'm going to have to state two lemmas. The first is the union bound, which is the following: let A1 through Ak be k events. And when I say events, I mean events in the sense of a probabilistic event that either happens or not. And these are not necessarily independent.
So there's some joint distribution over the events A1 through Ak, and maybe they're independent, maybe not; no assumption on that. Then the probability of A1 or A2 or dot, dot, dot, up to Ak, this union symbol, this is just set notation for probability, it just means "or". So the probability of at least one of these events occurring, of A1 or A2 or up to Ak, is less than or equal to the probability of A1, plus the probability of A2, plus dot, dot, dot, plus the probability of Ak. Okay?

So the intuition behind this is just, I'm not sure if you've seen Venn diagram depictions of probability before; if you haven't, what I'm about to do may be a little cryptic, so just ignore what I'm about to do if you haven't seen it before. But if you have seen it before, then this is really great: the probability of A1 union A2 union A3 is less than or equal to P of A1, plus P of A2, plus P of A3. Right.
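The union bound can be checked numerically. In this sketch the three events are deliberately overlapping (and therefore dependent) functions of one uniform draw; the particular events are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three dependent events defined on a common uniform draw U:
#   A1 = {U < 0.3},  A2 = {U > 0.8},  A3 = {0.2 < U < 0.4}  (A1 and A3 overlap)
u = rng.uniform(size=100_000)
a1, a2, a3 = u < 0.3, u > 0.8, (u > 0.2) & (u < 0.4)

p_union = float(np.mean(a1 | a2 | a3))                     # P(A1 or A2 or A3)
p_sum = float(np.mean(a1) + np.mean(a2) + np.mean(a3))     # P(A1)+P(A2)+P(A3)
print(p_union, p_sum)
```

The inequality holds sample by sample, since the indicator of a union never exceeds the sum of the individual indicators, so the Monte Carlo estimate of the union probability is guaranteed not to exceed the sum; the slack (here the overlap of A1 and A3) is exactly the double-counted mass in the Venn diagram picture.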
So the total mass in the union of these three sets is at most the sum of the masses in the three individual sets; it's not very surprising. It turns out that depending on how you define your axioms of probability, this is actually one of the axioms, so I won't actually try to prove this. This is usually taken as axiomatic; subadditivity of probability measures is what this is sometimes called as well. But in learning theory it's commonly called the union bound, so I'll just call it that.

The other lemma I need is called the Hoeffding inequality. And again, I won't actually prove this, I'll just state it. Let Z1 up to Zm be m IID Bernoulli random variables with mean Phi. So the probability that Zi equals one is equal to Phi. So let's say you observe m IID Bernoulli random variables and you want to estimate their mean. So let me define Phi hat, and this is again that notational convention: Phi hat denotes an attempt to estimate something else.
So we define Phi hat to be one over m, times the sum from i equals one through m of Zi. Okay? So this is our attempt to estimate the mean of these Bernoulli random variables by sort of taking their average. And let any gamma be fixed. Then the Hoeffding inequality is that the probability that your estimate of Phi is more than gamma away from the true value of Phi is bounded by two e to the negative two gamma squared m. Okay?

So just in pictures: this theorem holds, this lemma, the Hoeffding inequality, is just a statement of fact, it just holds true. But let me now draw a cartoon to describe some of the intuition behind this, I guess. So let's say this is the real number line from zero to one, and Phi is the mean of your Bernoulli random variables. You will remember from, you know, some undergraduate probability or statistics class, the central limit theorem, which says that when you average a lot of things together, you tend to get a Gaussian distribution.
And so when you toss M coins with bias Phi, we observe these M Bernoulli random variables, and we average them; then the probability distribution of Phi hat will roughly be a Gaussian, let's say. Okay? It turns out, if you haven't seen this before, that it's actually the cumulative distribution function of Phi hat that converges to that of a Gaussian. Technically, Phi hat can only take on a discrete set of values, because these are fractions of one over M, so it doesn't really have a density; but just as a cartoon, think of it as converging roughly to a Gaussian. So what the Hoeffding inequality says is that if you pick a value of gamma, so there's one interval gamma and another interval gamma, then it's saying that the probability mass in the tails, in other words the probability that my value of Phi hat is more than gamma away from the true value, that the total probability mass in these tails is at most two E to the negative two gamma squared M. Okay? That's what the Hoeffding inequality says; if you can't read that, this is just the right hand side of the bound, two E to the negative two gamma squared M.
So it bounds the probability that you make a mistake in estimating the mean of a Bernoulli random variable. And the cool thing about this bound, the interesting thing behind this bound, is that it [inaudible] exponentially in M. So it says that for a fixed value of gamma, as you increase the size of your training set, as you toss a coin more and more, then the width of this Gaussian will shrink. The width of this Gaussian will actually shrink like one over root M. And that will cause the probability mass left in the tails to decrease exponentially quickly, as a function of M. And this will be important later. Yeah? Student: Does this come from the central limit theorem [inaudible]? Instructor (Andrew Ng): No, it doesn't. So this is proved by a different... no, so there may be a version of the central limit theorem, but the versions I'm familiar with tend to be sort of asymptotic, whereas this works for any finite value of M. Oh, and this bound holds even if M is equal to two, or if M is [inaudible]; if M is very small, the central limit theorem approximation is not gonna hold, but this theorem holds regardless. Okay?
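To make the lemma concrete, here is a small simulation sketch comparing the empirical tail probability against the Hoeffding bound. The helper names and the specific values of phi, m, and gamma below are illustrative assumptions, not anything from the lecture.

```python
import math
import random

def hoeffding_bound(gamma, m):
    """Right-hand side of the Hoeffding inequality: 2 * exp(-2 * gamma^2 * m)."""
    return 2.0 * math.exp(-2.0 * gamma ** 2 * m)

def tail_probability(phi, m, gamma, trials=1000, seed=0):
    """Monte Carlo estimate of P(|phi_hat - phi| > gamma), where phi_hat is
    the average of m IID Bernoulli(phi) draws."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        phi_hat = sum(1 if rng.random() < phi else 0 for _ in range(m)) / m
        if abs(phi_hat - phi) > gamma:
            bad += 1
    return bad / trials

# The empirical tail mass should sit below the bound (here 2 * e^{-2}, about 0.27).
empirical = tail_probability(phi=0.5, m=100, gamma=0.1)
bound = hoeffding_bound(gamma=0.1, m=100)
```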
I'm drawing this just as a cartoon to help explain the intuition, but this theorem just holds true, without reference to the central limit theorem. All right. So let's start to understand empirical risk minimization, and what I want to do is begin with studying empirical risk minimization for a [inaudible] case than logistic regression, and in particular I want to start with studying the case of finite hypothesis classes. So let's say script H is a class of K hypotheses. Right. So this is K functions, where, you know, each of these is just a function mapping from inputs to outputs; there are no parameters in this. And so what empirical risk minimization would do is it would take the training set, and it'll then look at each of these K functions, and it'll pick whichever of these functions has the lowest training error. Okay?
So now, logistic regression uses a continuous, infinitely large class of hypotheses, script H; but to prove our first learning theorem I actually want to just describe the case of when you have a finite hypothesis class, and then we'll later generalize that to infinite hypothesis classes. So empirical risk minimization takes the hypothesis with the lowest training error, and what I'd like to do is prove a bound on the generalization error of H hat. All right. So in other words, I'm gonna prove that somehow minimizing training error allows me to do well on generalization error. And here's the strategy. The first step in this proof: I'm going to show that training error is a good approximation to generalization error, and then I'm going to show that this implies a bound on the generalization error of the hypothesis output by empirical risk minimization.
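The procedure just described, picking whichever of the K functions has the lowest training error, can be sketched in a few lines. Representing hypotheses as plain Python functions and using threshold classifiers as the toy class are illustrative assumptions of mine, not the lecture's notation.

```python
def training_error(h, examples):
    """Empirical error: fraction of (x, y) pairs that hypothesis h misclassifies."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

def erm(hypotheses, examples):
    """Empirical risk minimization over a finite class: return the hypothesis
    in the class with the lowest training error."""
    return min(hypotheses, key=lambda h: training_error(h, examples))

# A toy finite class of K = 3 threshold classifiers on the real line.
H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.25, 0.5, 0.75)]
data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
h_hat = erm(H, data)  # picks the threshold at 0.5, which fits every example
```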
And I just realized, this class, I guess, is also maybe a slightly notation-heavy class, in that I'm introducing a reasonably large set of new symbols. So again, if in the course of today's lecture you're looking at some symbol and you don't quite remember what it is, please raise your hand and ask. [Inaudible] what's that, what was that, was that a generalization error or was it something else? So raise your hand and ask if you don't understand the notation I was defining. Okay. So let me introduce this in two steps. The strategy is, I'm gonna show that training error is a good approximation to generalization error, and this will imply that minimizing training error will also do pretty well in terms of minimizing generalization error. And this will give us a bound on the generalization error of the hypothesis output by empirical risk minimization. Okay? So here's the idea. Let's not even consider all the hypotheses at once; let's pick any hypothesis, HJ, in the class script H, and until further notice let's just consider that one fixed hypothesis. So pick any one hypothesis and let's talk about that one.
Let me define ZI to be the indicator function for whether this hypothesis misclassifies the i-th example; excuse me, Z subscript I. Okay? So ZI would be zero or one, depending on whether this one hypothesis, which is the only one I'm gonna even consider for now, whether this hypothesis misclassified that example. And so my training set is drawn randomly from some distribution script D, and depending on what training examples I've got, these ZIs would be either zero or one. So let's figure out what the probability distribution of ZI is. Well, ZI takes on the value of either zero or one, so clearly it's a Bernoulli random variable; it can only take on these values. Well, what's the probability that ZI is equal to one? In other words, for a fixed hypothesis HJ, when I sample my training set IID from distribution D, what is the chance that my hypothesis will misclassify it? Well, by definition, that's just the generalization error of my hypothesis HJ.
So ZI is a Bernoulli random variable with mean given by the generalization error of this hypothesis. Raise your hand if that made sense. Oh, cool. Great. And moreover, all the ZIs have the same probability of being one, and all my training examples were drawn IID, and so the ZIs are also independent, and therefore the ZIs themselves are IID random variables. Okay? Because my training examples were drawn independently of each other, by assumption. If you read this as the definition of training error, the training error of my hypothesis HJ is just that: the average of my ZIs, which, well, I previously defined like this. Okay? And so epsilon hat of HJ is exactly the average of M IID Bernoulli random variables, drawn from a Bernoulli distribution with mean given by the generalization error. So, well, this is the average of M IID Bernoulli random variables, each of which has mean given by the generalization error of HJ.
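The argument above can be checked numerically: for a fixed hypothesis, the misclassification indicators are IID Bernoulli variables whose mean is the generalization error, so their average, the training error, should concentrate around it. The toy distribution D below, uniform inputs with ten percent label noise, is an assumption of mine chosen so the generalization error is known to be exactly 0.1; it is not from the lecture.

```python
import random

def misclassification_indicators(h, examples):
    """Z_i = 1 if h misclassifies the i-th example, else 0."""
    return [1 if h(x) != y else 0 for x, y in examples]

def sample_example(rng):
    """Toy distribution D (hypothetical, for illustration only): x uniform on
    [0, 1], label 1{x >= 0.5}, flipped with probability 0.1."""
    x = rng.random()
    y = 1 if x >= 0.5 else 0
    if rng.random() < 0.1:
        y = 1 - y
    return (x, y)

# For h(x) = 1{x >= 0.5}, the generalization error under D is the noise rate,
# 0.1, so epsilon_hat = (1/m) * sum(Z_i) should concentrate near 0.1.
rng = random.Random(0)
h = lambda x: 1 if x >= 0.5 else 0
examples = [sample_example(rng) for _ in range(5000)]
z = misclassification_indicators(h, examples)
epsilon_hat = sum(z) / len(z)
```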
And therefore, by the Hoeffding inequality, we have that the probability that the difference between training and generalization error is greater than gamma is less than or equal to two E to the negative two gamma squared M. Okay? Exactly by the Hoeffding inequality. And what this proves is that, for my fixed hypothesis HJ, my training error epsilon hat will, with high probability, be close to the generalization error, assuming M is large: if M is large, then this thing on the right hand side will be small, because this is two E to the negative two gamma squared M. So this says that if my training set is large enough, then the probability that my training error is far from generalization error, meaning that it is more than gamma away, will be small, will be bounded by this thing on the right hand side. Okay? Now, here's the [inaudible] tricky part: what we've done is prove this bound for one fixed hypothesis, for HJ.
What I want to prove is that training error will be a good estimate for generalization error not just for this one hypothesis HJ, but actually for all K hypotheses in my hypothesis class script H. So let's do it; well, better do it on a new board. So in order to show that, let me define a random event. Let me define AJ to be the event that, you know, the difference between training and generalization error is more than gamma on the hypothesis HJ. And so what we put on the previous board was that the probability of AJ is less than or equal to two E to the negative two gamma squared M, and this is pretty small. Now, what I want to bound is the probability that there exists some hypothesis in my class script H such that I make a large error in my estimate of generalization error. Okay? Such that this holds true. So this is really just the probability that there exists a hypothesis for which this holds.
This is really the probability that A one, or A two, or up to AK, holds. The chance that there exists such a hypothesis is just, well, the probability that for hypothesis one I make a large error in estimating the generalization error, or for hypothesis two I make a large error in estimating generalization error, and so on. And so by the union bound, this is less than or equal to that, which is therefore less than or equal to that, which is equal to that. Okay? So let me just take one minus both sides of the equation on the previous board. Let me take one minus both sides, so the probability that there does not exist a hypothesis such that that holds, the probability that there does not exist a hypothesis on which I make a large error in this estimate, well, this is equal to the probability that, for all hypotheses, I make an error of at most gamma in my estimate of generalization error. And taking one minus on the right hand side, I get one minus two KE to the negative two gamma squared M. Okay?
And so, and the direction of the inequality flipped because I took one minus both sides; the minus sign flips the direction of the inequality. So what we've shown is that with probability, which I'll abbreviate to WP, with probability one minus two KE to the negative two gamma squared M, we have that epsilon hat of H will be within gamma of epsilon of H, simultaneously for all hypotheses in our class script H. And so, just to give this result a name, this is one instance of what's called a uniform convergence result, and the term uniform convergence sort of alludes to the fact that this shows that as M becomes large, these epsilon hats will all simultaneously converge to epsilon of H. That training error will become very close to generalization error simultaneously for all hypotheses H. That's what the term uniform refers to: the fact that this converges for all hypotheses H, and not just for one hypothesis.
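In code, the quantity just derived, the probability that uniform convergence fails, is a one-line formula. A sketch, with example values of my own choosing:

```python
import math

def uniform_convergence_failure_bound(k, gamma, m):
    """Union bound over k hypotheses:
    P(exists h in H with |eps_hat(h) - eps(h)| > gamma) <= 2k * exp(-2 gamma^2 m).
    So with probability at least 1 minus this value, training error is within
    gamma of generalization error simultaneously for all k hypotheses."""
    return 2.0 * k * math.exp(-2.0 * gamma ** 2 * m)

# Example: even with k = 1000 hypotheses, m = 5000 examples make the failure
# probability for gamma = 0.05 tiny.
failure = uniform_convergence_failure_bound(k=1000, gamma=0.05, m=5000)
```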
And so what we've shown is one example of a uniform convergence result. Okay? So let me clean a couple more boards. I'll come back and ask what questions you have about this. We should take another look at this and make sure it all makes sense. Yeah, okay. What questions do you have about this? Student: How is the value of gamma computed [inaudible]? Instructor (Andrew Ng): Right. Yeah. So let's see, the question is, how is the value of gamma computed? So for these purposes, for the purposes of this, gamma is a constant. Imagine gamma is some constant that we chose in advance, and this is a bound that holds true for any fixed value of gamma. Later on, as we take this bound and sort of develop this result further, we'll choose specific values of gamma as a [inaudible] of this bound. For now, we'll just imagine that when we prove this, it holds true for any value of gamma. Any questions? Yeah? Student: [Inaudible] hypothesis class is infinite [inaudible]? Instructor (Andrew Ng): Yes, if the hypothesis class is infinite, then this simple result won't work in its present form, but we'll generalize this, probably won't get to it today, but we'll generalize this at the beginning of the next lecture, to infinite hypothesis classes. Student: How do we use this theory [inaudible]?
Instructor (Andrew Ng): How do you use this theory? So let me, I might get to a little of that later today; we'll talk concretely about algorithms, about the consequences of the understanding of these things, in the next lecture as well. Yeah, okay? Cool. Can you just raise your hand if the things I've proved so far make sense? Okay. Cool. Great. Thanks. All right. Let me just take this uniform convergence bound and rewrite it in a couple of other forms. So this is sort of a bound on probability. This is saying: suppose I fix my training set, and then fix my error threshold gamma; what is the probability that uniform convergence holds? And, well, that's my formula that gives the answer. This is the probability of something happening. So there are actually three parameters of interest. One is, what is this probability? The other parameter is, what's the training set size M? And the third parameter is, what is the value of this error threshold gamma? I'm not gonna vary K for these purposes.
So there are two other equivalent forms of the bound. So you can ask: given gamma, so what we proved was, given gamma and given M, what is the probability of uniform convergence? The other equivalent form is: given gamma and the probability delta of making a large error, how large a training set size do you need in order to give a uniform convergence bound with parameters gamma and delta? And so, well, if you set delta to be two KE to the negative two gamma squared M, this is that form that I had on the left, and if you solve for M, what you find is that there's an equivalent form of this result that says that, so long as your training set size M is greater than this. And this is the formula that I get by solving for M. Okay?
So long as M is greater than or equal to this, then with probability, which I'm abbreviating to WP again, with probability at least one minus delta, we have this for all hypotheses. Okay? So this says how large a training set size I need to guarantee that, with probability at least one minus delta, the training error is within gamma of generalization error for all my hypotheses, and this gives the answer. And just to give this another name, this is an example of a sample complexity bound. So from undergrad computer science classes you may have heard of computational complexity, which is how much computation you need to achieve something. So sample complexity just means how large a training set, how large a sample of examples, you need in order to achieve a certain bound on the error. And it turns out that many of the theorems we write out can be posed in the form of a probability bound, or a sample complexity bound, or in some other form. I personally often find the sample complexity bounds the easiest to interpret, because they say how large a training set you need to give a certain bound on the errors.
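The sample complexity form can be written out directly by solving delta equals two K E to the negative two gamma squared M for M. A sketch; the function name below is mine:

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest m (as a real number) for which the union bound guarantees
    that, with probability at least 1 - delta, every one of the k hypotheses
    has training error within gamma of its generalization error:
        m >= (1 / (2 * gamma^2)) * log(2k / delta)."""
    return math.log(2.0 * k / delta) / (2.0 * gamma ** 2)

# Doubling the number of hypotheses adds only log(2) / (2 gamma^2) examples,
# illustrating the logarithmic dependence on k.
m_100 = sample_complexity(k=100, gamma=0.1, delta=0.05)
m_200 = sample_complexity(k=200, gamma=0.1, delta=0.05)
```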
And in fact, well, we'll see this later, sample complexity bounds often sort of help to give guidance: really, if you're trying to achieve something on a machine learning problem, this is trying to give guidance on how much training data you need to prove something. The one thing I want to note here is that M grows like the log of K, right? And the log of K grows extremely slowly as a function of K. The log is one of the slowest growing functions, right. It's one of, well, some of you may have heard this, right? That for all values of K, right, I learned this from a colleague, Andrew Moore, at Carnegie Mellon, that in computer science, for all practical purposes, for all values of K, log K is less than [inaudible]; this is almost true. So the log is one of the slowest growing functions, and so the fact that M, the sample complexity, grows like the log of K means that you can increase the number of hypotheses in your hypothesis class quite a lot, and the number of training examples you need won't grow very much. [Inaudible].
This property will be important later when we talk about infinite hypothesis classes. The final form is, I guess, sometimes called the error bound, which is what you get when you hold M and delta fixed and solve for gamma. And what you get then is that, with probability at least one minus delta, we have that, for all hypotheses in my hypothesis class, the difference between training and generalization error will be less than or equal to that. Okay? And that's just solving for gamma and plugging in the value I get there. Okay? All right. So the second step of the overall proof I want to execute is the following. The result so far is essentially that uniform convergence will hold true with high probability. What I want to show now is, let's assume that uniform convergence holds. So let's assume that for all hypotheses H, we have that epsilon of H minus epsilon hat of H is less than or equal to gamma in absolute value. Okay?
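The error-bound form from a moment ago, solving for gamma with M and delta held fixed, is equally short to write down. Again a sketch, with a function name of my own choosing:

```python
import math

def error_bound(k, m, delta):
    """Solve delta = 2k * exp(-2 * gamma^2 * m) for gamma: with probability
    at least 1 - delta, every hypothesis in a class of size k satisfies
        |eps_hat(h) - eps(h)| <= sqrt(log(2k / delta) / (2m))."""
    return math.sqrt(math.log(2.0 * k / delta) / (2.0 * m))

# Plugging the resulting gamma back into the failure probability recovers delta,
# and quadrupling m halves the bound (the one-over-root-m behavior from earlier).
gamma = error_bound(k=100, m=10_000, delta=0.05)
```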
What I want to do now is use this to see what we can prove about the bound, to see what we can prove about the generalization error. So I want to know: supposing this holds true, can we prove something about the generalization error of H hat, where again, H hat was the hypothesis selected by empirical risk minimization? Okay? So in order to show this, let me make one more definition. Let me define H star to be the hypothesis in my class script H that has the smallest generalization error. So this is: if I had an infinite amount of training data, or if I really could go in and find the best possible hypothesis, the best possible hypothesis in the sense of minimizing generalization error, what's the hypothesis I would get? Okay? So in some sense, it sort of makes sense to compare the performance of our learning algorithm to the performance of H star, because we clearly can't hope to do better than H star.
Another way of saying that is, if your hypothesis class is the class of all linear decision boundaries, and the data just can't be separated by any linear function, so even H star is really bad, then there's just not much hope that your learning algorithm could do better than H star.

So I'll actually prove this result in three steps. The generalization error of H hat, the hypothesis I chose, is going to be less than or equal to that. Actually, let me number these equations. This is because of equation one, because I see that epsilon of H hat and epsilon hat of H hat are within gamma of each other.

Now, by the definition of empirical risk minimization, H hat was chosen to minimize training error, and so there can't be any hypothesis with lower training error than H hat. So the training error of H hat must be less than or equal to the training error of H star. So this is by two, or by the definition of H hat as the hypothesis that minimizes training error.
And the final step is, I'm going to apply this uniform convergence result again. We know that epsilon hat of H star must be within gamma of epsilon of H star. And so this is at most plus gamma, and then I have my original gamma there. Okay? And so this is by equation one again, because I know the training error of H star must be within gamma of the generalization error of H star. And so, well, I'll just write this as plus two gamma. Okay? Yeah?

Student: [Inaudible] notation, is epsilon hat of H hat, that's not the training error, that's the generalization error with the estimate of the hypothesis?

Instructor (Andrew Ng): Oh, okay. Let me just, well, let me write that down on this board. So let me fit this in here. So epsilon hat of H is the training error of the hypothesis H.
In other words, given the hypothesis, and a hypothesis is just a function, right, mapping from Xs to Ys, so epsilon hat of H is: given the hypothesis H, what's the fraction of training examples it misclassifies? And the generalization error of H is: given the hypothesis H, if I sample another example from my distribution script D, what's the probability that H will misclassify that example? Does that make sense?

Student: [Inaudible]?

Instructor (Andrew Ng): Oh, okay. And H hat is the hypothesis that's chosen by empirical risk minimization. So when I talk about empirical risk minimization, it's the algorithm that minimizes training error, and so epsilon hat of H is the training error of H, and H hat is defined as the hypothesis that, out of all hypotheses in my class script H, minimizes the training error epsilon hat of H. Okay? All right. Yeah?

Student: [Inaudible] H hat is a member of script H, the [inaudible] family, right?

Instructor (Andrew Ng): Yes, it is.

Student: So what happens with the generalization error [inaudible]?

Instructor (Andrew Ng): I'll talk about that later.
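For reference, the three-step chain the board argument walks through, with H hat the ERM hypothesis and H star the best hypothesis in the class, can be written out as:

```latex
\varepsilon(\hat{h})
  \le \hat{\varepsilon}(\hat{h}) + \gamma  % (1) uniform convergence applied to \hat{h}
  \le \hat{\varepsilon}(h^{*}) + \gamma    % (2) \hat{h} minimizes training error
  \le \varepsilon(h^{*}) + 2\gamma         % (1) uniform convergence applied to h^{*}
```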
So let me tie all these things together into a theorem. Let there be a hypothesis class given with a finite set of K hypotheses, and let any M and delta be fixed. Then, so I fix M and delta, so this will be the error bound form of the theorem, right? Then, with probability at least one minus delta, we have that the generalization error of H hat is less than or equal to the minimum over all hypotheses in set H of epsilon of H, plus two times that. Okay?

So to prove this, well, this term of course is just epsilon of H star. And so to prove this we set gamma equal to that, this is two times the square root term. To prove this theorem we set gamma equal to that square root term. Say that again?

Student: [Inaudible].

Instructor (Andrew Ng): Wait. Say that again?

Student: [Inaudible].

Instructor (Andrew Ng): Oh, yes. Thank you. That didn't make sense at all. Thanks. Great.
So set gamma to that square root term, and so we know equation one, right, from the previous board, holds with probability at least one minus delta. Right, equation one was the uniform convergence result, i.e., this is equation one from the previous board. So with gamma set equal to this, we know that with probability at least one minus delta this uniform convergence holds, and whenever that holds, that implies, if we call this equation star, I guess, and whenever uniform convergence holds, we showed again on the previous boards that this result holds: that the generalization error of H hat is less than or equal to the generalization error of H star plus two times gamma. Okay? And so that proves this theorem.

So this result sort of helps us to quantify a little bit that bias-variance tradeoff that I talked about near the very start of this lecture.
And in particular, let's say I have some hypothesis class script H that I'm using, maybe the class of all linear functions, in linear regression, or logistic regression with just the linear features. And let's say I'm considering switching to some new class H prime by adding more features. So let's say this is linear and this is quadratic, so the class of all linear functions is a subset of the class of all quadratic functions, and so H is a subset of H prime. And let's say I'm considering, instead of using my linear hypothesis class, switching to a quadratic hypothesis class, or switching to a larger hypothesis class.

Then what are the tradeoffs involved? Well, I proved this only for finite hypothesis classes, but we'll see that something very similar holds for infinite hypothesis classes too. But the tradeoff is, what if I switch from H to H prime, or I switch from linear to quadratic functions? Then epsilon of H star will become better, because the best hypothesis in my hypothesis class will become better.
The best quadratic function, and by best I mean in the sense of generalization error, the quadratic function with the lowest generalization error, has to have equal or, more likely, lower generalization error than the best linear function. So by switching to a more complex hypothesis class you can get this first term to go down.

But what I pay for then is that K will increase. By switching to a larger hypothesis class, the first term will go down, but the second term will increase, because I now have a larger class of hypotheses, and so the second term, K, will increase. And so this is usually called the bias-variance tradeoff, whereby going to a larger hypothesis class, maybe I have the hope of finding a better function, but my risk of not fitting my model so accurately also increases, and that's illustrated by the second term going up when the size of your hypothesis class, when K, goes up.
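To put rough numbers on that second term, here is a small sketch, not from the lecture, assuming the square root form of gamma from earlier, so that the penalty in the theorem is 2 sqrt((1/(2M)) log(2K/delta)). The point is that the penalty grows with K, but only logarithmically:

```python
import math

def penalty(m, k, delta):
    """The 2*gamma term in eps(h_hat) <= eps(h_star) + 2*gamma,
    with gamma = sqrt(log(2k / delta) / (2m)). Growing the hypothesis
    class (larger k) inflates this term, but only logarithmically."""
    return 2 * math.sqrt(math.log(2 * k / delta) / (2 * m))

m, delta = 10_000, 0.05
for k in (10, 1_000, 100_000):
    print(f"k = {k:6d}: penalty = {penalty(m, k, delta):.4f}")
```

So multiplying the class size by ten thousand here costs well under a doubling of the penalty term, which is the tradeoff against the (data-dependent, and not computable here) improvement in epsilon of H star.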
And so, speaking very loosely, we can think of this first term as corresponding maybe to the bias of the learning algorithm, or the bias of the hypothesis class. And you can, again speaking very loosely, think of the second term as corresponding to the variance in your hypothesis, in other words how well you can actually fit the hypothesis class to the data. And by switching to a more complex hypothesis class, your variance increases and your bias decreases.

As a note of warning, it turns out that if you take, say, a statistics class, you've seen definitions of bias and variance, which are often defined in terms of squared error or something. It turns out that for classification problems, there actually is no universally accepted formal definition of bias and variance. For regression problems, there is this squared-error definition. For classification problems, it turns out there have been several competing proposals for definitions of bias and variance.
So when I say bias and variance here, think of these as very loose, informal, intuitive definitions, and not formal definitions. Okay.

The cartoon associated with the intuition I just described would be as follows. Let's say, and everything about the plot will be for a fixed value of M, for a fixed training set size M: on the vertical axis I'll plot error, and on the horizontal axis I'll plot model complexity. And by model complexity I mean, sort of, degree of polynomial, size of your hypothesis class script H, etc. It actually turns out, you remember the bandwidth parameter from locally weighted linear regression? That also has a similar effect in controlling how complex your model is. Model complexity [inaudible] degree of polynomial, I guess. So the more complex your model, the better your training error, and so your training error will tend toward zero as you increase the complexity of your model, because the more complex your model, the better you can fit your training set.
But because of this bias-variance tradeoff, you find that generalization error will come down for a while and then it will go back up. And this regime on the left is when you're underfitting the data, or when you have high bias. And this regime on the right is when you have high variance, or you're overfitting the data. Okay? And this is why a model of sort of intermediate complexity, somewhere here, is often preferable if your goal is to minimize generalization error. Okay?

So that's just a cartoon. In the next lecture we'll actually talk about a number of algorithms for trying to automatically select model complexity, say to get you as close as possible to this minimum, to this area of minimized generalization error.
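A quick way to see the cartoon numerically, a toy sketch, not from the lecture, with hypothetical data I made up: fit polynomials of increasing degree to noisy samples of a cubic, and compare error on the training half against error on a held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a cubic signal plus noise, split into train/validation.
x = rng.uniform(-1, 1, size=60)
y = 3 * x**3 + 0.1 * rng.standard_normal(60)
x_tr, y_tr = x[:30], y[:30]
x_va, y_va = x[30:], y[30:]

def errors(degree):
    """Least-squares polynomial fit on the training half; returns
    (training MSE, validation MSE)."""
    p = np.poly1d(np.polyfit(x_tr, y_tr, degree))
    return np.mean((p(x_tr) - y_tr) ** 2), np.mean((p(x_va) - y_va) ** 2)

for degree in (1, 3, 9):
    tr, va = errors(degree)
    print(f"degree {degree}: train MSE {tr:.4f}  validation MSE {va:.4f}")
```

Training error only goes down as the degree grows, since the polynomial classes are nested, while validation error is high at degree 1 (underfitting, high bias) and typically creeps back up at high degree (overfitting, high variance), tracing out the U-shaped cartoon.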
The last thing I want to do is actually go back to the theorem I wrote out. So the theorem I wrote out was an error bound theorem: this says, for fixed M and delta, with probability one minus delta, I get a bound on gamma, which is what this term is. So the very last thing I wanna do today is just come back to this theorem and write out a corollary where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta, and solve for M.

And if you do that, you get the following corollary. Let H be fixed with K hypotheses, and let any delta and gamma be fixed. Then, in order to guarantee that, let's say I want a guarantee that the generalization error of the hypothesis I choose with empirical risk minimization is at most two times gamma worse than the best possible error I could obtain with this hypothesis class.
Let's say I want this to hold true with probability at least one minus delta. Then it suffices that M is greater than or equal to that. Okay? And this is sort of solving the error bound for M. One thing to convince yourselves of, the easy part of this, is that if you set that term equal to gamma and solve for M, you will get this. One thing I want you to go home and sort of convince yourselves of is that this result really holds true, that it really logically follows from the theorem we've proved. In other words, you can take that formula we wrote and solve for M, and because this is the formula you get for M, that's just, that's the easy part. Then go back and convince yourselves that this theorem is a true fact and that it does indeed logically follow from the other one. In particular, make sure that if you solve for that you really get M greater than or equal to this, and why is it M greater than or equal to that and not M less than or equal to. I can write this down and it sounds plausible, so why don't you just go back and convince yourself it's really true. Okay?
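Solving the error bound for M as described gives the sample-complexity form. As a sketch, again assuming the same square root term as before, so that the corollary reads M greater than or equal to (1/(2 gamma squared)) log(2K/delta):

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest training set size m such that, with probability >= 1 - delta,
    the ERM hypothesis is within 2*gamma of the best of k hypotheses:
    m >= log(2k / delta) / (2 * gamma**2)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

# The required m grows only logarithmically in the number of hypotheses k.
for k in (10, 10_000):
    print(f"k = {k:5d}: m >= {sample_complexity(k, gamma=0.05, delta=0.01)}")
```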
And it turns out that when we prove these bounds in learning theory, very often the constants are sort of loose. So it turns out that when we prove these bounds, usually we're not very interested in the constants, and so I write this as big O of one over gamma squared, log K over delta. And again, the key step in this is that the dependence of M on the size of the hypothesis class is logarithmic. And this will be very important later when we talk about infinite hypothesis classes. Okay? Any questions about this? No? Okay, cool.

So next lecture we'll come back, and we'll actually start from this result again. Remember this; I'll write this down as the first thing I do in the next lecture, and we'll generalize these to infinite hypothesis classes and then talk about practical algorithms for model selection. So I'll see you guys in a couple of days.