This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of the bias-variance tradeoff, of underfitting and overfitting, that we saw in the previous lecture and will see again in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time there is, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class, if H is a set of k hypotheses, and suppose you have some fixed parameters gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation, just dropping constants, I can also write this as that; okay?

So just to quickly remind you of what all the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class script H. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h misclassifying a new example drawn from the same distribution as the training set.
And so this says: in order to guarantee that the generalization error of the hypothesis h-hat output by empirical risk minimization is less than or equal to the best possible generalization error achieved in your hypothesis class plus two times gamma, two times this error threshold, and to guarantee that this holds with probability at least one minus delta, we showed that it suffices for your training set size m to be greater than or equal to this: one over two gamma squared, times log of two k over delta, where again k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on the error; okay?
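Just so the recap is written down in one place, here is the board result in symbols, transcribing what was just said, with epsilon denoting generalization error and h-hat the ERM output:

```latex
% Finite hypothesis class with |H| = k, fixed gamma > 0 and delta > 0.
% With probability at least 1 - delta, the ERM output \hat{h} satisfies
\varepsilon(\hat{h}) \;\le\; \Bigl(\min_{h \in \mathcal{H}} \varepsilon(h)\Bigr) + 2\gamma
\quad\text{whenever}\quad
m \;\ge\; \frac{1}{2\gamma^{2}} \log\frac{2k}{\delta}
\;=\; O\!\Bigl(\frac{1}{\gamma^{2}} \log\frac{k}{\delta}\Bigr).
```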
So what I want to do now is take this result and try to generalize it to the case of infinite hypothesis classes. Here, we said that the set script H is just k specific functions, but you often want to use a model like logistic regression, which is actually parameterized by real numbers. So I'm first going to give an argument that's sort of formally broken, technically somewhat broken, but that conveys useful intuition; and then I'll give the more correct argument, but without proving it, because the full proof is somewhat involved.

So here's the somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression. Say your hypothesis class is the class of all linear decision boundaries; right? So script H is parameterized by d real numbers; okay? For example, if you're applying logistic regression over R^n, then d would be n plus one: logistic regression finds a linear decision boundary parameterized by n plus one real numbers.

When you think about how your hypothesis class is really represented in a computer: computers use zero-one bits to represent real numbers. And so if you use a normal, standard computer, it will usually represent real numbers by what's called double-precision floating-point numbers. And what that means is that each real number is represented by a 64-bit representation; right? You know what floating point is in a computer; 64-bit floating point is what almost all of us use routinely. And so this class, parameterized by d real numbers, is really, as represented in your computer, parameterized by 64 times d bits. Computers can't represent real numbers; they can only represent discrete things. And so for the size of your hypothesis class, in your computer representation you have 64 times d bits that you can flip, and so the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits.

And so this is why it's important that we had log k there; right? So k is therefore two to the 64 d. And if I plug that into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared times log of two to the 64 d over delta, which is that; okay?

So just to be clear, this is the same kind of sample complexity result as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this.
In other words, your number of training examples has to be on the order of d over gamma squared, times log one over delta; okay? And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class: m has to be on the order of something linear in d. That intuition is actually roughly right; I'll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating-point numbers.

So let me go ahead and outline the right way to show this more formally; all right? It turns out the right way to show this more formally involves a much longer argument; the proof is extremely involved, so I'm just going to state the result and not prove it. The full proof is the sort of thing you'd see in a course on learning theory bounds for infinite hypothesis classes.

Here's a definition. Given a set of d points, we say a hypothesis class H shatters the set S if H can realize any labeling on it; okay? And what I mean by realizing any labeling on it, the informal way of thinking about this, is: if a hypothesis class shatters the set S, that means I can take these d points and associate them with any set of labels y; right? So choose any labeling y for each of these d points. And if your hypothesis class shatters S, then that means there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means.

So let me just illustrate this with an example. Let's say H is the class of all linear classifiers in two dimensions, and let's say that S is this set comprising two points; okay?
So there are four possible labelings of these two points. You can choose to label both positive; one positive and one negative; one negative and one positive; or you can label both of them negative. And if the hypothesis class H is the class of all linear classifiers in two dimensions, then for each of these labelings I can find a linear classifier that attains zero training error, on every possible labeling of this set of two points. And so I'll say that the hypothesis class script H shatters this set S of two points; okay?

One more example, to show you a larger one. Suppose my set S is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And once again, for each of these labelings I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again I can say, by definition, that my hypothesis class also shatters this set S. And that's what the terminology means: H can realize any labeling on S. Give it any set of labels and you can find a hypothesis that perfectly separates the positive and negative examples; okay?

So how about this set? Suppose S is now this set of four points. Then there are now 16 labelings we can choose for it; right? That's one, for instance, and this is another one; right? And so I can realize some labelings, but there's no linear decision boundary that can realize this labeling, and so H does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay?
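If it helps to make shattering concrete, here is a small sketch, not from the lecture, that brute-forces the definition in Python: for every labeling of a point set it checks whether some linear classifier realizes it, using a linear-program feasibility test. It assumes numpy and scipy are available.

```python
# Sketch: does the class of linear classifiers shatter a given point set?
# A labeling is realizable iff some (w, b) satisfies y_i * (w . x_i + b) >= 1,
# which for a finite point set is equivalent to strict linear separability.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)          # labels in {-1, +1}
    n, d = X.shape
    # Variables z = [w_1, ..., w_d, b]; constraints -y_i*(x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shatters(points):
    """True if linear classifiers realize every +/-1 labeling of the points."""
    n = len(points)
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=n))

# Three non-collinear points in the plane are shattered; these four are not.
print(shatters([(0, 0), (1, 0), (0, 1)]))            # True
print(shatters([(0, 0), (1, 0), (0, 1), (1, 1)]))    # False (the XOR labeling)
```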
So here's another definition. It's called the VC dimension, after these two people, Vapnik and Chervonenkis. Given a hypothesis class, the Vapnik-Chervonenkis dimension of H, which we usually write as VC(H), is the size of the largest set that is shattered by H. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.

So just as a good example: if H is the class of all linear classifiers in two dimensions, then the VC dimension of H is equal to three, because we saw just now that there was a set S of size three that it could shatter, and, although I won't prove it, it turns out there is no set of size four that it can shatter. And therefore the VC dimension of this class is three. Yeah?

Student: But there are sets of size three that it cannot shatter, right? [Inaudible] was your point.

Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this, that's also a set S, then there are labelings on it that H cannot realize, and so H cannot shatter this set. But that's okay, because there definitely exists some other set of size three that is shattered. So the VC dimension is three. And there is no set of size four that can be shattered. Yeah?

Student: [Inaudible].

Instructor (Andrew Ng): Not according to this definition, no. Right. So again, I can choose my set S to be a set of three points that are all overlapping, three points in exactly the same place. And clearly I can't shatter this set, but that's okay. And I can't shatter this set either, but that's okay, because there are some other sets of size three that I can shatter.

And it turns out this result holds more generally: in any number of dimensions, the VC dimension of the class of linear classifiers in R^n is equal to n plus one.
Okay? So this is in R^n: if you have linear classifiers over an n-dimensional feature space, the VC dimension of that class is equal to n plus one.

So maybe you want to write down what is arguably the best-known result in all of learning theory, I guess, which is this. Let a hypothesis class H be given, and let the VC dimension of H be equal to d. Then, with probability one minus delta, we have that... The formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later. The key to this result is that if you have a hypothesis class with VC dimension d, and now this can be an infinite hypothesis class, what Vapnik and Chervonenkis showed is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that for all hypotheses h, for all the hypotheses in your hypothesis class, the generalization error of h minus the training error of h, the difference between those two things, is bounded above by some complicated formula like this; okay? And thus, with probability one minus delta, we also have the same kind of thing for the ERM hypothesis; okay?

And going from this step to this step is actually something you saw yourself, that we actually proved earlier. Because you remember, in the previous lecture we proved that if you have uniform convergence, then that implies, well, we showed that if generalization error and training error are within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really: the generalization error of h-hat is at most the best possible generalization error plus two times gamma.
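The complicated formula on the board isn't spelled out in the audio; in its usual statement, the Vapnik-Chervonenkis result looks roughly like the following, with epsilon the generalization error, epsilon-hat the training error, and h-star the best hypothesis in the class. Take this as a reconstruction of the standard statement, not a transcription:

```latex
% H with VC(H) = d, trained on m i.i.d. examples.
% With probability at least 1 - \delta, for all h \in H:
\bigl|\varepsilon(h) - \hat{\varepsilon}(h)\bigr|
  \;\le\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right),
% and consequently, for the ERM output \hat{h}:
\qquad
\varepsilon(\hat{h})
  \;\le\; \varepsilon(h^{*}) + O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right).
```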
There are just some constants in front here that I've absorbed into the big-O notation. So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that in order to guarantee that this holds with probability at least one minus delta, it suffices that, and I'm going to write it this way, m equals big-O of d, and I'm going to put gamma and delta in as a subscript there to denote that. If we treat gamma and delta as constants, so that I'm allowed to absorb terms that depend on gamma and delta into the big-O notation, then in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay?

So what we conclude from this is that, for empirical risk minimization algorithms, or less formally, for learning algorithms that try to minimize training error, the intuition to take away is that the number of training examples you need is roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay?

It turns out that for most reasonable hypothesis classes, the VC dimension is very similar to the number of parameters of your model. So, for example, take logistic regression for linear classification: logistic regression in n dimensions has n plus one parameters, and the VC dimension of the class of linear classifiers in n dimensions is also n plus one. So for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model, or in most cases at most some low-order polynomial in the number of parameters of your model.
And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be roughly linear in the number of parameters in your model; okay? There are some somewhat strange examples where what I just said is not true, examples where you have very few parameters but the VC dimension is enormous. But all of the examples I know of that fall into that regime are somewhat strange and degenerate, somewhat unusual, and not the sort of learning algorithms you usually use.

Let's see, a few other things. This result shows that sample complexity is upper bounded by the VC dimension. It also turns out that, in the worst case, sample complexity is lower bounded by the VC dimension. And what that means is that if you have a perfectly nasty learning problem, and the number of training examples you have is less than on the order of the VC dimension, then it is not possible to prove a bound like this. So I guess in the worst case, the sample complexity, the number of training examples you need, is both upper bounded and lower bounded by the VC dimension.

Let's see, questions about this?

Student: Does the proof of this assume any sort of finiteness, like a finite [inaudible], like you have to discretize the real numbers, or [inaudible]?

Instructor (Andrew Ng): Let's see. The proof does not, no. I've actually stated the entirety of the theorem; this is true as stated. It turns out that somewhere in the proof there's a construction called an epsilon net, which is a very clever [inaudible]. But that's part of the proof; it is not an assumption that you need. I'd like to say, back when I was a Ph.D.
student, when I was working through this proof, there was a solid week where I would wake up and go to the office at 9:00 a.m., then start reading the book that led up to this proof, and I'd read from 9:00 a.m. to 6:00 p.m. Then I'd go home, and the next day I'd pick up where I left off. It took me about a whole week that way to understand this proof, so I decided not to inflict that on you.

Just to tie up a couple of loose ends: what I'm about to do is mention a few things that may feel a little bit like random facts, but I'm just going to tie up a couple of loose ends. So, this bound was proved for an algorithm that uses empirical risk minimization, for an algorithm that minimizes 0-1 training error. So one question that some of you have asked is: how about support vector machines; right? How come SVMs don't overfit? Remember, in our discussion of support vector machines, we said that you use kernels and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: it's n plus one, and n is infinite.

It turns out that the class of linear separators with large margin actually has low VC dimension. I want to say this very quickly and informally; it's actually not very important for you to understand the details, but I'm going to say it very informally. Suppose I give you a set of points, and I ask you to consider only the class of lines that separate these points with a large margin. So my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin, say with a margin of at least gamma; okay? And so I won't allow a decision boundary that comes closer than that; like, I won't allow that line because it comes too close to one of my points.
It turns out that if my data points all lie within some sphere of radius R, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to R squared over four gamma squared, rounded up, plus one; okay? This funny symbol here just means rounding up; it's the ceiling symbol.

And it turns out, and there are some subtle things about this result that I'm deliberately not going to talk about, that you can prove that the VC dimension of the class of linear classifiers with large margin is actually bounded. The surprising thing is that this is a bound on the VC dimension that has no dependence on the dimension of the points x. So in other words, your data points x can live in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large margin separator, in trying to find the line that separates your positive and negative examples with a large margin, the support vector machine is automatically trying to find a hypothesis class with small VC dimension, and therefore it does not overfit. Alex?

Student: What is the [inaudible]?

Instructor (Andrew Ng): It's actually defined the same way as in finite-dimensional spaces. So suppose you have, say, countably infinite-dimensional vectors. Normally the norm of x squared is equal to the sum over i of x_i squared, and if x is infinite dimensional, you just define it like that. [Inaudible]. [Crosstalk]

Student: [Inaudible].

Instructor (Andrew Ng): Now, say that again.
Instructor (Andrew Ng): Yes. Although I'm assuming that this is bounded by R.

Student: Oh.

Instructor (Andrew Ng): Yeah, so that ensures that the sum converges. So that's just something people sometimes wonder about.

And lastly, let me tie empirical risk minimization back a little more strongly to the sort of algorithms we've talked about. The theory so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function that is zero here, jumps to one, and looks like that. For this one training example, this may be a plot of the indicator of h of x not equal to y, as a function of z equals theta transpose x; okay? Your training example will be positive or negative, and depending on what the value of theta transpose x is, you either get it right or wrong. So if you have a positive example, then when z is positive you get it right. Suppose you have a negative example, so y equals 0; right? Then if z, which is theta transpose x, is positive, you will get this example wrong; whereas if z is negative, you'd get this example right. And so this is a plot of the indicator of h subscript theta of x not equal to y, where h subscript theta of x is g of theta transpose x; okay?

And so what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification of your training example, and so you'd like this indicator of h of x not equal to y, this indicator function, to be zero. It turns out this step function is clearly a non-convex function.
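Written out, the step function being described is just this, for a classifier h_theta(x) = g(theta transpose x) that predicts one exactly when z = theta transpose x is nonnegative:

```latex
% 0-1 loss on a single example, as a function of z = \theta^{T} x:
\mathbf{1}\{h_{\theta}(x) \ne y\} \;=\;
\begin{cases}
\mathbf{1}\{z < 0\}, & y = 1 \quad (\text{wrong exactly when } z < 0),\\[2pt]
\mathbf{1}\{z \ge 0\}, & y = 0 \quad (\text{wrong exactly when } z \ge 0).
\end{cases}
```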
And so it turns out that, even for linear classifiers, minimizing the 0-1 training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation to this problem. In particular, let me draw a function like that: logistic regression is trying to maximize likelihood, and so it's trying to minimize the negative of the log likelihood. And if you plot the negative of the log likelihood, it turns out to be a function that looks like this. This curve I just drew, you can think of it as a rough approximation to the step function, which is maybe what you're really trying to minimize, because you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization, where instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve above it. By approximating it that way, you get a convex optimization problem, and you can find the maximum likelihood estimate of the parameters for logistic regression.

And it turns out the support vector machine can also be viewed as approximating this step function, with a slightly different approximation: it's flat, zero, over here, and then it goes up as a linear function there. That is called the hinge loss. And so you can think of logistic regression and the support vector machine as different convex approximations for trying to minimize this step function; okay? And that's why, I guess, even though SVMs and logistic regression aren't exactly doing empirical risk minimization, the theory we developed often gives completely appropriate intuitions for SVMs and logistic regression; okay.
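Here is a small sketch, not from the lecture, that computes the three losses being compared, the 0-1 step loss, the logistic loss, and the hinge loss, as functions of the margin. For convenience it uses labels in {-1, +1} rather than the {0, 1} convention used above, so the margin on one example is y times theta transpose x:

```python
# Sketch: the convex surrogates that logistic regression and the SVM
# effectively minimize, compared with the non-convex 0-1 "step" loss.
import numpy as np

def zero_one_loss(margin):
    """Step function: 1 if the example is misclassified, else 0."""
    return (margin <= 0).astype(float)

def logistic_loss(margin):
    """Negative log-likelihood for logistic regression: log(1 + e^{-m})."""
    return np.log1p(np.exp(-margin))

def hinge_loss(margin):
    """SVM hinge loss: zero once the margin exceeds 1, linear below that."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-3.0, 3.0, 7)
for name, loss in [("0-1", zero_one_loss),
                   ("logistic", logistic_loss),
                   ("hinge", hinge_loss)]:
    print(name, np.round(loss(margins), 3))
```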
So that was the last of the loose ends. And if you didn't get this, don't worry too much about it. It's a high-level message; it's just that SVMs and logistic regression are reasonable to think of as approximations to empirical risk minimization algorithms.

What I want to do next is move on to talk about model selection. Before I do that, let me just check for questions about this. Okay. Cool.

So in the theory that we started to develop in the previous lecture, and that we sort of wrapped up with a discussion of VC dimension, we saw that there's often a trade-off between bias and variance. In particular, it is important not to choose a hypothesis that's either too simple or too complex. If your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit, so you have a hypothesis with high bias. And conversely, if you choose a hypothesis that's too complex, you have high variance; you would overfit the data, and you'd fail to generalize well.

So model selection algorithms provide a class of methods to automatically make these tradeoffs between bias and variance; right? Remember the cartoon I drew last time of generalization error? On the x-axis was model complexity, meaning, say, the degree of the polynomial, or the complexity of the regression function, or whatever. If you have too simple a model, you have high generalization error: that's underfitting. And if you have too complex a model, like a 14th or 15th-degree polynomial fit to five data points, then you also have high generalization error, and you're overfitting.

So what I want to do now is just talk about model selection in the abstract; all right? Some examples of model selection problems include, well, I'll run with the example of, let's say, trying to choose the degree of a polynomial; right? What degree polynomial do you want to choose?
Another example of a model selection problem would be trying to choose the parameter tau, which was the bandwidth parameter in locally weighted linear regression, or in some other sort of locally weighted regression. Yet another model selection problem is trying to choose the parameter C in the SVM; right? In the soft margin SVM we had this optimization objective, and the parameter C controls the tradeoff between how much you want, say, a large margin versus how much you want to penalize misclassified examples. So these are three specific examples of model selection problems, and let's come up with a method for automatically choosing among them.

Let's say you have some finite set of models, and let's write these as M1, M2, M3, and so on. For example, this may be the linear classifier, or this may be the quadratic classifier, and so on; okay? Or you may also take the bandwidth parameter tau and discretize it into a range of values, and you're trying to choose from this discrete set of values.

So let's talk about how you would select an appropriate model; all right? Well, one thing you could do is take all of these models, train them on your training set, and then see which model has the lowest training error. So that's a terrible idea, and why is that?

Student: [Inaudible].

Instructor (Andrew Ng): Right. Cool. Because of overfitting; right. And some of you are laughing that I even asked that. It would be a terrible idea to choose a model by looking at your training error because, obviously, you'd end up choosing the most complex model; you'd choose a 10th-degree polynomial because that's what fits the training set best. So for model selection, given a training set, there are several standard procedures to do this.
One is hold-out cross validation. In hold-out cross validation, we take a training set and randomly split it into two subsets: take all the data you have and randomly split it into two subsets, and we'll call them the training subset and the hold-out cross validation subset. Then you train each model on just the training subset, and test it on your hold-out cross validation set, and you pick the model with the lowest error on the hold-out cross validation subset; okay? So this is a relatively straightforward procedure, and it's commonly used: you train on, say, 70 percent of the data, then test all of your models on the other 30 percent, and you take whichever has the smallest hold-out cross validation error.

And after this you actually have a choice. Having trained all of these hypotheses on 70 percent of the data, you can just output the hypothesis that has the lowest error on your hold-out cross validation set. Or, optionally, you can take the model that you selected and go back and retrain it on all 100 percent of the data; okay? Both versions are done and used really often. You can either just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or you can say, having chosen the degree of the polynomial you want to fit, go back and retrain the model on the entire 100 percent of your data. Both of these are commonly done.
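As a concrete sketch, not from the lecture, here is what simple hold-out cross validation might look like in Python for choosing the degree of a polynomial; the 70/30 split, the synthetic data, and the candidate degrees are all illustrative assumptions:

```python
# Sketch: hold-out cross validation for choosing a polynomial degree.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: a quadratic signal plus noise.
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.shape)

# Randomly split into 70% training and 30% hold-out cross validation sets.
perm = rng.permutation(len(x))
split = int(0.7 * len(x))
train, hold = perm[:split], perm[split:]

def cv_error(degree):
    """Train on the 70% split, return mean squared error on the 30% split."""
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    pred = np.polyval(coeffs, x[hold])
    return float(np.mean((pred - y[hold]) ** 2))

degrees = range(1, 11)
errors = {d: cv_error(d) for d in degrees}
best = min(errors, key=errors.get)
print({d: round(e, 3) for d, e in errors.items()})
print("selected degree:", best)

# Optionally retrain the selected model on 100% of the data, as discussed.
final_coeffs = np.polyfit(x, y, deg=best)
```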
So hold-out cross validation works in a fairly straightforward way. But sometimes we're working with a company or an application where, as in many machine learning applications, we have very little data, or where every training example you have was painfully acquired at great cost; right? Sometimes your data is acquired from medical experiments, and each training example represents a patient and some amount of physical human pain or something. So if you say, well, I'm going to hold out 30 percent of your data set just to select my model, that sometimes causes unhappiness, and maybe you don't want to have to leave out 30 percent of your data just to do model selection.

So there are a couple of other variations on hold-out cross validation that make somewhat more efficient use of the data. One is called k-fold cross validation. Here's the idea: I'm going to take all of my data S; so imagine I draw this box S to denote the entirety of all the data I have. I'll then divide it into k pieces; it's five pieces in what I've drawn. Then what I'll do is repeatedly train on k minus one pieces and test on the remaining piece, and then average over the k results.

In other words: I'll hold out, say, just one fifth of my data, train on the remaining four fifths, and test on the first piece. Then I'll hold out the second fifth, train on the remaining pieces, and test on that; then remove the third piece, train on the other four fifths, and so on; I'm going to do this five times. Then I'll take the five error measures I have and average them, and this gives me an estimate of the generalization error of my model; okay? And then again, when you do k-fold cross validation, usually you go back and retrain the model you selected on the entirety of your training set.

So I drew five pieces here because that was easier for me to draw, but k equals 10 is very common; okay? I should say, k equals 10 is the fairly common choice, doing 10-fold cross validation. And the advantage of this over hold-out cross validation is that you split the data into ten pieces.
The advantage of this over simple hold-out cross validation is that, having split the data into ten pieces, each time you're only holding out 1/10 of your data, rather than, say, 30 percent of your data. I should say, in standard, simple hold-out cross validation, a 70/30 split is fairly common, or sometimes a 2/3 versus 1/3 split. And if you use k-fold cross validation, k equals 5, or more commonly k equals 10, is the usual choice. The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model, you now need to train your model ten times instead of just once; so you run logistic regression ten times per model, rather than just once, and this is computationally more expensive. But k equals ten works great. And then, finally, there's a version of this that you can take even further, which is when you set k equal to m. That's when you take your training set and split it into as many pieces as you have training examples. This procedure is called leave-one-out cross validation. What you do is take out the first training example, train on the rest, and test on the first example. Then you take out the second training example, train on the rest, and test on the second example. Then you take out the third example, train on everything but the third example, and test on the third example, and so on. With this many pieces, you are making maybe even more effective use of your data than k-fold cross validation. But leave-one-out cross validation is computationally very expensive, because now you need to repeatedly leave one example out and run your learning algorithm on m minus one training examples; you need to do this m times, and so this is computationally very expensive. Typically, this is done only when you're extremely data scarce. So if you have a learning problem where you have, say, 15 training examples or something, so very few training examples, then leave-one-out cross validation is maybe preferred.
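Leave-one-out cross validation is just the k = m special case; here is a small sketch with a deliberately tiny, made-up data set (again assuming scikit-learn, and not the lecture's own code):

```python
# Leave-one-out cross validation: k-fold with k = m, the number of examples.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
y = np.tile([0, 1], 8)                           # deliberately tiny: 16 examples
X = rng.standard_normal((16, 3)) + y[:, None]    # class-dependent placeholder features

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(float(model.predict(X[test_idx])[0] != y[test_idx][0]))

# m training runs for a single candidate model, which is why this is usually
# reserved for very data-scarce problems.
loocv_error = np.mean(errors)
```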
Yeah?

Student: Earlier you proved a bound relating the generalization error to the number of examples in your training set and the VC dimension. So maybe by [inaudible] the examples into different groups, we could use that for [inaudible], compute the training error, and use that for estimating the generalization error.

Instructor (Andrew Ng): Yeah, that's done, but in practice I personally tend not to do that. The VC dimension bounds are somewhat loose bounds, and there are people in structural risk minimization who propose what you describe, but I personally tend not to do that. Other questions on cross validation? Yeah.

Student: This is kind of a tangent, but since we spent all this time [inaudible], how many data points do you need to get a certain margin of [inaudible]?

Instructor (Andrew Ng): Right.

Student: So it seems like I'd be able to use that instead of doing this; more analytically, I guess. [Inaudible].

Instructor (Andrew Ng): Okay. So it turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose, because you're proving a worst-case upper bound that holds true even for very bad distributions. The bounds I proved just now hold true for absolutely any probability distribution over training examples; right? We just assume the training examples are drawn iid from some distribution script D, and the bounds I proved hold true for absolutely any such distribution script D.
And chances are, whatever real-life distribution you get over, you know, houses and their prices or whatever, is probably not as bad as the very worst one you could have gotten; okay? And so it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers. Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability; how many training examples do you need? If you actually plug the actual constants into these learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need. You end up with some ridiculously large number, like needing 10,000 training examples to fit ten parameters. So a good way to think of these learning theory bounds (and this is also why, when I write papers on learning theory bounds, I quite often use big-O notation and just ignore the constant factors) is that the bounds tend to be very loose. There are some attempts to use these bounds to give guidelines as to what model to choose, and so on. But I personally tend to use the bounds for intuition: for example, whether the number of training examples you need grows linearly in the number of parameters, or grows linearly in the VC dimension, or grows quadratically in the number of parameters. So it's quite often the shape of the bounds that matters. The fact that the sample complexity is linear in the VC dimension, that's a useful intuition you can get from these theorems. But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you are working on. So did that answer your question?

Student: Uh-huh.

Instructor (Andrew Ng): Yeah.
And it turns out, by the way, for myself, a rule of thumb that I often use is: if you're trying to fit a logistic regression model with n parameters, or n plus one parameters, then if the number of training examples is ten times your number of parameters, you're probably in good shape; and if your number of training examples is many times the number of parameters, then you're probably perfectly fine fitting that model. So those are the sorts of intuitions you can get from these bounds.

Student: In cross validation, do we assume the examples are split randomly?

Instructor (Andrew Ng): Yes. So by convention, we usually split into the train and test sets randomly.

One more thing I want to talk about for model selection: there's actually a special case of model selection called the feature selection problem. And here's the intuition: for many machine learning problems, you may have a very high-dimensional feature space, with very high-dimensional feature vectors x. So for example, for text classification, and I want to use this text classification example of spam versus non-spam, you may easily have on the order of 30,000 or 50,000 features; I think I used 50,000 in my earlier examples. If you have so many features, say 50,000 features, then depending on what learning algorithm you use, there may be a real risk of overfitting. And so if you can reduce the number of features, maybe you can reduce the variance of your learning algorithm and reduce the risk of overfitting. For the specific case of text classification, you might imagine that there's only a small number of relevant features: there are all these English words, and many of them probably don't tell you anything at all about whether the email is spam or non-spam. English function words like "the", "of", "a", and "and" are probably words that don't tell you anything about whether the email is spam or non-spam. In contrast, there will be a much smaller number of features that are truly relevant to the learning problem. So for example, if you see the word "buy" or "Viagra", those are words that are very useful for telling you whether something is spam or non-spam.
And if you see the word "Stanford", or "machine learning", or your own personal name, those are other words that are useful for telling you whether something is spam or non-spam. So in feature selection, we would like to select a subset of the features, hopefully the ones most relevant to a specific learning problem, so as to give ourselves a simpler hypothesis class to choose from, and therefore reduce the risk of overfitting, even when we may have had 50,000 features originally. So how do you do this? Well, if you have n features, then there are two to the n possible subsets; right? Because each of your n features can either be included or excluded, there are two to the n possibilities, and this is a huge space. So in feature selection, what we most commonly do is use various search heuristics, sort of simple search algorithms, to try to search through this space of two to the n possible subsets of features and find a good subset; it's too large a number to enumerate all possible feature subsets. As a concrete example, here is the forward search algorithm; it's also called the forward selection algorithm. It's actually pretty simple, but I'll just write it out; my writing it out will make it look more complicated than it really is. It starts with: initialize the set script F to be the empty set. Then repeat: for i equals one to n, try adding feature i to the set script F, and evaluate the resulting model using cross validation. And by cross validation, I mean any of the three flavors, be it simple hold-out cross validation, k-fold cross validation, or leave-one-out cross validation.
And then, you know, set F to be equal to F union the best single feature found; okay? And finally, you output the best hypothesis found. So the forward selection procedure is: start with the empty set of features. Then on each iteration, take each of your features that isn't already in your set script F and try adding that feature to your set. You train a model with each of these candidate sets, evaluate them all using cross validation, and basically figure out what is the best single feature to add to your set script F. In step two here, you go ahead and add that feature to your set script F, and you iterate. And when I say best feature or best model here, by best I really mean the best model according to hold-out cross validation: the single feature addition that results in the lowest hold-out cross validation error, or the lowest cross validation error. So you do this, adding one feature at a time. When do you terminate? If you've added all the features to F, so F is now the entire set of features, you can terminate. Or, if by some rule of thumb you know that you probably don't ever want more than a certain number of features, you can also terminate once F exceeds some threshold number of features. So maybe if you have 100 training examples and you're fitting logistic regression, you probably know you won't want more than 100 features, and so you stop after you have 100 features added to the set F; okay? And then finally, having done this, output the best hypothesis found. Again, by best I mean: while running this algorithm, you'll be training and seeing lots of hypotheses and testing them using cross validation; of all of the hypotheses you've seen during this entire procedure, pick the one with the lowest cross validation error that you saw; okay? So that's forward selection.
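Here is a rough sketch of that forward-search loop, scoring each candidate subset with simple hold-out cross validation; the synthetic data, the stopping threshold, and the helper name `holdout_error` are illustrative assumptions, not the lecture's own code:

```python
# Sketch of forward-search (wrapper) feature selection.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_features = 8
X = rng.standard_normal((300, n_features))
y = (X[:, 0] - 2 * X[:, 3] + 0.5 * rng.standard_normal(300) > 0).astype(int)

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

def holdout_error(feature_idx):
    """Train on the chosen feature subset; return hold-out cross validation error."""
    cols = sorted(feature_idx)
    model = LogisticRegression().fit(X_tr[:, cols], y_tr)
    return np.mean(model.predict(X_ho[:, cols]) != y_ho)

F = set()                       # script F: currently selected features
best_seen = (np.inf, frozenset())
max_features = 4                # stop early rather than adding every feature

while len(F) < max_features:
    # Try adding each feature not already in F; keep the single best addition.
    trials = {i: holdout_error(F | {i}) for i in range(n_features) if i not in F}
    i_best = min(trials, key=trials.get)
    F.add(i_best)
    if trials[i_best] < best_seen[0]:
        best_seen = (trials[i_best], frozenset(F))

# Output the best feature subset (and hence hypothesis) seen during the search.
best_error, best_subset = best_seen
```

Backward search, mentioned below, would run the same kind of loop in reverse: start from all the features and repeatedly delete the one whose removal gives the lowest cross validation error.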
So, let's see, just to give this a name: this is an instance of what's called wrapper feature selection. And the term wrapper comes from the fact that this feature selection algorithm I just described, forward selection or forward search, is a piece of software that you write that wraps around your learning algorithm, in the sense that to perform forward selection, you need to repeatedly make calls to your learning algorithm to train your model using different subsets of features; okay? So this is called wrapper model feature selection. And it tends to be somewhat computationally expensive, because as you're performing the search process, you're repeatedly training your learning algorithm over and over on all of these different subsets of features. Let me also just mention that there is a variation of this called backward search, or backward selection, which is where you start with F equal to the entire set of features, and you delete features one at a time; okay? So that's backward search or backward selection, and this is another feature selection algorithm you might use. Part of whether this makes sense is that there will be problems where it really doesn't even make sense to initialize F to be the set of all features. So if you have 100 training examples and 10,000 features, which may well happen, say 100 emails with 10,000 features, then depending on the learning algorithm you're using, it may or may not make sense to initialize the set F to be all features and train a model using all of the features.
And if it doesn't make sense to train a model using all the features, then forward selection would be more commonly used. So, let's see. Wrapper model feature selection algorithms tend to work well, and in particular, they actually often work better than a different class of algorithms I'm going to talk about now. But their main disadvantage is that they're computationally very expensive. Do you have any questions about this before I talk about the other class? Yeah?

Student: [Inaudible].

Instructor (Andrew Ng): Yeah, yes, you're actually right. So forward search and backward search are both search heuristics, and for either of these, you cannot guarantee they'll find the best subset of features. It actually turns out that under many formalizations of the feature selection problem, it is an NP-hard problem to find the best subset of features. But in practice, forward selection and backward selection work fine, and you can also envision other search algorithms, other methods for searching through the space of two to the n possible feature subsets.

So, let's see. Wrapper feature selection tends to work well when you can afford to do it computationally. But for problems such as text classification, specifically because you have so many features, easily 50,000 features, forward selection would be very, very expensive. So there's a different class of algorithms that tends not to do as well in terms of generalization error, so you tend to learn a hypothesis that works less well, but is computationally much less expensive. These are called the filter feature selection methods. And the basic idea is that for each feature i, we will compute some measure of how informative x_i is about y; okay?
And to do this, we'll use some simple heuristics; for every feature, we'll just compute some rough estimate, some measure, of how informative x_i is about y. There are many ways you can do this. One way is to just compute the correlation between x_i and y: for each of your features, see how correlated it is with your class label y, and then pick the top k most correlated features. Another way to do this, for the case of text classification, is one other informativeness measure that's used very commonly, called mutual information. You'll get to play with some of these ideas in the problem sets, but I'll just say this very briefly. So the mutual information between feature x_i and y; I'll just write out the definition, I guess. Let's say this is text classification, so x_i can take on two values, 0 and 1. The mutual information between x_i and y is defined as the sum, over the possible values of x_i and the possible values of y, of p(x_i, y) log [ p(x_i, y) / ( p(x_i) p(y) ) ]. All of these distributions, the joint distribution over x_i and y, you would estimate from your training data; and all of these other quantities you would estimate from the training data as well: what's the probability that x_i is 0, what's the probability that x_i is 1, what's the probability that x_i is 0 and y is 0, that x_i is 1 and y is 0, and so on. So it turns out there's a standard information-theoretic measure of how different two probability distributions are, and I'm not going to prove this here, but it turns out that this mutual information is actually an instance of that standard measure of how different two distributions are, which is called the KL divergence.
When you take a class in information theory, you will have seen the concepts of mutual information and the KL divergence; but if you haven't, don't worry about it. The intuition is that there's something called the KL divergence that's a formal measure of how different two probability distributions are. And mutual information is a measure of how different the joint distribution of x and y is from the distribution you would get if you were to assume they were independent; okay? So if x and y were independent, then p(x, y) would be equal to p(x) times p(y), and so those two distributions would be identical, and the KL divergence would be 0. In contrast, if x and y are very non-independent, in other words, if x and y are very informative about each other, then this KL divergence will be large. And so mutual information is a formal measure of how non-independent x and y are. If x and y are highly non-independent, that means x presumably tells you something about y, so they'll have large mutual information, and this measure will tell you that x_i might be a good feature. You'll get to play with some of these ideas more in the problem sets, so I won't say much more about it. And what you do then is, having chosen some measure like correlation or mutual information or something else, you pick the top k features; meaning that you compute the correlation between x_i and y for all the features, or the mutual information between x_i and y for all the features, and then you include in your learning algorithm the k features with the largest correlation with the label, or the largest mutual information with the label, or whatever. And to choose k, you can actually use cross validation as well; okay? So you would take all your features, sort them in decreasing order of mutual information, and then try using just the top one feature, the top two features, the top three features, and so on, and decide how many features to include using cross validation; okay? Or sometimes you can just choose k by hand, as well.
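As a rough sketch of this filter approach with the mutual information score, estimated from empirical counts as in the definition above; the binary placeholder data and the choice k = 5 are illustrative assumptions (in practice k would be chosen by cross validation, as just described):

```python
# Sketch of filter feature selection: score each binary feature x_i by its
# empirical mutual information with the label y, then keep the top k.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.integers(0, 2, size=(n, d))                  # binary features, e.g. word presence
y = (X[:, 0] ^ (rng.random(n) < 0.1)).astype(int)    # label mostly driven by feature 0

def mutual_information(xi, y):
    """MI(x_i, y) = sum over a, b in {0,1} of p(a, b) * log( p(a, b) / (p(a) p(b)) ),
    with the probabilities estimated from the data and 0 log 0 treated as 0."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (y == b))
            p_a, p_b = np.mean(xi == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

scores = np.array([mutual_information(X[:, i], y) for i in range(d)])

k = 5                                    # in practice, choose k by cross validation
top_k = np.argsort(scores)[::-1][:k]     # indices of the k most informative features
X_reduced = X[:, top_k]
```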
Okay. Questions about this? Okay. Cool. Great. So next lecture I'll continue and wrap up with Bayesian model selection; but we're close to the end of time, so let's stop here.