This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. If you haven't given me the homework yet, you can just give it to me at the end of class. That's fine. Let's see. And also just a quick reminder: I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.

What I want to do today is wrap up our discussion on support vector machines, and in particular we'll talk about the idea of kernels, then talk about the L1 norm soft margin SVM, and then I'll talk about the SMO algorithm, which is an algorithm for solving the optimization problem that we posed last time.

To recap, we wrote down the following convex optimization problem:

    min_{w,b} (1/2)||w||^2
    s.t. y^(i)(w^T x^(i) + b) >= 1, i = 1, ..., m.

All this is assuming that the data is linearly separable, which is an assumption that I'll fix later. And so with this optimization problem, given a training set, this will find the optimal margin classifier for the data set, the classifier that maximizes the geometric margin on your training examples.

And so in the previous lecture, we also derived the dual of this problem, which was to maximize

    W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} y^(i) y^(j) alpha_i alpha_j <x^(i), x^(j)>
    s.t. alpha_i >= 0, sum_i alpha_i y^(i) = 0.

And this is the dual of our primal optimization problem. Here, I'm using these angle brackets to denote inner product, so <x^(i), x^(j)> is just x^(i) transpose x^(j) for vectors x^(i) and x^(j). We also worked out that w would be given by w = sum_i alpha_i y^(i) x^(i). Therefore, when you need to make a prediction at classification time, you need to compute the value of the hypothesis applied to an input x, which is g(w^T x + b), where g is that threshold function that outputs plus one and minus one. And so this is g(sum_i alpha_i y^(i) <x^(i), x> + b).
So that can also be written entirely in terms of inner products between input vectors x. So what I want to do now is talk about the idea of kernels, which will make use of this property, because it turns out that the only dependence of the algorithm on x is through these inner products. In fact, you can write the entire algorithm without ever explicitly referring to an x vector, referring only to inner products between input feature vectors.

And the idea of a kernel is as follows. Let's say that you have an input attribute x. Let's just say for now it's a real number; maybe this is the living area of a house that you're trying to make a prediction on, like whether it will be sold in the next six months. Quite often, we'll take this feature x and we'll map it to a richer set of features. So for example, we might take x and map it to these four polynomial features, x, x^2, x^3, and x^4, and let me call this mapping Phi. So we'll let Phi(x) denote the mapping from your original features to some higher dimensional set of features.

So if you do this and you want to use the features Phi(x), then all you need to do is go back to the learning algorithm, and everywhere you see <x^(i), x^(j)>, we'll replace it with the inner product between Phi(x^(i)) and Phi(x^(j)). So this corresponds to running a support vector machine with the features given by Phi(x) rather than with your original one-dimensional input feature x.

And in the scenario that I want to consider, sometimes Phi(x) will be very high dimensional. So for example, Phi(x) may contain very high degree polynomial features. Sometimes Phi(x) will actually even be an infinite dimensional vector of features, and the question is, if Phi(x) is extremely high dimensional, then it seems you can't compute these inner products very efficiently, because the computer would need to represent an extremely high dimensional feature vector and then take inner products in that space, which would be inefficient.
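Here is a minimal sketch, just for illustration (the two inputs are made up, and NumPy is assumed), of what using Phi explicitly looks like:

```python
import numpy as np

def phi(x):
    """Map a scalar input attribute (say, living area) to the four
    polynomial features from the example above."""
    return np.array([x, x**2, x**3, x**4])

# Wherever the algorithm used the inner product <x_i, x_j>,
# it would now use <phi(x_i), phi(x_j)> instead.
x_i, x_j = 2.0, 3.0                 # hypothetical inputs
print(phi(x_i) @ phi(x_j))          # 6 + 36 + 216 + 1296 = 1554.0
```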
It turns out that in many important special cases, we can write down what we'll call the kernel function, denoted by K, which will be

    K(x, z) = <Phi(x), Phi(z)>,

the inner product between those feature vectors. It turns out there will be important special cases where computing Phi(x) is computationally very expensive, or maybe impossible: it may be an infinite dimensional vector, and you can't represent an infinite dimensional vector in a computer. There will be important special cases where Phi(x) is very expensive to represent because it is so high dimensional, but nonetheless, you can actually compute the kernel between x^(i) and x^(j), that is, compute the inner product between those two feature vectors, very inexpensively.

And so the idea of the support vector machine is that everywhere in the algorithm that you see these inner products, we're going to replace them with a kernel function that you can compute efficiently, and that lets you work in feature spaces Phi(x) even if Phi(x) is very high dimensional.

Let me now say how that's done. A little bit later today, we'll actually see some concrete examples of Phi(x) and of kernels. For now, let's just think about constructing kernels explicitly, which is best illustrated by an example.

Let's say you have two inputs, x and z. Normally I should write those as x^(i) and x^(j), but I'm just going to write x and z to save on writing. Let's say my kernel is K(x, z) = (x^T z)^2. And so this is

    (x^T z)^2 = (sum_i x_i z_i)(sum_j x_j z_j) = sum_i sum_j (x_i x_j)(z_i z_j).

And so this kernel corresponds to the feature mapping where Phi(x) is the vector of all the products x_i x_j. Let me write this down for the case of n equals three, I guess.
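The board work at this point isn't in the transcript; for n = 3, the feature map being written down is presumably the vector of all nine pairwise products:

```latex
\phi(x) =
\begin{bmatrix}
x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\
x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\
x_3 x_1 \\ x_3 x_2 \\ x_3 x_3
\end{bmatrix}
```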
And so with this definition of Phi(x), you can verify for yourself that (x^T z)^2 becomes the inner product between Phi(x) and Phi(z), because to get an inner product between two vectors, you take the sum over the corresponding elements of the vectors multiplied together. So if this is Phi(x), then the inner product between Phi(x) and Phi(z) will be the sum over all the elements of this vector times the corresponding elements of Phi(z), and what you get is sum_i sum_j (x_i x_j)(z_i z_j), which is exactly (x^T z)^2.

And so the cool thing about this is the computational cost. If n is the dimension of x and z, then Phi(x) is a vector of all the pairs x_i x_j multiplied with each other, and so the length of Phi(x) is n^2; you need order n^2 time just to compute Phi(x). But to compute K, to compute the kernel function, all you need is order n time, because the kernel function is defined as (x^T z)^2: you just take the inner product between x and z, which is order n time, and you square that, and you've computed the kernel function. And so you've just computed the inner product between two vectors where each vector has n^2 elements, but you did it in order n time.

Student: For any kernel you find for x and z, does Phi exist for x and z?

Instructor (Andrew Ng): Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.
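A quick numerical check of that claim, as a sketch (arbitrary made-up vectors, NumPy assumed):

```python
import numpy as np

def phi(x):
    """Feature map for K(x, z) = (x^T z)^2: all n^2 pairwise products x_i x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = phi(x) @ phi(z)   # inner product of two n^2-element vectors
kernel   = (x @ z) ** 2      # the same number, in O(n) time
print(explicit, kernel)      # both are (0.5 - 2 + 6)^2 = 20.25
```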
So let me just describe a couple of quick generalizations to this. One is that if you define K(x, z) to be equal to (x^T z + c)^2, so again, you can compute this kernel in order n time, then that turns out to correspond to a feature vector where, compared to the one we just saw, you add a few more elements at the bottom: root 2 c x1, root 2 c x2, root 2 c x3, and c. And so this is a way of creating a feature vector with both the monomials, meaning the first order terms, as well as the quadratic terms, the products between x_i and x_j, and the parameter c here allows you to control the relative weighting between the monomial terms, the first order terms, and the quadratic terms. Again, this is still an inner product between vectors of length order n squared, computed in order n time.

More generally, here are some other examples of kernels. Actually, a generalization of the one I just derived would be the kernel K(x, z) = (x^T z + c)^d. And so this corresponds to using all of the n plus d choose d features of all monomials, where monomials just means the products x_i x_j x_k and so on, all the polynomial terms up to degree d. The dimension of this feature space is on the order of n plus d choose d, so this grows exponentially in d. This is a very high dimensional feature vector, but again, you can implicitly construct the feature vector and take inner products between such vectors. It's very computationally efficient, because you just compute the inner product between x and z, add c, and take that real number to the power of d, and by plugging this in as a kernel, you're implicitly working in an extremely high dimensional feature space.
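As a small sketch of that cost argument (NumPy assumed, values made up):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    """Polynomial kernel K(x, z) = (x^T z + c)^d.

    O(n) arithmetic, yet it implicitly corresponds to a feature space
    of all monomials up to degree d, roughly (n + d choose d) dimensions."""
    return (x @ z + c) ** d

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(poly_kernel(x, z))   # (4.5 + 1)^3 = 166.375
```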
So what I've given is just a few specific examples of how to create kernels. Let me now ask, more generally: if you're faced with a new machine learning problem, how do you come up with a kernel? There are many ways to think about it, but here's one intuition that's sort of useful.

Given a set of attributes x, you're going to use a feature vector Phi(x), and given a set of attributes z, you're going to use a feature vector Phi(z), and so the kernel is computing the inner product between Phi(x) and Phi(z).

And so one intuition, and this is a partial intuition, not a rigorous one, is that if x and z are very similar, then Phi(x) and Phi(z) will be pointing in roughly the same direction, and therefore the inner product will be large. Whereas in contrast, if x and z are very dissimilar, then Phi(x) and Phi(z) may be pointing in different directions, and so the inner product may be small. That intuition is not a rigorous one, but it's sort of a useful one to think about.

If you're faced with a new learning problem, if I give you some random thing to classify and you want to decide how to come up with a kernel, one way is to try to come up with a function K(x, z) that is large if you want the learning algorithm to think of x and z as similar, and small if you want it to think of x and z as dissimilar. Again, this isn't always true, but this is one of several useful intuitions. So if you're trying to classify some brand new thing, you're trying to classify [inaudible] or DNA sequences or something, some strange thing you want to classify, one thing you could do is try to come up with a kernel that's large when you want the algorithm to think these are similar things, and small when they're dissimilar.

And so suppose I have something I want to classify, and let's say I write down the function that I think is a good measure of how similar or dissimilar x and z are for my specific problem. Let's say I write down

    K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).

I write down this function, and I think this is a good measure of how similar x and z are.
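For concreteness, a minimal sketch of that function in code, assuming the Gaussian form that's named later in the lecture:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): near 1 when x and z
    are close together, decaying toward 0 as they move apart."""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x))                      # identical inputs -> 1.0
print(gaussian_kernel(x, np.array([5.0, -3.0])))  # dissimilar inputs -> ~0
```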
The question, then, is: is this really a valid kernel? In other words, to understand how we can construct kernels, if I write down a function like that, the question is whether there really exists some Phi such that K(x, z) is equal to the inner product <Phi(x), Phi(z)>. That's the question of whether K is a valid kernel.

It turns out that there is a result that characterizes necessary and sufficient conditions for when a function K that you might choose is a valid kernel. Let me go ahead and show part of that result now.

Suppose K is a valid kernel, and when I say K is a kernel, what I mean is that there does indeed exist some function Phi for which this holds true. Then let any set of points x^(1) up to x^(m) be given, and let me define a matrix K. I apologize for overloading notation; I'm going to use K to denote both the kernel function, which is a function of x and z, as well as a matrix. Unfortunately, there aren't enough alphabets. Well, that's not true. We define the kernel matrix to be the m-by-m matrix such that K subscript ij is equal to the kernel function applied to two of my examples, K_ij = K(x^(i), x^(j)).

Then it turns out that for any vector z that's m-dimensional, I want you to consider z^T K z. By definition of matrix multiplication, this is

    z^T K z = sum_i sum_j z_i K_ij z_j,

and K_ij is the kernel function applied to x^(i) and x^(j). I assumed that K is a valid kernel function, so there must exist such a Phi, and this must equal

    sum_i sum_j z_i <Phi(x^(i)), Phi(x^(j))> z_j.

This is the inner product between two feature vectors, so let me just make that inner product explicit. I'm going to sum over the elements of this vector, and I'm going to use Phi_k(x^(i)) to denote the k-th element of this vector, which gives

    sum_i sum_j z_i sum_k Phi_k(x^(i)) Phi_k(x^(j)) z_j.

Just rearrange the sums, and you get a sum over k.
This next step may look familiar to some of you; it is just

    sum_k (sum_i z_i Phi_k(x^(i)))^2.

Therefore, this is a sum of squares, and it must therefore be greater than or equal to zero.

Do you want to take a minute to look over all the steps and just make sure you buy them all? Oh, this step is the inner product between the vectors Phi(x^(i)) and Phi(x^(j)); the inner product between two vectors is the sum over the corresponding elements of the vectors multiplied together.

Student: [Inaudible].

Instructor (Andrew Ng): Oh, yes it is. This is just a^T b = sum_k a_k b_k, so that's just this: the sum over k of the k-th elements of the two vectors. Take a look at this and make sure it makes sense. Questions about this?

So just to summarize, what we showed was that for any vector z, z^T K z is greater than or equal to zero, and this is one of the standard definitions of the matrix K being positive semidefinite, written K >= 0. So just to summarize, what was shown is that if K is a valid kernel, in other words, if K is a function for which there exists some Phi such that K(x^(i), x^(j)) is the inner product between Phi(x^(i)) and Phi(x^(j)), then the kernel matrix must be positive semidefinite.

It turns out that the converse is also true, and so this gives you a test for whether a function K is a valid kernel. This is a theorem due to Mercer, and so kernels are also sometimes called Mercer kernels. It means the same thing; it just means it's a valid kernel. Let K(x, z) be given.
Then K is a valid kernel, in other words a Mercer kernel, i.e., there exists a Phi such that K(x, z) = Phi(x)^T Phi(z), if and only if for any set of m examples, and this really means for any set of m points, not necessarily a training set, just any set of m points you may choose, it holds true that the kernel matrix, the capital K that I defined just now, is symmetric positive semidefinite.

And so I proved only one direction of this result: I proved that if K is a valid kernel, then the kernel matrix is symmetric positive semidefinite, but the converse I didn't show. It turns out that this condition is both necessary and sufficient, and so this gives you a useful test for whether any function that you might want to choose is a kernel.

A concrete example of something that's clearly not a valid kernel would be if you find an input x such that K(x, x) is, say, minus one. That's an example of something that's clearly not a valid kernel, because minus one cannot possibly be equal to Phi(x)^T Phi(x), and so this would be one of many examples of functions that fail to meet the conditions of this theorem, because the inner product of a vector with itself is always greater than or equal to zero.
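As a sketch of this test in practice (NumPy assumed, points made up): sample some points, build the kernel matrix, and check symmetry and nonnegative eigenvalues up to numerical tolerance. Passing on one point set doesn't prove validity, but a failure disproves it:

```python
import numpy as np

def kernel_matrix(kernel, xs):
    """Build the m-by-m matrix K_ij = kernel(x_i, x_j)."""
    m = len(xs)
    return np.array([[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)])

def passes_mercer_check(kernel, xs, tol=1e-8):
    """Check the necessary condition: symmetric positive semidefinite."""
    K = kernel_matrix(kernel, xs)
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

pts = list(np.random.default_rng(0).normal(size=(10, 3)))
print(passes_mercer_check(lambda x, z: (x @ z + 1.0) ** 2, pts))  # True
print(passes_mercer_check(lambda x, z: -float(x @ z), pts))       # False
```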
So just to tie this back explicitly to the SVM: to use a support vector machine with a kernel, what you do is choose some function K(x, z). And it turns out that the function I wrote down just now is indeed a valid kernel; it is called the Gaussian kernel because of its similarity to Gaussians. So you choose some kernel function like this, or you may choose (x^T z + c)^d. You choose one of these functions, and the choice would depend on your problem; it depends on what is a good measure of whether two examples are similar or different for your problem. Then you go back to our formulation of the support vector machine, and you have to use the dual formulation, and you then replace everywhere you see <x^(i), x^(j)> with K(x^(i), x^(j)). You then run exactly the same support vector machine algorithm, only everywhere you see these inner products, you replace them with that, and what you've just done is you've taken a support vector machine and you've taken each of your feature vectors x and replaced it, implicitly, with a very high dimensional feature vector.

It turns out that the Gaussian kernel corresponds to a feature vector that's infinite dimensional. Nonetheless, you can run a support vector machine in a finite amount of time, even though you're working with infinite dimensional feature vectors, because all you ever need to do is compute these kernel values, and you don't ever need to represent these infinite dimensional feature vectors explicitly.

Why is this a good idea? It turns out, when I started off talking about support vector machines, I started by saying that we wanted to develop non-linear learning algorithms. So here's one useful picture to keep in mind. Let's say your original data, I didn't mean to draw that slanted, let's say you have one-dimensional input data: you just have one feature, x in R. What a kernel does is the following: it takes your original input data and maps it to a very high dimensional feature space. In the case of the Gaussian kernel, that's an infinite dimensional feature space, but for pedagogical reasons, I'll draw two dimensions here. So say the data gets mapped to some very high dimensional feature space, like so.
So it takes all your data in R1 and maps it to R infinity, and then you run a support vector machine in this infinite dimensional space, or in some exponentially high dimensional space, and you find the optimal margin classifier, in other words, the classifier that separates your data in this very high dimensional space with the largest possible geometric margin.

In this example that I just drew, whereas your data was not linearly separable in your original one dimensional space, when you map it to this much higher dimensional space, it becomes linearly separable, so you can use your linear classifier to separate data that was not separable in your original space. This is how support vector machines output nonlinear decision boundaries, and in the entire process, all you ever need to do is solve convex optimization problems.

Questions about any of this?

Student: [Inaudible] sigma?

Instructor (Andrew Ng): Yeah, so, how do you choose sigma? Let's see. Well, I was going to talk about that later. One way to choose sigma is to save aside a small amount of your data: train an SVM using, say, two thirds of your data, try different values of sigma, and see what works best on a separate hold-out cross validation set, a separate set that you test on. Some of the other learning algorithms we've talked about, like locally weighted linear regression with its bandwidth parameter, also have parameters like this; there are a number of parameters to some of these algorithms that you can choose by saving aside some data to test on. I'll talk more about model selection explicitly later. Are there other questions?

Student: So how do you know that moving it up to a high dimensional space is going to give you that kind of separation?

Instructor (Andrew Ng): Good question. Usually, you don't know in advance.
Sometimes you can know, but in most cases, you won't know in advance whether the data is actually going to be linearly separable in the new space, so the next topic will be the L1 norm soft margin SVM, which is the modification that makes SVMs work even when the data is not linearly separable.

Student: If you can't get it linearly separated by mapping to a higher dimension, couldn't you also just use [inaudible] higher dimension?

Instructor (Andrew Ng): Right, so this is a question about what to do if you can't separate it even in the higher dimensional space. Let me try to address that with the discussion of soft margin SVMs. Okay.

Student: What if you run an SVM algorithm that assumes the data are linearly separable on data that is not actually linearly separable?

Instructor (Andrew Ng): You guys are really giving me a hard time about whether the data's linearly separable. It turns out this algorithm won't work if the data is not linearly separable, but I'll change that in a second and make it work.

Before I move on to talk about that, let me just say one final word about kernels. I talked about kernels in the context of support vector machines, and the idea of kernels is what really made support vector machines a very powerful learning algorithm. Actually, towards the end of today's lecture, if I have time, I'll give a couple more examples of how to choose kernels as well.

It turns out that the idea of kernels is actually more general than support vector machines. In particular, we took this SVM algorithm and we derived a dual, and that was what let us write the entire algorithm in terms of inner products of the inputs.
It turns out that you can take many of the other algorithms that you've seen in this class, in fact, you can take most of the linear algorithms we talked about, such as linear regression, logistic regression [inaudible], and rewrite them entirely in terms of these inner products.

So if you have any algorithm that you can rewrite in terms of inner products, then that means you can replace those inner products with K(x^(i), x^(j)), and that means you can take any of these algorithms, implicitly map the feature vectors into very high dimensional feature spaces, and have the algorithm still work. The idea of kernels is perhaps most widely used with support vector machines, but it is actually more general than that. You can take many of the other algorithms that you've seen, and many of the algorithms that we'll see later this quarter as well, write them in terms of inner products, and thereby kernelize them and apply them to infinite dimensional feature spaces. You'll actually get to play with many of these ideas more in the next problem set.

Now let's talk about non-linear decision boundaries. This is the idea of what's called the L1 norm soft margin SVM. Machine learning people sometimes aren't great at coming up with good names, but here's the idea.

Let's say I have a data set, and it's a linearly separable data set, but what do I do if I have a couple of other examples there that make the data non-linearly separable? And in fact, sometimes even if the data is linearly separable, you might not want to separate it perfectly. For example, say this is a very nice data set, and it looks like there's a great decision boundary that separates the two classes. Well, what if I had just one outlier down here?
I could still linearly separate this data set with something like that, but then I'm letting one slightly suspicious example skew my entire decision boundary by a lot. And so what I'm going to talk about now is the L1 norm soft margin SVM, which is a slightly modified formulation of the SVM optimization problem. It will let us deal with both of these cases: one, where the data is just not linearly separable, and two, where you have some examples that you'd rather not have dominate the fit. Maybe with an outlier here, you'd actually prefer to choose the original decision boundary and not try so hard to classify that one training example correctly. Here's the formulation.

Our SVM primal problem was to minimize one-half the norm of w squared, subject to each training example having functional margin at least one. So that was our original problem, and I'm going to modify it by adding the following. In other words, I'm going to add these penalty terms, the slack variables xi_i, and I'm going to demand that each of my training examples is separated with functional margin greater than or equal to one minus xi_i.

And you remember, was it two lectures ago, I said that if the functional margin is greater than zero, that implies you classified the example correctly, and if it's less than zero, then you misclassified it. By allowing some of the xi_i to be larger than one, I can have some examples with negative functional margin, and therefore I'm allowing my algorithm to misclassify some of the examples in the training set. However, I'll encourage the algorithm not to do that by adding to the optimization objective this penalty term that penalizes setting the xi_i to be large.

This is an optimization problem where the parameters are w, b, and all of the xi_i, and this is also a convex optimization problem.
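The board work isn't captured in the transcript; written out, the problem being described is presumably the standard L1 soft-margin primal:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y^{(i)}\bigl(w^T x^{(i)} + b\bigr) \ge 1 - \xi_i, \qquad i = 1,\dots,m \\
& \xi_i \ge 0, \qquad i = 1,\dots,m
\end{aligned}
```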
It turns out that, similar to how we worked out the dual of the original support vector machine problem, we can also work out the dual of this optimization problem. I won't actually do the whole derivation, but just to show you the steps: you construct the Lagrangian, and I'm going to use alpha and r to denote the Lagrange multipliers corresponding to the set of constraints that we had previously and to the new set of constraints that the xi_i be greater than or equal to zero. This gives us the Lagrangian

    L(w, b, xi, alpha, r) = (1/2)||w||^2 + C sum_i xi_i
        - sum_i alpha_i [y^(i)(w^T x^(i) + b) - 1 + xi_i] - sum_i r_i xi_i,

which is our optimization objective minus each multiplier times each of these constraints, which are greater than or equal to zero.

I won't re-derive the entire dual again; it's really the same. When you derive the dual of this optimization problem and simplify, you find that you get the following: you have to maximize W(alpha), which is actually the same as before,

    W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} y^(i) y^(j) alpha_i alpha_j <x^(i), x^(j)>.

So it turns out, when you derive the dual and simplify, the only way the dual changes compared to the previous one is that rather than the constraint that the alpha_i be greater than or equal to zero, we now have the constraint that the alpha_i be between zero and C. This derivation isn't very hard, and you're encouraged to go home and try to do it yourself. It's really essentially the same math, and when you simplify, it turns out you can simplify the r multipliers away, and you end up with just these constraints on the alphas:

    0 <= alpha_i <= C,   sum_i alpha_i y^(i) = 0.

Just as an aside, I won't derive these either, but remember, I wrote down the KKT conditions in the last lecture, the necessary conditions for something to be an optimal solution to a constrained optimization problem.
So if you use the KKT conditions, it turns out you can actually derive convergence conditions. We want to solve this optimization problem, so when do we know the alphas have converged to the global optimum? It turns out you can use the following:

    alpha_i = 0        implies  y^(i)(w^T x^(i) + b) >= 1
    alpha_i = C        implies  y^(i)(w^T x^(i) + b) <= 1
    0 < alpha_i < C    implies  y^(i)(w^T x^(i) + b) = 1.

I don't want to say a lot about these; it turns out that from the KKT conditions you can derive these as the convergence conditions for an algorithm that you might choose to use to solve this optimization problem in terms of the alphas.

That's the L1 norm soft margin SVM, and this is the change to the algorithm that lets us handle data sets that are not linearly separable, as well as data sets that may still be linearly separable except for a single outlier that you may choose not to separate correctly.

Questions about this? Raise your hand if this stuff makes sense at all. Great.

So the last thing I want to do is talk about an algorithm for actually solving this optimization problem. We wrote down this dual optimization problem with convergence criteria, so let's come up with an efficient algorithm to actually solve it.

I want to do this partly to give me an excuse to talk about an algorithm called coordinate ascent, which is a useful algorithm to know about. It will turn out that coordinate ascent in its simplest form won't apply to this problem, but we'll then be able to modify it slightly, and that will give us a very efficient algorithm for solving the dual optimization problem. That was the other reason I had to derive the dual: not only so that we could use kernels, but also so that we could apply an algorithm like the SMO algorithm. First, let's talk about coordinate ascent, which is another unconstrained optimization algorithm.
To describe coordinate ascent, I just want you to consider the problem of maximizing some function W, which is a function of alpha_1 through alpha_m, with no constraints. So for now, forget about the constraint that the alpha_i must be between zero and C, and forget about the constraint that sum_i y^(i) alpha_i must be equal to zero.

Then this is the coordinate ascent algorithm: repeat until convergence, and inside the loop, for i equals one to m, hold all the parameters except alpha_i fixed and maximize the function with respect to just that one parameter. Let me write that as

    alpha_i := arg max over alpha_i-hat of W(alpha_1, ..., alpha_{i-1}, alpha_i-hat, alpha_{i+1}, ..., alpha_m).

This is really just a fancy way of saying: hold everything except alpha_i fixed, and optimize the objective W with respect to only alpha_i. That's coordinate ascent.

One picture that's kind of useful for coordinate ascent is this: imagine you're trying to optimize a quadratic function that really looks like that. These are the contours of the quadratic function, and the optimum is here. I'll call this axis alpha_1 and this one alpha_2. Let's say I start down here. Then I'm going to begin by optimizing this with respect to alpha_1, so I go there. Then, at my new point, I'll optimize with respect to alpha_2, and so I might go to someplace like that. Then I'll optimize with respect to alpha_1 again, then go back to alpha_2, and so on. You're always taking these axis-aligned steps to get to the optimum.
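Here is a minimal sketch of coordinate ascent on a made-up two-dimensional concave quadratic (not the SVM objective), where each one-variable maximization has a closed form:

```python
import numpy as np

# Maximize W(a) = -0.5 a^T P a + q^T a by coordinate ascent:
# each inner step solves dW/da_i = 0 for a_i with the rest held fixed.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # positive definite, so W is concave
q = np.array([1.0, 1.0])

a = np.zeros(2)
for _ in range(50):              # "repeat until convergence"
    for i in range(len(a)):
        others = P[i] @ a - P[i, i] * a[i]   # contribution of the fixed a_j
        a[i] = (q[i] - others) / P[i, i]     # exact one-variable maximizer

print(a)                         # approaches the true maximizer P^{-1} q
print(np.linalg.solve(P, q))     # [0.2857..., 0.8571...]
```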
It turns out that there are variations of this as well. The way I described the algorithm, we're always doing the updates in alternating order: we optimize with respect to $\alpha_1$, then $\alpha_2$, then $\alpha_1$, then $\alpha_2$, and so on. What I'm about to say applies only in higher dimensions, but it turns out that if you have a lot of parameters, $\alpha_1$ through $\alpha_m$, you may not choose to always visit them in a fixed order. You may choose which alpha to update next depending on which one you think will allow you to make the most progress. If you have only two parameters, it makes sense to alternate between them; if you have higher-dimensional parameters, sometimes you may choose to update them in a different order if you think doing so would let you make faster progress towards the maximum.

It turns out that compared to some of the algorithms we saw previously, say Newton's method, coordinate ascent will usually take a lot more steps. But the chief advantage of coordinate ascent, when it works well, is that it is sometimes very inexpensive to optimize $W$ with respect to any one of your parameters. So even though coordinate ascent may take many more iterations than, say, Newton's method in order to converge, each iteration can be very cheap. It turns out that there are many optimization problems for which it's particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter, and if that's true, then the inner loop of coordinate ascent, optimizing with respect to $\alpha_i$, can be done very quickly, and coordinate ascent can be very efficient. It turns out that this will be true when we modify this algorithm to solve the SVM optimization problem.

Questions about this?

Okay. Let's go ahead and apply this to our support vector machine dual optimization problem. It turns out that coordinate ascent in its basic form does not work here, for the following reason: we have constraints on the $\alpha_i$'s.
Namely, recall from what we worked out previously that we have the constraint that $\sum_i \alpha_i y_i$ must be equal to zero, and so if you fix all the alphas except for one, then you can't change that one alpha without violating the constraint. If I just try to change $\alpha_1$, then $\alpha_1$ is actually exactly determined as a function of the other alphas, because the terms must sum to zero.

The SMO algorithm, by the way, is due to John Platt, a colleague at Microsoft. In the SMO algorithm, therefore, instead of trying to change one alpha at a time, we will try to change two alphas at a time. This is called the SMO algorithm, for sequential minimal optimization, and the term minimal refers to the fact that we're choosing the smallest number of alphas to change at a time; in this case, we need to change at least two at a time.

So let me go ahead and outline the algorithm. Repeat until convergence: select two alphas to change next, some $\alpha_i$ and $\alpha_j$, via some heuristic, which just means a rule of thumb; then hold all the $\alpha_k$'s fixed except $\alpha_i$ and $\alpha_j$, and optimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, subject to all the constraints.

It turns out the key step, which I'm going to work out, is this last one: how do you optimize $W(\alpha)$ with respect to the two parameters that you just chose to update, subject to the constraints? I'll do that in a second. You keep running this algorithm until you have satisfied the convergence criteria up to $\epsilon$.
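As a sketch of this outline in code, with hypothetical stand-ins for the pieces just described: `choose_pair` for the selection heuristic, `optimize_pair` for the two-variable update derived next, and `kkt_satisfied` for the convergence check up to a tolerance.

```python
def smo_outline(alpha, choose_pair, optimize_pair, kkt_satisfied, eps=1e-3):
    """Skeleton of SMO: update two alphas at a time until the KKT
    convergence conditions hold up to the tolerance eps."""
    while not kkt_satisfied(alpha, eps):
        i, j = choose_pair(alpha)  # heuristic (rule of thumb) choice of a pair
        # Hold all alpha_k (k != i, j) fixed and maximize W with respect to
        # alpha_i, alpha_j, subject to the box and linear constraints.
        optimize_pair(alpha, i, j)
    return alpha
```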
What I want to do now is describe how to do this key step: how to optimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$. It's because you can do this extremely efficiently that the SMO algorithm works well: the inner loop of SMO can be done extremely efficiently, so it may take a large number of iterations to converge, but each iteration is very cheap. Let's talk about that.

In order to derive the step where we update with respect to $\alpha_i$ and $\alpha_j$, I'm actually going to use $\alpha_1$ and $\alpha_2$ as my example. In general, this could be any $\alpha_i$ and $\alpha_j$, but just to make my notation on the board easier, I'm going to do the derivation for $\alpha_1$ and $\alpha_2$; the general case is completely analogous.

On every step of the algorithm, we must respect the constraint that $\sum_i \alpha_i y_i = 0$. This is one of the constraints we had previously for our dual optimization problem. This means that $\alpha_1 y_1 + \alpha_2 y_2$ must be equal to $-\sum_{i=3}^{m} \alpha_i y_i$, which I'm going to denote by $\zeta$.

We also have the constraint that the $\alpha_i$'s must be between zero and $C$; we had two constraints on our dual, and this was the other one. In pictures, the constraint that the $\alpha_i$'s are between zero and $C$ is often called the box constraint, and so if I draw the $\alpha_1$ and $\alpha_2$ axes, I have the constraint that the values of $\alpha_1$ and $\alpha_2$ must lie within the box that ranges from zero to $C$. And so the picture of the algorithm is as follows.
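In other words, the feasible set for the pair $(\alpha_1, \alpha_2)$ is the portion of a line that lies inside the box:

$$\alpha_1 y_1 + \alpha_2 y_2 = \zeta, \qquad 0 \le \alpha_1 \le C, \qquad 0 \le \alpha_2 \le C.$$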
I have the constraint that $\alpha_1 y_1 + \alpha_2 y_2$ must equal $\zeta$, and this implies that

$$\alpha_1 = \frac{\zeta - \alpha_2 y_2}{y_1},$$

and what I want to do is optimize the objective subject to this. What I can do is plug in my definition of $\alpha_1$ as a function of $\alpha_2$, so the objective becomes $W\left(\frac{\zeta - \alpha_2 y_2}{y_1}, \alpha_2, \alpha_3, \ldots, \alpha_m\right)$. It turns out that $W$ is a quadratic function; if you look back at our earlier definition of $W$, you'll find it's quadratic in all the alphas. So if you look at this expression for $W$ and view it as just a function of $\alpha_2$, holding $\alpha_3$, $\alpha_4$ and so on fixed, you find that it is a one-dimensional quadratic function of $\alpha_2$, and so it can be simplified to some expression of the form $a\alpha_2^2 + b\alpha_2 + c$. This is a standard quadratic function, and it's really easy to optimize. When did we learn this? In high school or undergrad or something. You just do that, and it gives you the optimal value for $\alpha_2$.

The last step deals with the box constraint. In pictures: you know your solution must lie on this line, and so there'll be some quadratic function over this line. If, when you optimize the quadratic function, you get a value that lies in the box, then you're done. Or maybe when you optimize your quadratic function, you end up with a value outside the box. If that happens, you clip your solution to map it back inside the box.
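A sketch of this update in code, assuming the one-dimensional objective has already been reduced to the quadratic $a\alpha_2^2 + b\alpha_2 + c$ with $a < 0$, and that $[L, H]$ is the feasible interval for $\alpha_2$ obtained by intersecting the line with the box (computing $a$, $b$, $L$, and $H$ from the data is omitted here):

```python
def update_pair(a, b, y1, y2, zeta, L, H):
    """Maximize the one-dimensional quadratic in alpha_2, clip the result
    to [L, H], then recover alpha_1 from the linear constraint."""
    alpha2 = -b / (2.0 * a)             # unconstrained maximizer of a*x^2 + b*x + c
    alpha2 = max(L, min(H, alpha2))     # clip back onto the feasible line segment
    alpha1 = (zeta - alpha2 * y2) / y1  # alpha_1 is determined by the constraint
    return alpha1, alpha2
```

Note that the endpoints $L$ and $H$ depend on whether $y_1 = y_2$, since that determines the slope of the line through the box.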
That'll give you the optimal solution of this quadratic optimization problem subject to your solution satisfying the box constraint and lying on this straight line; in other words, subject to the solution lying on the line segment within the box.

Having solved for $\alpha_2$ this way, you clip it if necessary to get it back within the box constraint, and then you get $\alpha_1$ as a function of $\alpha_2$. This allows you to optimize $W$ with respect to $\alpha_1$ and $\alpha_2$ quickly, subject to all the constraints, and the key step is really just a one-dimensional quadratic optimization, which we can do very quickly. That is what makes the inner loop of the SMO algorithm very efficient.

Student: You mentioned that in coordinate ascent we can change whichever alpha we want, but in the SMO algorithm we have to change two at a time. Can you help me understand that?

Instructor (Andrew Ng): Right. As I run the optimization algorithm, I have to respect the constraint that $\sum_i \alpha_i y_i$ must be equal to zero. This is a linear constraint that I didn't have when I was talking about coordinate ascent. Suppose I tried to change just $\alpha_1$. Then I know that $\alpha_1$ must be equal to $-\sum_{i=2}^{m} \alpha_i y_i$ divided by $y_1$, right? And so $\alpha_1$ can actually be written exactly as a function of $\alpha_2, \alpha_3, \ldots, \alpha_m$. And so if I hold $\alpha_2$, $\alpha_3$, $\alpha_4$ through $\alpha_m$ fixed, then I can't change $\alpha_1$, because $\alpha_1$ is completely determined by the other alphas.
Whereas in contrast, if I choose to change $\alpha_1$ and $\alpha_2$ at the same time, then I still have the constraint, and I know $\alpha_1$ and $\alpha_2$ must satisfy that linear constraint, but at least this way I can change $\alpha_1$ if I also change $\alpha_2$ accordingly to make sure the constraint stays satisfied.

Student: [Inaudible].

Instructor (Andrew Ng): So $\zeta$ was defined as follows. On each iteration, I have some setting of the parameters $\alpha_1, \alpha_2, \alpha_3$ and so on, and I want to change $\alpha_1$ and $\alpha_2$, say. Let's say that coming out of the previous iteration I had not violated the constraint, so it holds true, and I'm just defining $\zeta$ to be equal to this, because $\alpha_1 y_1 + \alpha_2 y_2$ must be equal to $-\sum_{i=3}^{m} \alpha_i y_i$, and so I'm just defining that quantity to be $\zeta$.

Student: [Inaudible].

Instructor (Andrew Ng): On every iteration, you may change a different pair of alphas. The way you choose which pair is something I don't want to go into in detail; I'll say a couple more words about it shortly. The basic outline of the algorithm is that on every iteration, you select some $\alpha_i$ and $\alpha_j$ to update via some heuristic, and then you use the procedure I just described to actually update $\alpha_i$ and $\alpha_j$. What I just talked about was the procedure to optimize $W$ with respect to $\alpha_i$ and $\alpha_j$; I didn't talk about the heuristic for choosing $\alpha_i$ and $\alpha_j$.

Student: What is the function $W$?

Instructor (Andrew Ng): $W$ is way up there; I'll just write it again. $W(\alpha)$ is the objective of the dual problem we wrote down previously:

$$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle.$$

All of this is about solving the optimization problem for the SVM, so this is the objective function we had; that's $W(\alpha)$.
Student: If you change one of the alphas, optimizing over it, you can make the other one that you have to change worse, right?

Instructor (Andrew Ng): What do you mean, worse?

Student: It will get farther from its optimal value.

Instructor (Andrew Ng): Let me put it differently. What we're trying to do is optimize the objective function $W(\alpha)$, so the metric of progress that we care about is whether $W(\alpha)$ is getting better on every iteration. And what is true for both coordinate ascent and SMO is that on every iteration, $W(\alpha)$ can only increase: it may stay the same or it may increase, but it can't get worse. It's true that the alphas will eventually converge to some values, and it's true that in intervening iterations one of the alphas may move further from and then closer to its final value, but what we really care about is that $W(\alpha)$ gets better every time, which is true.

Just a couple more words on SMO before I wrap up this topic. One is that John Platt's original paper describes a heuristic for choosing which pair $\alpha_i$, $\alpha_j$ to update next. It's one of those things that's not conceptually complicated but is very complicated to explain in words, so I won't talk about it here. If you want to learn about it, go ahead and look up John Platt's paper on the SMO algorithm; it's pretty easy to read. Later on, we'll also post a handout on the course homepage with a simplified version of this heuristic that you can use in the problem sets, and you can see some of these details there.

One other thing I didn't talk about is how to compute the parameter $b$. Everything here solves for the alphas, and the alphas are also what allow us to get $w$; but I didn't talk about how to compute the intercept $b$, and it turns out that, again, that is actually not very difficult.
I'll let you read about that yourself in the notes that we'll post along with the next problem set.

To wrap up today's lecture, what I want to do is tell you briefly about a couple of examples of applications of SVMs.

Let's consider the problem of handwritten digit recognition. In handwritten digit recognition, you're given a pixel array containing a scanned image of, say, a zip code. This is an array of pixels, and some of these pixels will be on and other pixels will be off; a particular combination of pixels being on may represent the character one. The question is, given an input feature vector like this, can you classify the digit? If you have, say, ten pixels by ten pixels, then you have a hundred-dimensional feature vector: 100 features, where maybe the x's are binary, zero or one, or maybe the x's are grayscale values corresponding to how dark each of the pixels is.

It turns out that for many years, a neural network was the champion algorithm for handwritten digit recognition. And it turns out that you can apply an SVM with the following kernel: either the polynomial kernel or the Gaussian kernel works fine for this problem, and just by writing down such a kernel and throwing an SVM at the problem, the SVM gave performance comparable to the very best neural networks. This is surprising, because the support vector machine doesn't take into account any knowledge about the pixels; in particular, it doesn't know that this pixel is next to that pixel, because it's just representing the pixel intensity values as a vector. And so this means the performance of the SVM would be the same even if you were to shuffle all the pixels around.
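For concreteness, here's a sketch of the two kernels mentioned, applied to flattened pixel vectors; `degree`, `c`, and `sigma` are illustrative hyperparameter choices, not values from the lecture.

```python
import numpy as np

def polynomial_kernel(x, z, degree=5, c=1.0):
    # K(x, z) = (x'z + c)^d: an implicit map to all monomials up to degree d
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=5.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): an infinite-dimensional feature map
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```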
So the SVM gave performance comparable to the very best neural networks, which had been under very careful development for many years.

I want to tell you about one other cool example, which is that SVMs are also used to classify other fairly esoteric objects. For example, let's say you want to classify protein sequences into different classes of proteins. Every time I do this, I suspect that the biologists in the room cringe, so I apologize for that. There are 20 amino acids, and the proteins in our bodies are made up of sequences of amino acids. Even though there are 20 amino acids and 26 letters in the alphabet, I'm going to denote the amino acids by the letters A through Z, with apologies to the biologists. So here's an amino acid sequence, represented by a series of letters.

Suppose I want to assign this protein to one of a few classes, depending on what type of protein it is. The question is, how do I construct my feature vector? This is challenging for many reasons, one of which is that protein sequences can be of different lengths: there are some very long protein sequences and some very short ones, and so you can't have a feature saying what the amino acid in the 100th position is, because maybe there is no 100th position in a given protein sequence.

Here's my feature representation. I'm going to write down all possible combinations of four letters: AAAA, AAAB, AAAC, down to AAAZ, and then AABA and so on; you get the idea. Write down all possible combinations of four letters, and my feature representation will be to scan through the sequence of amino acids and count how often each of these length-four subsequences occurs.
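Here is a sketch of this counting representation and the inner product it induces, storing only the length-four windows that actually occur; this simple version is just an illustration, not the dynamic programming algorithm mentioned below.

```python
from collections import Counter

def kmer_counts(sequence, k=4):
    """Count how often each length-k window occurs in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=4):
    """Inner product of the two implicit count vectors, computed without
    ever materializing the full high-dimensional representation."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * cb.get(kmer, 0) for kmer, count in ca.items())
```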
So for example, if BAJT occurs twice, I'll put a two in the corresponding position; for each subsequence that doesn't occur, I'll put a zero; maybe I have a one here and a zero there. This very long vector will be my feature representation for the protein, and this representation applies no matter how long the protein sequence is.

How large is this feature vector? Well, it turns out this is going to be a vector in $\mathbb{R}^{20^4}$, so you have a 160,000-dimensional feature vector, which is reasonably large even by modern computer standards. Clearly, we don't want to explicitly represent these high-dimensional feature vectors: imagine you have 1,000 examples and you store this as doubles; even on modern computers, this is big. It turns out, though, that there's an efficient dynamic programming algorithm that can compute inner products between these feature vectors without representing them explicitly, and so we can apply this feature representation, even though it's a ridiculously high-dimensional feature vector, to classify protein sequences. I won't talk about that algorithm here; if any of you have seen dynamic programming algorithms for finding subsequences, it's kind of reminiscent of those, and you can look it up if you're interested.

This is just another example of a cool kernel, and more generally, if you're faced with some new machine learning problem, sometimes you can choose a standard kernel like the Gaussian kernel, and sometimes there are research papers written on how to come up with a new kernel for a new problem.

Two last sentences. Where are we now? That wraps up SVMs, which many people would consider one of the most effective off-the-shelf learning algorithms, and so as of today, you've actually seen a lot of learning algorithms. I want to close this part of the class by saying: congratulations, you're now well qualified to actually go and apply learning algorithms to a lot of problems. We're still in week four of the quarter, so there's more to come.
In particular, what I want to do next is talk about how to really understand the learning algorithms: when they work well and when they work poorly, and how to take the tools you now have and use them really well. We'll start to do that in the next lecture. Thanks.