00:00:01,520 --> 00:00:08,520
This subtitle is provided from lecsub.jimdo.com under Creative Commons License.

2
00:00:11,789 --> 00:00:15,049
This presentation is delivered by the Stanford Center for Professional 

3
00:00:15,049 --> 00:00:22,049
Development. 

4
00:00:23,680 --> 00:00:25,839
So welcome back. 

5
00:00:25,839 --> 00:00:30,859
And what I wanna do today is continue our discussion on support vector machines. And in 

6
00:00:30,859 --> 00:00:34,410
particular, I wanna talk about the optimal margin classifier. 

7
00:00:34,410 --> 00:00:38,280
Then I wanna take a brief digression and talk about primal and duo optimization problems, and 

8
00:00:38,280 --> 00:00:39,410
in particular, 

9
00:00:39,410 --> 00:00:41,570
what's called the KKT conditions. 

10
00:00:41,570 --> 00:00:45,190
And then we'll derive the duo to the optimization problem that I 

11
00:00:45,190 --> 00:00:46,570
had posed earlier. 

12
00:00:46,570 --> 00:00:50,980
And that will lead us into a discussion of kernels, which I won't really - 

13
00:00:50,980 --> 00:00:54,010
which we just get to say couple words about, but which I'll do probably 

14
00:00:54,010 --> 00:00:57,530
only in the next lecture. 

15
00:00:57,530 --> 00:01:02,079
And as part of today's lecture, I'll spend some time talking about optimization problems. 

16
00:01:02,079 --> 00:01:04,180
And in 

17
00:01:04,180 --> 00:01:08,299
the little time I have today, I won't really be able to do this topic justice. 

18
00:01:08,299 --> 00:01:13,010
I wanna talk about convex optimization and do that topic justice. 

19
00:01:13,010 --> 00:01:17,500
And so at this week's discussion session, the TAs will have more 

20
00:01:17,500 --> 00:01:22,110
time - will teach a discussion session - focus on convex optimization 

21
00:01:22,110 --> 00:01:25,159
- sort of very beautiful and useful theory. 

22
00:01:25,159 --> 00:01:32,159
So you want to learn more about that, listen to this Friday's discussion session. 

23
00:01:33,100 --> 00:01:36,630
Just to recap what we did in the previous lecture, 

24
00:01:36,630 --> 00:01:38,830
as 

25
00:01:38,830 --> 00:01:42,420
we were beginning on developing on support vector machines, I 

26
00:01:42,420 --> 00:01:43,900
said that a 

27
00:01:43,900 --> 00:01:48,800
hypothesis represented as H sub [inaudible] wb as g of 

28
00:01:48,800 --> 00:01:53,090
w transpose [inaudible] x + b, where 

29
00:01:53,090 --> 00:01:55,030


30
00:01:55,030 --> 00:01:55,820
g 

31
00:01:55,820 --> 00:02:02,820
will be 

32
00:02:04,350 --> 00:02:08,599
+1 or -1, depending on whether z is greater than 

33
00:02:08,599 --> 00:02:10,069
0. 

34
00:02:10,069 --> 00:02:12,779
And I said that in our development of support vector machines, 

35
00:02:12,779 --> 00:02:16,999
we'll use - we'll change the convention of letting y be +1, -1 to note 

36
00:02:16,999 --> 00:02:20,159
the class labels. 

37
00:02:20,159 --> 00:02:24,439
So last time, 

38
00:02:24,439 --> 00:02:28,149
we also talked about the functional margin, which was this thing, gamma 

39
00:02:28,149 --> 00:02:33,519
hat i. 

40
00:02:33,519 --> 00:02:36,109


41
00:02:36,109 --> 00:02:38,759
And so we had the intuition that the 

42
00:02:38,759 --> 00:02:42,389
if functional margin is a large positive number, 

43
00:02:42,389 --> 00:02:46,769
then that means that we are classifying a training example correctly and very 

44
00:02:46,769 --> 00:02:48,209
confidently. So 

45
00:02:48,209 --> 00:02:50,389
yi is +1. We 

46
00:02:50,389 --> 00:02:53,319
would like w transpose xi + b to be very large. 

47
00:02:53,319 --> 00:02:54,649
And it makes i - if, excuse me, if 

48
00:02:54,649 --> 00:02:56,779
yi is -1, 

49
00:02:56,779 --> 00:02:59,459
then we'd w transpose xi + b to be a large negative number. So 

50
00:02:59,459 --> 00:03:03,399
we'd sort of like functional margins to be large. We 

51
00:03:03,399 --> 00:03:06,029
also said - functional margin is a strange property - 

52
00:03:06,029 --> 00:03:10,469
that you can increase functional margin just by, say, taking your parameters, 

53
00:03:10,469 --> 00:03:11,359
w and b, 

54
00:03:11,359 --> 00:03:13,909
and multiplying them by 

55
00:03:13,909 --> 00:03:15,809
2. 

56
00:03:15,809 --> 00:03:18,299
And then we also 

57
00:03:18,299 --> 00:03:20,949
defined the geometric margin, 

58
00:03:20,949 --> 00:03:22,709
which 

59
00:03:22,709 --> 00:03:24,659


60
00:03:24,659 --> 00:03:31,659
was 

61
00:03:33,219 --> 00:03:37,389
that we just - essentially, the functional margin 

62
00:03:37,389 --> 00:03:38,499
divided by 

63
00:03:38,499 --> 00:03:40,349
the normal w. 

64
00:03:40,349 --> 00:03:43,449
And so the geometric margin had 

65
00:03:43,449 --> 00:03:46,589
the interpretation as being - I'll give 

66
00:03:46,589 --> 00:03:48,799
you a few examples. The 

67
00:03:48,799 --> 00:03:52,079
geometric margin, for example, is - has 

68
00:03:52,079 --> 00:03:56,139
the interpretation as a distance between a training example 

69
00:03:56,139 --> 00:03:56,909
and a hyperplane. 

70
00:03:56,909 --> 00:04:00,679
And it'll actually be a sin distance, so that this distance will be positive 

71
00:04:00,679 --> 00:04:05,059
if you're classifying the example correctly. And if you misclassify the example, this 

72
00:04:05,059 --> 00:04:07,929
distance - it'll be the minus of the distance, 

73
00:04:07,929 --> 00:04:10,269
reaching the point, reaching the training example. 

74
00:04:10,269 --> 00:04:12,079
And you're separating hyperplane. 

75
00:04:12,079 --> 00:04:16,329
Where you're separating hyperplane is defined by the equation w transpose x 

76
00:04:16,329 --> 00:04:17,620
+ 

77
00:04:17,620 --> 00:04:20,589
b = 

78
00:04:20,589 --> 00:04:24,229
0. 

79
00:04:24,229 --> 00:04:28,089
So - oh, 

80
00:04:28,089 --> 00:04:32,099
well, and I guess 

81
00:04:32,099 --> 00:04:36,539
also defined these things as the functional margin, geometric margins, 

82
00:04:36,539 --> 00:04:38,499
respect to training set I defined as 

83
00:04:38,499 --> 00:04:45,499
the worst case or the minimum functional geometric margin. So in our 

84
00:04:48,719 --> 00:04:52,870
development of the optimal margin classifier, 

85
00:04:52,870 --> 00:04:56,499
our learning algorithm would choose parameters w and b so as to maximize 

86
00:04:56,499 --> 00:05:00,050
the geometric margin. So our goal is to find the separating hyperplane 

87
00:05:00,050 --> 00:05:04,729
that separates the positive and negative examples with as large a distance as possible 

88
00:05:04,729 --> 00:05:06,100
between hyperplane 

89
00:05:06,100 --> 00:05:10,540
and the positive and negative examples. And if you 

90
00:05:10,540 --> 00:05:14,940
go to choose parameters w and b to maximize this, [inaudible] one copy of 

91
00:05:14,940 --> 00:05:16,620
the geometric margin 

92
00:05:16,620 --> 00:05:18,159
is that you 

93
00:05:18,159 --> 00:05:19,989
can actually scale w 

94
00:05:19,989 --> 00:05:24,460
and b arbitrarily. So you look at this definition for the geometric margin. 

95
00:05:24,460 --> 00:05:24,850


96
00:05:24,850 --> 00:05:28,300
I can choose to multiply my parameters w and b 

97
00:05:28,300 --> 00:05:31,629
by 2 or by 10 or any other constant. 

98
00:05:31,629 --> 00:05:33,589
And it doesn't change 

99
00:05:33,589 --> 00:05:35,819
my geometric margin. 

100
00:05:35,819 --> 00:05:38,280
And one way of interpreting that is you're looking 

101
00:05:38,280 --> 00:05:39,379
at 

102
00:05:39,379 --> 00:05:43,099
just separating hyperplane. You look at this line you're separating by positive and negative training examples. If 

103
00:05:43,099 --> 00:05:44,819
I 

104
00:05:44,819 --> 00:05:45,740
scale w 

105
00:05:45,740 --> 00:05:47,400
and b, 

106
00:05:47,400 --> 00:05:49,489
that doesn't change the position of this plane, though 

107
00:05:49,489 --> 00:05:51,629
because the equation wh + 

108
00:05:51,629 --> 00:05:58,629
b = 0 is the same as equation 2 w transpose x + 2b = 0. So it use the same straight 

109
00:05:58,899 --> 00:06:02,589
line. And what that means is that I can actually choose whatever scaling 

110
00:06:02,589 --> 00:06:05,969
for w and b is convenient for me. 

111
00:06:05,969 --> 00:06:07,819
And in particular, 

112
00:06:07,819 --> 00:06:09,199
we use in a minute, 

113
00:06:09,199 --> 00:06:13,600
I can [inaudible] perfect constraint like that the normal w [inaudible] 1 

114
00:06:13,600 --> 00:06:17,180
because this means that you can find a solution to w and b. 

115
00:06:17,180 --> 00:06:21,930
And then by rescaling the parameters, you can easily meet this condition, this rescaled w 

116
00:06:21,930 --> 00:06:23,530
[inaudible] 1. And so I can 

117
00:06:23,530 --> 00:06:28,440
add the condition like this and then essentially not change the problem. Or I 

118
00:06:28,440 --> 00:06:32,469
can add other conditions. I can actually add a 

119
00:06:32,469 --> 00:06:35,120
condition 

120
00:06:35,120 --> 00:06:38,150
that - excuse me, the absolute value of w1 = 1. I can have 

121
00:06:38,150 --> 00:06:41,430
only one of these conditions right now [inaudible]. And adding condition to the absolute 

122
00:06:41,430 --> 00:06:42,419
value - the 

123
00:06:42,419 --> 00:06:45,169
first component of w must be to 1. And again, 

124
00:06:45,169 --> 00:06:48,569
you can find the absolute solution and just rescale w and meet this 

125
00:06:48,569 --> 00:06:50,539
condition. 

126
00:06:50,539 --> 00:06:53,179
And it can have other, 

127
00:06:53,179 --> 00:06:56,300
most esoteric conditions like that 

128
00:06:56,300 --> 00:06:57,250
because again, 

129
00:06:57,250 --> 00:07:01,039
this is a condition that you can solve for the optimal margin, and then just 

130
00:07:01,039 --> 00:07:02,669
by scaling, 

131
00:07:02,669 --> 00:07:04,789
you have w up and down. You can - you can then 

132
00:07:04,789 --> 00:07:08,729
ensure you meet this condition as well. So 

133
00:07:08,729 --> 00:07:12,580
again, [inaudible] one of these conditions right now, not all of them. 

134
00:07:12,580 --> 00:07:15,449
And so our ability to choose 

135
00:07:15,449 --> 00:07:19,330
any scaling condition on w that's convenient to us 

136
00:07:19,330 --> 00:07:23,779
will be useful again in a second. All right. 

137
00:07:23,779 --> 00:07:28,749
So let's go ahead and break down the optimization problem. And again, my goal is to choose 

138
00:07:28,749 --> 00:07:30,279
parameters w and b 

139
00:07:30,279 --> 00:07:34,169
so as to maximize the geometric margin. 

140
00:07:34,169 --> 00:07:38,569
Here's my first attempt at writing down the optimization problem. Actually wrote this one down 

141
00:07:38,569 --> 00:07:40,379
right at 

142
00:07:40,379 --> 00:07:42,520
the end of the previous 

143
00:07:42,520 --> 00:07:46,529
lecture. Begin to solve the parameters gamma w and b 

144
00:07:46,529 --> 00:07:48,300
such that 

145
00:07:48,300 --> 00:07:49,429


146
00:07:49,429 --> 00:07:53,989
- 

147
00:07:53,989 --> 00:07:57,399
that [inaudible] i 

148
00:07:57,399 --> 00:08:01,610
for in training examples. 

149
00:08:01,610 --> 00:08:02,460
Let's say I 

150
00:08:02,460 --> 00:08:06,439
choose to add this normalization condition. 

151
00:08:06,439 --> 00:08:10,629
So the norm condition that w - the normal w is equal to 1 just makes 

152
00:08:10,629 --> 00:08:15,279
the geometric and the functional margin the same. 

153
00:08:15,279 --> 00:08:18,029
And so I'm saying I want to find a value - 

154
00:08:18,029 --> 00:08:22,610
I want to find a value for gamma as big as possible 

155
00:08:22,610 --> 00:08:26,379
so that all of my training examples have functional margin 

156
00:08:26,379 --> 00:08:29,049
greater than or equals gamma, 

157
00:08:29,049 --> 00:08:30,010
and 

158
00:08:30,010 --> 00:08:33,780
with the constraint that normal w equals 1, 

159
00:08:33,780 --> 00:08:36,089
functional margin and geometric margin are the same. 

160
00:08:36,089 --> 00:08:37,760
So it's the same. 

161
00:08:37,760 --> 00:08:39,670
Find the value for gamma so that 

162
00:08:39,670 --> 00:08:43,130
all the values - all the geometric margins are greater or equal to 

163
00:08:43,130 --> 00:08:45,870
gamma. 

164
00:08:45,870 --> 00:08:51,269
So you solve this optimization problem, then you have derived 

165
00:08:51,269 --> 00:08:55,060
the optimal margin classifier - 

166
00:08:55,060 --> 00:08:56,940
that 

167
00:08:56,940 --> 00:09:00,170
there's not a very nice optimization problem because this is a 

168
00:09:00,170 --> 00:09:04,300
nasty, nonconvex constraints. And [inaudible] is asking that you 

169
00:09:04,300 --> 00:09:06,229
solve for parameters w 

170
00:09:06,229 --> 00:09:10,989
that lie on the surface of a unisphere, lie on his [inaudible]. 

171
00:09:10,989 --> 00:09:15,170
It lies on a unicircle - a unisphere. 

172
00:09:15,170 --> 00:09:16,390


173
00:09:16,390 --> 00:09:17,990
And so 

174
00:09:17,990 --> 00:09:21,160
if we can come up with a convex optimization problem, then 

175
00:09:21,160 --> 00:09:24,579
we'd be guaranteed that our [inaudible] descend to other local [inaudible] will 

176
00:09:24,579 --> 00:09:25,460
not have 

177
00:09:25,460 --> 00:09:26,490
local optimal. And 

178
00:09:26,490 --> 00:09:31,480
it turns out this is an example of a nonconvex constraint. This is a nasty constraint 

179
00:09:31,480 --> 00:09:38,480
that I would like to get rid of. So 

180
00:09:40,790 --> 00:09:42,840
let's change the optimization problem 

181
00:09:42,840 --> 00:09:47,650
one more time. 

182
00:09:47,650 --> 00:09:52,180
Now, let me 

183
00:09:52,180 --> 00:09:56,870
pose a slightly different optimization problem. Let 

184
00:09:56,870 --> 00:10:02,680
me maximize the functional margin divided by the normal w 

185
00:10:02,680 --> 00:10:04,870
subject 

186
00:10:04,870 --> 00:10:11,870
to yi w transpose xi. 

187
00:10:12,580 --> 00:10:15,390
So in other words, once you find 

188
00:10:15,390 --> 00:10:17,220
a number, gamma hat, 

189
00:10:17,220 --> 00:10:19,700
so that every one of my training examples has functional margin greater 

190
00:10:19,700 --> 00:10:22,840
than the gamma hat, 

191
00:10:22,840 --> 00:10:27,300
and my optimization objective is I want to maximize gamma hat divided by the normal 

192
00:10:27,300 --> 00:10:28,180


193
00:10:28,180 --> 00:10:30,550
w. And so I wanna maximize the 

194
00:10:30,550 --> 00:10:35,680
function margin divided by the normal w. And we saw previously the function 

195
00:10:35,680 --> 00:10:38,390
margin divided by the normal 

196
00:10:38,390 --> 00:10:41,780
w is just a geometric margin, and so this is a different way of posing 

197
00:10:41,780 --> 00:10:44,610
the same 

198
00:10:44,610 --> 00:10:46,780
optimization problem. [Inaudible] confused, though. Are there questions 

199
00:10:46,780 --> 00:10:53,710
about this? 

200
00:10:53,710 --> 00:10:53,960


201
00:10:53,940 --> 00:10:56,590
Student:[Inaudible] the second statement has to be made of the functional margin y 

202
00:10:56,590 --> 00:10:58,900
divided by - why don't you just have it 

203
00:10:58,900 --> 00:11:02,280
the geometric 

204
00:11:02,280 --> 00:11:04,800
margin? Why do 

205
00:11:04,800 --> 00:11:09,450
you [inaudible]? Instructor (Andrew Ng):[Inaudible] say it again? Student:For the second statement, where we're saying the data of the functional margin is divided [inaudible]. Instructor (Andrew 

206
00:11:09,450 --> 00:11:10,740
Ng):Oh, I see, yes. Student:[Inaudible] 

207
00:11:10,740 --> 00:11:12,730


208
00:11:12,730 --> 00:11:17,050
is that [inaudible]? Instructor (Andrew Ng):So let's see, this is the function margin, right? This is not the geometric margin. 

209
00:11:17,050 --> 00:11:18,820
Student:Yeah. 

210
00:11:18,820 --> 00:11:25,820
Instructor (Andrew Ng):So - oh, I want to divide by the normal w of my optimization objective. Student:I'm just wondering how come you end up dividing also under the second stage [inaudible] the functional 

211
00:11:26,040 --> 00:11:27,580
margin. 

212
00:11:27,580 --> 00:11:31,780
Why are you dividing there by the normal w? Instructor (Andrew Ng):Let's see. I'm not sure I get the question. Let me 

213
00:11:31,780 --> 00:11:36,210
try saying 

214
00:11:36,210 --> 00:11:40,390
this again. So here's my goal. My - I want [inaudible]. So 

215
00:11:40,390 --> 00:11:42,110
let's see, 

216
00:11:42,110 --> 00:11:46,750
the parameters of this optimization problem where gamma hat w and b - so 

217
00:11:46,750 --> 00:11:49,610
the convex optimization software 

218
00:11:49,610 --> 00:11:52,399
solves this problem for some set of parameters gamma 

219
00:11:52,399 --> 00:11:54,260
w and b. 

220
00:11:54,260 --> 00:11:57,120
And I'm imposing the constraint that 

221
00:11:57,120 --> 00:11:59,050
whatever values it comes up with, 

222
00:11:59,050 --> 00:12:03,750
yi x [inaudible] x5 + b must be greater than gamma hat. 

223
00:12:03,750 --> 00:12:05,889
And so this means that 

224
00:12:05,889 --> 00:12:08,640
the functional margin of every example had 

225
00:12:08,640 --> 00:12:09,310
better be 

226
00:12:09,310 --> 00:12:11,070
greater than equal to gamma hat. So 

227
00:12:11,070 --> 00:12:14,300
there's a constraint to the function margin and a constraint to the gamma hat. 

228
00:12:14,300 --> 00:12:17,950
But what I care about is not really maximizing the functional margin. What I 

229
00:12:17,950 --> 00:12:19,010
really care about - 

230
00:12:19,010 --> 00:12:21,430
in other words, in optimization objective, 

231
00:12:21,430 --> 00:12:25,110
is maximizing gamma hat divided by the normal w, 

232
00:12:25,110 --> 00:12:27,890
which is the geometric margin. 

233
00:12:27,890 --> 00:12:32,370
So in other words, my optimization [inaudible] is I want to maximize the function margin 

234
00:12:32,370 --> 00:12:35,210
divided by the normal 

235
00:12:35,210 --> 00:12:38,230
w. Subject to that, every example must have 

236
00:12:38,230 --> 00:12:40,290
function margin and at least gamma hat. Does that make 

237
00:12:40,290 --> 00:12:46,120
sense now? Student:[Inaudible] when you 

238
00:12:46,120 --> 00:12:48,000
said that to maximize gamma 

239
00:12:48,000 --> 00:12:50,580
or gamma hat, respect to gamma w and with 

240
00:12:50,580 --> 00:12:51,640
respect to gamma hat 

241
00:12:51,640 --> 00:12:56,820
so that 

242
00:12:56,820 --> 00:13:01,490
[inaudible] gamma hat are no 

243
00:13:01,490 --> 00:13:04,920
longer [inaudible]? Instructor (Andrew Ng):So this is the - 

244
00:13:04,920 --> 00:13:06,800
so it turns out - 

245
00:13:06,800 --> 00:13:11,020
so this is how I write down the - this is how I write down an optimization problem 

246
00:13:11,020 --> 00:13:12,790
in order to solve for 

247
00:13:12,790 --> 00:13:15,950
the geometric margin. What is it - 

248
00:13:15,950 --> 00:13:18,450
so it turns out that 

249
00:13:18,450 --> 00:13:21,410
the question of this - is the gamma hat the function of w and b? And it turns out 

250
00:13:21,410 --> 00:13:22,250
that 

251
00:13:22,250 --> 00:13:25,130
in my previous mathematical definition, it was, 

252
00:13:25,130 --> 00:13:29,590
but the way I'm going to pose this as an optimization problem is 

253
00:13:29,590 --> 00:13:34,630
I'm going to ask the convex optimization solvers - and this [inaudible] software - unless you have software for solving 

254
00:13:34,630 --> 00:13:37,920
convex optimization problems - 

255
00:13:37,920 --> 00:13:39,550
hen I'm going to 

256
00:13:39,550 --> 00:13:43,930
pretend that these are independent variables and ask my convex optimization software 

257
00:13:43,930 --> 00:13:46,420
to find me values for gamma, w, and b, 

258
00:13:46,420 --> 00:13:50,610
to make this value as big as possible and subject to this constraint. 

259
00:13:50,610 --> 00:13:52,670
And it'll turn out that 

260
00:13:52,670 --> 00:13:55,329
when it does that, it will choose - or 

261
00:13:55,329 --> 00:13:58,790
obviously, it will choose for gamma to be as big as possible 

262
00:13:58,790 --> 00:14:01,340
because optimization objective is this: 

263
00:14:01,340 --> 00:14:02,730
You're trying to maximize gamma hat. 

264
00:14:02,730 --> 00:14:05,540
So for x value of w and b, 

265
00:14:05,540 --> 00:14:09,140
my software, which choose to make gamma hat as big as possible - 

266
00:14:09,140 --> 00:14:12,590
well, but how big can we make gamma hat? Well, it's limited by use 

267
00:14:12,590 --> 00:14:15,460
constraints. It says that every training example 

268
00:14:15,460 --> 00:14:17,079
must have function margin 

269
00:14:17,079 --> 00:14:20,380
greater than equal to gamma hat. 

270
00:14:20,380 --> 00:14:23,950
And so my - the bigger you can make gamma hat 

271
00:14:23,950 --> 00:14:27,880
will be the value of the smallest functional margin. 

272
00:14:27,880 --> 00:14:30,220
And so when you solve this optimization problem, 

273
00:14:30,220 --> 00:14:33,720
the value of gamma hat you get out will be, indeed, 

274
00:14:33,720 --> 00:14:38,590
the minimum of the functional margins of your training set. Okay, so 

275
00:14:38,590 --> 00:14:40,530
Justin? Student:Yeah, I was just 

276
00:14:40,530 --> 00:14:41,760
wondering, I 

277
00:14:41,760 --> 00:14:46,200
guess I'm a little confused because it's like, okay, you have two class of data. And you can say, "Okay, please draw me a line 

278
00:14:46,200 --> 00:14:50,180
such that you maximize the distance between - the smallest distance that [inaudible] between the line and 

279
00:14:50,180 --> 00:14:52,190
the 

280
00:14:52,190 --> 00:14:53,780
data points." 

281
00:14:53,780 --> 00:14:56,370
And it seems like that's kind of what we're doing, but it's - it seems like 

282
00:14:56,370 --> 00:15:00,200
this is more complicated than that. And I guess I'm wondering what 

283
00:15:00,200 --> 00:15:04,109
is the difference. Instructor (Andrew Ng):I see. So I mean, this is - the question is [inaudible]. Two class of data - trying to find separate hyperplane. And 

284
00:15:04,109 --> 00:15:08,520
this seems 

285
00:15:08,520 --> 00:15:12,420
more complicated than trying to find a line [inaudible]. So I'm 

286
00:15:12,420 --> 00:15:17,550
just repeating the questions in case - since I'm not sure how all the audio catches it. 

287
00:15:17,550 --> 00:15:21,380
So the answer is this is actually exactly that problem. This is exactly that problem 

288
00:15:21,380 --> 00:15:22,710
of 

289
00:15:22,710 --> 00:15:26,620
given the two class of data, positive and negative examples, 

290
00:15:26,620 --> 00:15:29,900
this is exactly the formalization of the problem 

291
00:15:29,900 --> 00:15:31,519
where I go is to find 

292
00:15:31,519 --> 00:15:35,880
a line that separates the two - the positive and negative examples, 

293
00:15:35,880 --> 00:15:42,880
maximizing the worst-case distance between the [inaudible] point and this line. Okay? Yeah, [Inaudible]? Student:So why do you care about the worst-case 

294
00:15:43,040 --> 00:15:43,740
distance [inaudible]? 

295
00:15:43,740 --> 00:15:46,270
Instructor (Andrew Ng):Yeah, let me - for now, why do we care about the worst-case distance? For now, 

296
00:15:46,270 --> 00:15:48,230


297
00:15:48,230 --> 00:15:52,590
let's just say - let's just care about the worst-case distance for now. We'll come back, 

298
00:15:52,590 --> 00:15:55,330
and we'll fix that later. We'll - that's a - 

299
00:15:55,330 --> 00:15:57,850
caring about the worst case is is just - 

300
00:15:57,850 --> 00:16:01,240
is just a nice way to formulate this optimization problem. I'll come back, and I'll 

301
00:16:01,240 --> 00:16:03,580
change that later. Okay, 

302
00:16:03,580 --> 00:16:10,100
raise your hand if this makes sense - if this formulation makes sense? Okay, yeah, cool. 

303
00:16:10,100 --> 00:16:13,830
Great. So let's see - 

304
00:16:13,830 --> 00:16:15,910
so this is just a different way of posing 

305
00:16:15,910 --> 00:16:18,160
the same optimization problem. And 

306
00:16:18,160 --> 00:16:22,339
on the one hand, I've got to get rid of this nasty, nonconvex constraint, while on 

307
00:16:22,339 --> 00:16:23,880
the other hand, I've now 

308
00:16:23,880 --> 00:16:24,830
added 

309
00:16:24,830 --> 00:16:30,200
a nasty, nonconvex objective. In particular, this is not a convex function in parameters 

310
00:16:30,200 --> 00:16:31,010
w. 

311
00:16:31,010 --> 00:16:32,610
And so you can't - 

312
00:16:32,610 --> 00:16:36,670
you don't have the usual guarantees like if you [inaudible] 

313
00:16:36,670 --> 00:16:38,240
global minimum. 

314
00:16:38,240 --> 00:16:40,170
At least that 

315
00:16:40,170 --> 00:16:45,190
does not follow immediately from this because this is nonconvex. 

316
00:16:45,190 --> 00:16:47,840


317
00:16:47,840 --> 00:16:51,380
So what 

318
00:16:51,380 --> 00:16:54,810
I'm going to do is, 

319
00:16:54,810 --> 00:16:56,940
earlier, I said that can pose 

320
00:16:56,940 --> 00:16:57,770
any of 

321
00:16:57,770 --> 00:17:02,630
a number of even fairly bizarre scaling constraints on w. So you can 

322
00:17:02,630 --> 00:17:06,100
choose any scaling constraint like this, and things are still fine. 

323
00:17:06,100 --> 00:17:07,660
And so here's 

324
00:17:07,660 --> 00:17:14,089
the scaling I'm going to choose to add. Again, I'm 

325
00:17:14,089 --> 00:17:18,380
gonna assume for the purposes of today's lecture, I'm gonna assume that these examples are linearly 

326
00:17:18,380 --> 00:17:19,580
separable, that 

327
00:17:19,580 --> 00:17:23,290
you can actually separate the positive and negative classes, and that we'll come back and 

328
00:17:23,290 --> 00:17:25,580
fix this later as well. 

329
00:17:25,580 --> 00:17:28,700
But here's the scaling constraint I want to impose on w. I want 

330
00:17:28,700 --> 00:17:33,840
to impose a constraint that 

331
00:17:33,840 --> 00:17:37,840
the functional margin is equal to 1. 

332
00:17:37,840 --> 00:17:40,059
And another way of writing that 

333
00:17:40,059 --> 00:17:44,380
is that I want to impose a constraint that min 

334
00:17:44,380 --> 00:17:51,380
over i, yi - 

335
00:17:52,380 --> 00:17:55,170
that in the worst case, function y is over 1. 

336
00:17:55,170 --> 00:17:59,390
And clearly, this is a scaling constraint because if 

337
00:17:59,390 --> 00:18:01,740


338
00:18:01,740 --> 00:18:05,820
you solve for w and b, and you find that your worst-case function margin is actually 10 or 

339
00:18:05,820 --> 00:18:06,790
whatever, 

340
00:18:06,790 --> 00:18:10,099
then by dividing through w and b by a factor of 10, I can get 

341
00:18:10,099 --> 00:18:13,210
my functional margin to be over 1. 

342
00:18:13,210 --> 00:18:15,580
So this is a scaling constraint [inaudible] would 

343
00:18:15,580 --> 00:18:17,289
imply. And this is 

344
00:18:17,289 --> 00:18:19,220
just more compactly written 

345
00:18:19,220 --> 00:18:22,590
as follows. This is imposing a constraint that the functional margin be 

346
00:18:22,590 --> 00:18:25,520
equal to 1. 

347
00:18:25,520 --> 00:18:28,770
And so if we just take 

348
00:18:28,770 --> 00:18:32,230
what I wrote down as No. 2 of our previous optimization problem and add the 

349
00:18:32,230 --> 00:18:34,510
scaling constraint, 

350
00:18:34,510 --> 00:18:38,400
we then get the following optimization problem: 

351
00:18:38,400 --> 00:18:45,400
min over wb. 

352
00:18:57,360 --> 00:19:00,520
I guess previously, we had a maximization 

353
00:19:00,520 --> 00:19:05,500
over gamma hats divided by the normal w. So those maximize 

354
00:19:05,500 --> 00:19:10,240
1 over the normal w, but so that's the same as minimizing the normal w squared. It was great. Maximum 

355
00:19:10,240 --> 00:19:10,720
normal w 

356
00:19:10,720 --> 00:19:14,510
is min w - normal w squared. And then these 

357
00:19:14,510 --> 00:19:16,400
are our constraints. 

358
00:19:16,400 --> 00:19:20,980
Since I've added the constraint, the functional margin 

359
00:19:20,980 --> 00:19:23,570
is over 1. And this is actually 

360
00:19:23,570 --> 00:19:26,830
my 

361
00:19:26,830 --> 00:19:30,440
final - well, 

362
00:19:30,440 --> 00:19:36,960
final formulation of the optimal margin classifier problem, at least for now. 

363
00:19:36,960 --> 00:19:38,250
So the picture 

364
00:19:38,250 --> 00:19:40,640
to 

365
00:19:40,640 --> 00:19:44,540
keep in mind for this, I guess, is that our 

366
00:19:44,540 --> 00:19:48,450
optimization objective is once you minimize the normal w. And so our 

367
00:19:48,450 --> 00:19:50,020
optimization objective 

368
00:19:50,020 --> 00:19:52,450
is just the [inaudible] quadratic function. 

369
00:19:52,450 --> 00:19:55,300
And [inaudible] those pictures [inaudible] can draw 

370
00:19:55,300 --> 00:19:56,659
it. So it - 

371
00:19:56,659 --> 00:19:58,850
if [inaudible] is w1 and w2, and you 

372
00:19:58,850 --> 00:20:02,640
want to minimize the quadratic function like this - so quadratic function 

373
00:20:02,640 --> 00:20:06,250
just has [inaudible] that look like this. 

374
00:20:06,250 --> 00:20:10,190
And moreover, you have a number of linear constraints in your parameters, so you may have linear 

375
00:20:10,190 --> 00:20:11,940
constraints that eliminates that half space or 

376
00:20:11,940 --> 00:20:17,260
linear constraint eliminates that half space [inaudible]. So there's that half space 

377
00:20:17,260 --> 00:20:19,560


378
00:20:19,560 --> 00:20:22,570
and so on. 

379
00:20:22,570 --> 00:20:25,610
And so the picture is you have a quadratic function, 

380
00:20:25,610 --> 00:20:27,020
and you're ruling out 

381
00:20:27,020 --> 00:20:28,639


382
00:20:28,639 --> 00:20:33,540
various half spaces where each of these linear constraints. And I hope - if 

383
00:20:33,540 --> 00:20:38,080
you can picture this in 3D, I guess [inaudible] kinda draw our own 3D, hope you can 

384
00:20:38,080 --> 00:20:43,250
convince yourself that this is a convex problem that has no local optimum. But they 

385
00:20:43,250 --> 00:20:45,640
be run great 

386
00:20:45,640 --> 00:20:49,850
[inaudible] within this set of points that hasn't ruled out, then you convert to the 

387
00:20:49,850 --> 00:20:53,400
global optimum. 

388
00:20:53,400 --> 00:20:56,180
And so that's the convex optimization problem. 

389
00:20:56,180 --> 00:20:58,600
The - does this [inaudible] nice and 

390
00:20:58,600 --> 00:21:02,390
[inaudible]. 

391
00:21:02,390 --> 00:21:09,390
Questions about this? 

392
00:21:17,080 --> 00:21:24,080
Actually, just raise your hand if this makes sense. Okay, cool. 

393
00:21:25,799 --> 00:21:30,250
So this gives you the optimal margin classifier algorithm. 

394
00:21:30,250 --> 00:21:31,500
And it turns out that 

395
00:21:31,500 --> 00:21:33,789
this is the convex optimization problem, 

396
00:21:33,789 --> 00:21:36,970
so you can actually take this formulation of the problem 

397
00:21:36,970 --> 00:21:41,410
and throw it at off-the-shelf software - what's called a QP or quadratic 

398
00:21:41,410 --> 00:21:45,039
program software. This [inaudible] optimization is called a quadratic program, 

399
00:21:45,039 --> 00:21:48,720
where the quadratic convex objective function and [inaudible] constraints - 

400
00:21:48,720 --> 00:21:53,700
so you can actually download software to solve these optimization problems for you. 

401
00:21:53,700 --> 00:21:55,810
Usually, as you wanna use the - 

402
00:21:55,810 --> 00:21:56,880
use 

403
00:21:56,880 --> 00:22:00,350
[inaudible] because you have constraints like these, although you could actually modify [inaudible] work with 

404
00:22:00,350 --> 00:22:03,520
this, too. 

405
00:22:03,520 --> 00:22:06,500
So we could just declare success and say that we're done with this formulation of the 

406
00:22:06,500 --> 00:22:07,860
problem. But 

407
00:22:07,860 --> 00:22:10,899
what I'm going to do now is take a 

408
00:22:10,899 --> 00:22:14,680
digression to talk about primal and duo optimization problems. 

409
00:22:14,680 --> 00:22:16,160
And in particular, I'm going to 

410
00:22:16,160 --> 00:22:17,970
- 

411
00:22:17,970 --> 00:22:21,700
later, I'm going to come back and derive yet another very different form 

412
00:22:21,700 --> 00:22:24,640
of this optimization problem. 

413
00:22:24,640 --> 00:22:29,130
And the reason we'll do that is because it turns out this optimization problem has 

414
00:22:29,130 --> 00:22:33,050
certain properties that make it amenable to very efficient algorithms. 

415
00:22:33,050 --> 00:22:35,320
And moreover, I'll be deriving 

416
00:22:35,320 --> 00:22:37,430
what's called the duo formulation of this 

417
00:22:37,430 --> 00:22:39,830
that allows us to 

418
00:22:39,830 --> 00:22:42,850
apply the optimal margin classifier even in 

419
00:22:42,850 --> 00:22:46,950
very high-dimensional feature spaces - even in sometimes infinite 

420
00:22:46,950 --> 00:22:49,730
dimensional feature spaces. So we 

421
00:22:49,730 --> 00:22:51,780
can come back to that later. 

422
00:22:51,780 --> 00:22:53,190
But 

423
00:22:53,190 --> 00:22:54,450
let me know, 

424
00:22:54,450 --> 00:23:01,450
since I'm talking about 

425
00:23:03,470 --> 00:23:06,480
convex optimization. So how many here is - how many of you, from, I don't 

426
00:23:06,480 --> 00:23:08,080
know, calculus, 

427
00:23:08,080 --> 00:23:12,120
remember the method of Lagrange multipliers for 

428
00:23:12,120 --> 00:23:13,910
solving an optimization problem 

429
00:23:13,910 --> 00:23:18,299
like minimum - minimization, maximization problem subject to some constraint? 

430
00:23:18,299 --> 00:23:22,740
How many of you remember the method of Lagrange multipliers for that? Oh, okay, 

431
00:23:22,740 --> 00:23:24,070
cool. Some of you, yeah. 

432
00:23:24,070 --> 00:23:27,710
So if you don't remember, don't worry. I - I'll describe that briefly 

433
00:23:27,710 --> 00:23:28,390
here 

434
00:23:28,390 --> 00:23:32,039
as well, but what I'm really gonna do is talk about the generalization of this method 

435
00:23:32,039 --> 00:23:33,710
of Lagrange multipliers 

436
00:23:33,710 --> 00:23:37,659
that you may or may not have seen in some calculus classes. But if you haven't 

437
00:23:37,659 --> 00:23:40,080
seen it before, don't worry about it. 

438
00:23:40,080 --> 00:23:43,350


439
00:23:43,350 --> 00:23:49,159
So the method of Lagrange multipliers is - was - well, suppose there's some 

440
00:23:49,159 --> 00:23:52,950
function you want to minimize, or minimize f of w. 

441
00:23:52,950 --> 00:23:55,660
We're subject to 

442
00:23:55,660 --> 00:24:02,660
some set of constraints that each i of w must equal 0 - for i = 1 [inaudible] l. And 

443
00:24:03,400 --> 00:24:06,770
given this constraint, I'll actually usually write it in vectorial 

444
00:24:06,770 --> 00:24:09,550
form in which I write h of w 

445
00:24:09,550 --> 00:24:11,429
as this 

446
00:24:11,429 --> 00:24:16,980
vector value function. 

447
00:24:16,980 --> 00:24:20,919
So that is equal to 0, where 0 is the arrow on top. I used 

448
00:24:20,919 --> 00:24:24,600
that to denote the vector of 

449
00:24:24,600 --> 00:24:27,200
all 0s. So you want to solve this optimization problem. 

450
00:24:27,200 --> 00:24:28,880


451
00:24:28,880 --> 00:24:32,130
Some of you have seen method of Lagrange multipliers where 

452
00:24:32,130 --> 00:24:34,880
you construct this 

453
00:24:34,880 --> 00:24:39,470
[inaudible] Lagrange, 

454
00:24:39,470 --> 00:24:45,980


455
00:24:45,980 --> 00:24:49,810
which is the original optimization objective plus some [inaudible] Lagrange multipliers the 

456
00:24:49,810 --> 00:24:51,950
highest constraints. 

457
00:24:51,950 --> 00:24:54,720
And these parameters - they derive - 

458
00:24:54,720 --> 00:25:01,570
we call the Lagrange multipliers. 

459
00:25:01,570 --> 00:25:06,320
And so the way you actually solve the optimization problem is 

460
00:25:06,320 --> 00:25:10,830
you take the partial derivative of this with respect to the original parameters 

461
00:25:10,830 --> 00:25:13,740
and set that to 0. So 

462
00:25:13,740 --> 00:25:19,710
the partial derivative with respect to your Lagrange multipliers [inaudible], and set that to 0. 

463
00:25:19,710 --> 00:25:23,630
And then the same as theorem through [inaudible], I guess [inaudible] 

464
00:25:23,630 --> 00:25:24,510
Lagrange 

465
00:25:24,510 --> 00:25:30,800
was that for w - for some value w star to get a 

466
00:25:30,800 --> 00:25:32,380
solution, 

467
00:25:32,380 --> 00:25:34,660
it 

468
00:25:34,660 --> 00:25:41,660
is necessary 

469
00:25:42,970 --> 00:25:48,299
that - can this 

470
00:25:48,299 --> 00:25:48,840
be 

471
00:25:48,840 --> 00:25:50,200


472
00:25:50,200 --> 00:25:53,510
the star? Student:Right. Instructor (Andrew Ng):The backwards e - there exists. So there exists 

473
00:25:53,510 --> 00:25:54,750
beta star 

474
00:25:54,750 --> 00:26:00,420


475
00:26:00,420 --> 00:26:07,420


476
00:26:17,300 --> 00:26:19,940
such that those partial derivatives are 

477
00:26:19,940 --> 00:26:21,500
equal to 

478
00:26:21,500 --> 00:26:22,250
0. 

479
00:26:22,250 --> 00:26:24,020
So the 

480
00:26:24,020 --> 00:26:25,400
method 

481
00:26:25,400 --> 00:26:28,510
of Lagrange multipliers is to solve this problem, 

482
00:26:28,510 --> 00:26:30,710
you construct a Lagrange, 

483
00:26:30,710 --> 00:26:32,580
take the derivative with respect to 

484
00:26:32,580 --> 00:26:34,520
the original parameters b, the original 

485
00:26:34,520 --> 00:26:39,809
parameters w, and with respect to the Lagrange multipliers beta. 

486
00:26:39,809 --> 00:26:42,139
Set the partial derivatives equal to 0, and solve 

487
00:26:42,139 --> 00:26:45,460
for our solutions. And then you check each of the solutions to see if it is indeed a minimum. 

488
00:26:45,460 --> 00:26:48,500
Great. 

489
00:26:48,500 --> 00:26:52,380
So great 

490
00:26:52,380 --> 00:26:56,790
- 

491
00:26:56,790 --> 00:26:59,110
so what I'm going to do is actually 

492
00:26:59,110 --> 00:27:00,420
write 

493
00:27:00,420 --> 00:27:02,760
down the generalization of this. And 

494
00:27:02,760 --> 00:27:06,090
if you haven't seen that before, don't worry about it. This is [inaudible]. 

495
00:27:06,090 --> 00:27:11,000
So what I'm going to do is actually write down the generalization of this to solve 

496
00:27:11,000 --> 00:27:15,870
a slightly more difficult type of constraint optimization problem, 

497
00:27:15,870 --> 00:27:20,370
which is suppose you want to minimize f of w 

498
00:27:20,370 --> 00:27:24,039
subject to the constraint that gi of w, 

499
00:27:24,039 --> 00:27:25,539
excuse me, 

500
00:27:25,539 --> 00:27:30,780
is less than equal to 

501
00:27:30,780 --> 00:27:34,940
0, and that hi of w is equal to 0. 

502
00:27:34,940 --> 00:27:38,280
And 

503
00:27:38,280 --> 00:27:43,230
again, using my vector notation, I'll write this as g of w is equal to 0. And h of w is equal to 0. 

504
00:27:43,230 --> 00:27:44,710
So 

505
00:27:44,710 --> 00:27:48,940
in [inaudible]'s 

506
00:27:48,940 --> 00:27:52,169
case, we now have inequality for constraint as well as 

507
00:27:52,169 --> 00:27:59,169
equality constraint. 

508
00:28:01,620 --> 00:28:02,779
I then have 

509
00:28:02,779 --> 00:28:06,719
a Lagrange, or it's actually still - called - say generalized 

510
00:28:06,719 --> 00:28:08,370
Lagrange, 

511
00:28:08,370 --> 00:28:13,040
which is now a function of my original optimization for parameters w, 

512
00:28:13,040 --> 00:28:18,050
as well as two sets of Lagrange multipliers, alpha and beta. 

513
00:28:18,050 --> 00:28:19,799
And so this will be 

514
00:28:19,799 --> 00:28:26,799
f of w. 

515
00:28:28,650 --> 00:28:32,549
Now, 

516
00:28:32,549 --> 00:28:36,559


517
00:28:36,559 --> 00:28:39,350
here's a 

518
00:28:39,350 --> 00:28:40,220
cool part. 

519
00:28:40,220 --> 00:28:42,650
I'm going to define theta 

520
00:28:42,650 --> 00:28:43,679


521
00:28:43,679 --> 00:28:45,990
subscript p of 

522
00:28:45,990 --> 00:28:46,690
w 

523
00:28:46,690 --> 00:28:49,030
to be equal to 

524
00:28:49,030 --> 00:28:52,030
max of alpha beta subject to the 

525
00:28:52,030 --> 00:28:59,030
constraints that the alphas are, beta equal 

526
00:28:59,090 --> 00:29:06,090
to 0 of the Lagrange. 

527
00:29:10,480 --> 00:29:15,490
And so 

528
00:29:15,490 --> 00:29:17,910
I want you to consider 

529
00:29:17,910 --> 00:29:21,610
the optimization problem 

530
00:29:21,610 --> 00:29:24,350
min over w of 

531
00:29:24,350 --> 00:29:31,350
max over alpha beta, such that the alpha is a greater than 0 of the Lagrange. And 

532
00:29:31,700 --> 00:29:35,690
that's just equal to min over w, 

533
00:29:35,690 --> 00:29:42,000
theta p of 

534
00:29:42,000 --> 00:29:47,990
w. And just to give us a name, the [inaudible] - the subscript p here is a sense of 

535
00:29:47,990 --> 00:29:49,590
primal problem. 

536
00:29:49,590 --> 00:29:56,590
And that refers to this entire thing. 

537
00:29:56,920 --> 00:30:01,080
This optimization problem that written down is called a primal problem. This means 

538
00:30:01,080 --> 00:30:04,270
there's the original optimization problem in which [inaudible] solving. 

539
00:30:04,270 --> 00:30:08,450
And later on, I'll derive in another version of this, but that's what p 

540
00:30:08,450 --> 00:30:13,360
stands for. It's a - this is a primal problem. 

541
00:30:13,360 --> 00:30:17,690
Now, I want you to look at - consider theta over p again. And in particular, I wanna 

542
00:30:17,690 --> 00:30:21,510
consider the problem of what happens if you minimize w 

543
00:30:21,510 --> 00:30:24,070
- minimize as a function of w 

544
00:30:24,070 --> 00:30:27,340
this quantity theta over p. 

545
00:30:27,340 --> 00:30:34,160


546
00:30:34,160 --> 00:30:37,510
So let's look at what theta p of w is. Notice 

547
00:30:37,510 --> 00:30:39,659
that 

548
00:30:39,659 --> 00:30:41,620
if 

549
00:30:41,620 --> 00:30:43,480
gi of w 

550
00:30:43,480 --> 00:30:47,039
is greater than 0, so let's pick the value of w. 

551
00:30:47,039 --> 00:30:49,779
And let's ask what is the state of p of w? 

552
00:30:49,779 --> 00:30:56,779
So if w violates one of your primal problems constraints, 

553
00:30:57,050 --> 00:31:00,280


554
00:31:00,280 --> 00:31:03,670
then state of p of w would be infinity. 

555
00:31:03,670 --> 00:31:05,780
Why is that? 

556
00:31:05,780 --> 00:31:09,440
[Inaudible] p [inaudible] second. 

557
00:31:09,440 --> 00:31:12,800
Suppose I pick a value of w that violates one of these constraints. So gi of 

558
00:31:12,800 --> 00:31:16,880
w is positive. 

559
00:31:16,880 --> 00:31:21,840
Then - well, theta p is this - maximize this function of alpha and beta - the Lagrange. So 

560
00:31:21,840 --> 00:31:25,740
one of these gi of w's is this positive, 

561
00:31:25,740 --> 00:31:30,300
then by setting the other responding alpha i to plus infinity, I can make this 

562
00:31:30,300 --> 00:31:31,940
arbitrarily large. 

563
00:31:31,940 --> 00:31:34,539
And so if w violates one of my 

564
00:31:34,539 --> 00:31:37,770
primal problem's constraints in one of the gis, then 

565
00:31:37,770 --> 00:31:42,290
max over alpha of this Lagrange will be plus 

566
00:31:42,290 --> 00:31:49,290
infinity. There's some of - and in the same way - I guess in a similar way, 

567
00:31:50,500 --> 00:31:55,240
if hi of w is not equal to 

568
00:31:55,240 --> 00:31:57,690
0, 

569
00:31:57,690 --> 00:32:02,669
then theta p of w also be infinity for a very similar reason because 

570
00:32:02,669 --> 00:32:05,650
if hi of w is not equal to 0 for some value of i, 

571
00:32:05,650 --> 00:32:09,990
then in my Lagrange, I had a beta i x hi theorem. 

572
00:32:09,990 --> 00:32:13,480
And so by setting beta i to be plus infinity or minus 

573
00:32:13,480 --> 00:32:15,830
infinity depending on the sign of hi, 

574
00:32:15,830 --> 00:32:20,310
I can make this plus infinity as well. 

575
00:32:20,310 --> 00:32:22,190
And otherwise, 

576
00:32:22,190 --> 00:32:29,190


577
00:32:30,400 --> 00:32:33,730


578
00:32:33,730 --> 00:32:38,270
theta p of w is just equal to f of w. Turns 

579
00:32:38,270 --> 00:32:39,649
out if 

580
00:32:39,649 --> 00:32:41,799
I had a value of w that 

581
00:32:41,799 --> 00:32:44,940
satisfies all of the gi and the hi constraints, 

582
00:32:44,940 --> 00:32:47,749
then we maximize in terms of alpha and beta 

583
00:32:47,749 --> 00:32:48,950
- all the Lagrange multiply 

584
00:32:48,950 --> 00:32:53,830
theorems will actually be obtained by 

585
00:32:53,830 --> 00:32:57,600
setting all the Lagrange multiply theorems to be 0, 

586
00:32:57,600 --> 00:32:59,190
and so theta p 

587
00:32:59,190 --> 00:33:03,590
just left with f of w. Thus, theta 

588
00:33:03,590 --> 00:33:06,400
p 

589
00:33:06,400 --> 00:33:11,490
of w is equal to 

590
00:33:11,490 --> 00:33:13,540
f of w 

591
00:33:13,540 --> 00:33:16,610
if constraints are 

592
00:33:16,610 --> 00:33:17,929
satisfied 

593
00:33:17,929 --> 00:33:21,330
[inaudible] 

594
00:33:21,330 --> 00:33:22,540
the gi in 

595
00:33:22,540 --> 00:33:24,409
hi constraints, 

596
00:33:24,409 --> 00:33:28,270
and is equal to plus infinity 

597
00:33:28,270 --> 00:33:31,770
otherwise. 

598
00:33:31,770 --> 00:33:34,500
So the problem I 

599
00:33:34,500 --> 00:33:38,619
wrote down that minimizes the function of w - 

600
00:33:38,619 --> 00:33:44,630
theta p of w - this is [inaudible] problem. That's 

601
00:33:44,630 --> 00:33:47,540
just 

602
00:33:47,540 --> 00:33:50,639
exactly the same problem as my original primal problem 

603
00:33:50,639 --> 00:33:55,010
because if you choose a value of w that violates the constraints, you get infinity. 

604
00:33:55,010 --> 00:33:57,920
And if you satisfy the constraints, you get f of w. 

605
00:33:57,920 --> 00:33:59,809
So this is really just the same as - well, 

606
00:33:59,809 --> 00:34:00,980
we'll say, 

607
00:34:00,980 --> 00:34:05,100
"Satisfy the constraints, and minimize f of w." That's really what 

608
00:34:05,100 --> 00:34:08,609
minimizing the state of p 

609
00:34:08,609 --> 00:34:15,609
of w is. Raise your hand if this makes sense. Yeah, okay, cool. So all right. 

610
00:34:24,699 --> 00:34:28,339
I hope no one's getting mad at me because I'm doing so much work, and when we come back, it'll be exactly 

611
00:34:28,339 --> 00:34:31,159
the same thing we started with. So here's 

612
00:34:31,159 --> 00:34:33,499
the cool part. 

613
00:34:33,499 --> 00:34:40,289
Let me know if you find it in your problem. To find 

614
00:34:40,289 --> 00:34:41,909
theta 

615
00:34:41,909 --> 00:34:45,059
d 

616
00:34:45,059 --> 00:34:47,129
and d [inaudible] duo, 

617
00:34:47,129 --> 00:34:50,409
and this is how the function of alpha and beta is. It's not the function of the Lagrange 

618
00:34:50,409 --> 00:34:52,589
multipliers. It's not of w. 

619
00:34:52,589 --> 00:34:58,229
To find this, we minimize over w of 

620
00:34:58,229 --> 00:35:05,229
my generalized Lagrange. And 

621
00:35:06,309 --> 00:35:13,309
my duo problem is this. So in 

622
00:35:16,729 --> 00:35:20,359
other words, this is max over 

623
00:35:20,359 --> 00:35:21,859
that. 

624
00:35:21,859 --> 00:35:23,130


625
00:35:23,130 --> 00:35:28,239


626
00:35:28,239 --> 00:35:33,019
And so this is my duo optimization problem. To maximize over alpha and 

627
00:35:33,019 --> 00:35:34,059
beta, theta 

628
00:35:34,059 --> 00:35:35,499
d over alpha 

629
00:35:35,499 --> 00:35:37,980
and beta. So this optimization problem, I 

630
00:35:37,980 --> 00:35:40,859
guess, is my duo problem. I 

631
00:35:40,859 --> 00:35:44,810
want you to compare this to our previous prime optimization problem. 

632
00:35:44,810 --> 00:35:47,499
The only difference is that 

633
00:35:47,499 --> 00:35:51,690
I took the max and min, and I switched the order around with the max and 

634
00:35:51,690 --> 00:35:56,199
min. That's the difference in the primal and the duo optimization [inaudible]. 

635
00:35:56,199 --> 00:36:00,210
And it turns out that 

636
00:36:00,210 --> 00:36:01,669


637
00:36:01,669 --> 00:36:04,730
it's a - it's sort of - it's a fact - it's 

638
00:36:04,730 --> 00:36:06,169
true, generally, that d 

639
00:36:06,169 --> 00:36:09,919
star is less than [inaudible] p star. In other words, I think I defined 

640
00:36:09,919 --> 00:36:15,609
p star previously. P star was a value of the prime optimization problem. 

641
00:36:15,609 --> 00:36:19,829
And in other words, that it's just generally true 

642
00:36:19,829 --> 00:36:25,150
that the max of the min of something is less than equal to the min of the max 

643
00:36:25,150 --> 00:36:27,609
of something. 

644
00:36:27,609 --> 00:36:29,199
And this is a 

645
00:36:29,199 --> 00:36:30,099
general fact. 

646
00:36:30,099 --> 00:36:33,169
And just as a concrete example, the 

647
00:36:33,169 --> 00:36:37,069
max over y in the set 01 x - 

648
00:36:37,069 --> 00:36:40,449
oh, excuse me, of the min of the 

649
00:36:40,449 --> 00:36:43,579
set in 01 

650
00:36:43,579 --> 00:36:47,939
of indicator x = 

651
00:36:47,939 --> 00:36:50,940
y - this is 

652
00:36:50,940 --> 00:36:57,940
[inaudible] equal to 

653
00:37:02,689 --> 00:37:09,669
min. 

654
00:37:09,669 --> 00:37:14,229
So this equality - this inequality actually holds true for any 

655
00:37:14,229 --> 00:37:15,759
function you might find in here. 

656
00:37:15,759 --> 00:37:18,059
And this is one specific example 

657
00:37:18,059 --> 00:37:20,670
where 

658
00:37:20,670 --> 00:37:24,289
the min over xy - excuse me, min over x of [inaudible] equals y - 

659
00:37:24,289 --> 00:37:27,009
this is always equal to 0 

660
00:37:27,009 --> 00:37:30,760
because whatever y is, you can choose x to be something different. So 

661
00:37:30,760 --> 00:37:32,539
this is always 0, 

662
00:37:32,539 --> 00:37:34,079
whereas if 

663
00:37:34,079 --> 00:37:37,680
I exchange the order to min 

664
00:37:37,680 --> 00:37:39,569
and max, then thing here is always equal to 

665
00:37:39,569 --> 00:37:41,669
1. So 0 [inaudible] to 

666
00:37:41,669 --> 00:37:44,700
1. And more generally, this min/max - excuse 

667
00:37:44,700 --> 00:37:50,909
me, this max/min, thus with the min/max holds true for any function you might put in 

668
00:37:50,909 --> 00:37:52,149
there. 

669
00:37:52,149 --> 00:37:57,779
But it turns out that sometimes under certain conditions, 

670
00:37:57,779 --> 00:37:59,349


671
00:37:59,349 --> 00:38:01,349


672
00:38:01,349 --> 00:38:01,839


673
00:38:01,839 --> 00:38:03,239


674
00:38:03,239 --> 00:38:06,949
these two optimization problems have the same value. Sometimes under certain 

675
00:38:06,949 --> 00:38:07,859
conditions, 

676
00:38:07,859 --> 00:38:10,110
the primal and the duo problems 

677
00:38:10,110 --> 00:38:12,349
have the same value. 

678
00:38:12,349 --> 00:38:14,559
And so 

679
00:38:14,559 --> 00:38:15,700
you might be able to solve 

680
00:38:15,700 --> 00:38:20,079
the duo problem rather than the primal problem. 

681
00:38:20,079 --> 00:38:22,290
And the reason to do that is that 

682
00:38:22,290 --> 00:38:26,289
sometimes, which we'll see in the optimal margin classifier problem, the support vector machine problem, 

683
00:38:26,289 --> 00:38:31,130
the duo problem turns out to be much easier than it - often has many useful properties that 

684
00:38:31,130 --> 00:38:33,749
will make user 

685
00:38:33,749 --> 00:38:40,749
compared to the primal. So for the sake of - 

686
00:38:42,199 --> 00:38:48,019
so 

687
00:38:48,019 --> 00:38:55,019
what 

688
00:39:01,549 --> 00:39:05,230
I'm going to do now is write down formally the certain conditions under which that's 

689
00:39:05,230 --> 00:39:09,449
true - where the primal and the duo problems are equivalent. 

690
00:39:09,449 --> 00:39:14,199
And so our strategy for working out the [inaudible] of support vector machine algorithm 

691
00:39:14,199 --> 00:39:17,849
will be that we'll write down the primal optimization problem, which we did 

692
00:39:17,849 --> 00:39:20,619
previously, and maximizing classifier. 

693
00:39:20,619 --> 00:39:21,780
And then we'll 

694
00:39:21,780 --> 00:39:24,470
derive the duo optimization problem for that. 

695
00:39:24,470 --> 00:39:26,489
And then we'll solve the duo problem. 

696
00:39:26,489 --> 00:39:29,279
And by modifying that a little bit, that's how we'll derive this support vector machine. But let me ask you - for 

697
00:39:29,279 --> 00:39:30,489


698
00:39:30,489 --> 00:39:33,639
now, let me just first, for 

699
00:39:33,639 --> 00:39:36,659
the sake of completeness, I just write down the conditions under which the primal 

700
00:39:36,659 --> 00:39:41,579
and the duo optimization problems give you the same solutions. So let f 

701
00:39:41,579 --> 00:39:45,170
be convex. If you're not 

702
00:39:45,170 --> 00:39:46,699


703
00:39:46,699 --> 00:39:50,249
sure what convex means, for the purposes of this class, you can take it to 

704
00:39:50,249 --> 00:39:53,319
mean that the 

705
00:39:53,319 --> 00:39:56,260
Hessian, h is positive. [Inaudible], so it just means it's 

706
00:39:56,260 --> 00:39:58,199
a [inaudible] function like that. 

707
00:39:58,199 --> 00:39:58,860
And 

708
00:39:58,860 --> 00:40:01,030
once you learn more about optimization 

709
00:40:01,030 --> 00:40:06,559
- again, please come to this week's discussion session taught by the TAs. 

710
00:40:06,559 --> 00:40:09,279
Then 

711
00:40:09,279 --> 00:40:11,309
suppose 

712
00:40:11,309 --> 00:40:12,520
hi 

713
00:40:12,520 --> 00:40:17,919
- the hi constraints [inaudible], and what that means is that hi of w equals alpha i 

714
00:40:17,919 --> 00:40:24,519
transpose w plus vi. This actually 

715
00:40:24,519 --> 00:40:28,509
means the same thing as linear. Without the term b here, we say 

716
00:40:28,509 --> 00:40:29,970
that hi is linear 

717
00:40:29,970 --> 00:40:34,079
where we have a constant interceptor as well. This is technically called [inaudible] other than 

718
00:40:34,079 --> 00:40:37,229
linear. 

719
00:40:37,229 --> 00:40:39,770
And let's suppose 

720
00:40:39,770 --> 00:40:42,599


721
00:40:42,599 --> 00:40:49,599
that gi's are strictly feasible. 

722
00:40:51,170 --> 00:40:54,049
And 

723
00:40:54,049 --> 00:40:57,029
what that means is that 

724
00:40:57,029 --> 00:41:00,869
there is just a value of the w such that 

725
00:41:00,869 --> 00:41:04,249
from i, 

726
00:41:04,249 --> 00:41:06,669
gi of w is less 

727
00:41:06,669 --> 00:41:10,519
than 0. Don't worry too much [inaudible]. I'm writing these things down for the sake of completeness, but don't worry too much about all the 

728
00:41:10,519 --> 00:41:12,939
technical details. 

729
00:41:12,939 --> 00:41:15,820
Strictly feasible, which just means that there's a value of w such that 

730
00:41:15,820 --> 00:41:19,260
all of these constraints are satisfy were stricter than the equality rather than what 

731
00:41:19,260 --> 00:41:21,739
less than equal to. 

732
00:41:21,739 --> 00:41:22,490


733
00:41:22,490 --> 00:41:25,130
Under these conditions, 

734
00:41:25,130 --> 00:41:26,180


735
00:41:26,180 --> 00:41:29,990
there were exists w star, 

736
00:41:29,990 --> 00:41:32,299
alpha 

737
00:41:32,299 --> 00:41:33,800
star, beta 

738
00:41:33,800 --> 00:41:39,639
star such that w star solves the primal problem. 

739
00:41:39,639 --> 00:41:41,849
And alpha star 

740
00:41:41,849 --> 00:41:48,849
and beta star, the Lagrange multipliers, solve the duo problem. 

741
00:41:51,829 --> 00:41:53,659


742
00:41:53,659 --> 00:41:58,679
And the value of the primal problem will be equal to the value of the duo problem will 

743
00:41:58,679 --> 00:42:03,649
be equal to the value of your Lagrange multiplier - excuse 

744
00:42:03,649 --> 00:42:06,189
me, will be equal to the value of your generalized Lagrange, the value 

745
00:42:06,189 --> 00:42:08,759
of that w star, alpha star, beta star. 

746
00:42:08,759 --> 00:42:13,339
In other words, you can solve either the primal or the duo problem. You get 

747
00:42:13,339 --> 00:42:14,899
the same 

748
00:42:14,899 --> 00:42:19,869
solution. Further, 

749
00:42:19,869 --> 00:42:22,530
your parameters will satisfy 

750
00:42:22,530 --> 00:42:24,139


751
00:42:24,139 --> 00:42:27,190


752
00:42:27,190 --> 00:42:32,119
these conditions. Partial derivative perspective parameters would be 

753
00:42:32,119 --> 00:42:33,339
0. And 

754
00:42:33,339 --> 00:42:36,559
actually, to keep this equation in mind, we'll actually use this in a second 

755
00:42:36,559 --> 00:42:38,630
when we take the Lagrange, and we - and 

756
00:42:38,630 --> 00:42:43,259
our support vector machine problem, and take a derivative with respect to w to solve a - 

757
00:42:43,259 --> 00:42:44,679
to solve our - 

758
00:42:44,679 --> 00:42:46,519
to derive our duo problem. We'll actually 

759
00:42:46,519 --> 00:42:50,480
perform this step ourselves in a second. 

760
00:42:50,480 --> 00:42:52,099


761
00:42:52,099 --> 00:42:58,289
Partial derivative with respect to the Lagrange multiplier beta is 

762
00:42:58,289 --> 00:43:01,789


763
00:43:01,789 --> 00:43:05,169
equal 

764
00:43:05,169 --> 00:43:06,269
to 

765
00:43:06,269 --> 00:43:08,439
0. 

766
00:43:08,439 --> 00:43:10,479
Turns out this will hold true, 

767
00:43:10,479 --> 00:43:11,869
too. 

768
00:43:11,869 --> 00:43:14,429
This is called 

769
00:43:14,429 --> 00:43:18,960


770
00:43:18,960 --> 00:43:20,549
the - 

771
00:43:20,549 --> 00:43:27,380
well 

772
00:43:27,380 --> 00:43:28,569
- 

773
00:43:28,569 --> 00:43:32,359
this is called the KKT complementary condition. 

774
00:43:32,359 --> 00:43:37,869
KKT stands for Karush-Kuhn-Tucker, which were the authors of this theorem. 

775
00:43:37,869 --> 00:43:41,630
Well, and by tradition, usually this [inaudible] KKT conditions. 

776
00:43:41,630 --> 00:43:48,630
But the other two are 

777
00:43:50,489 --> 00:43:53,239
- just so the [inaudible] is greater than 0, which we had 

778
00:43:53,239 --> 00:43:54,509
previously 

779
00:43:54,509 --> 00:44:01,509
and that your constraints are actually satisfied. 

780
00:44:02,919 --> 00:44:09,919
So let's see. [Inaudible] All 

781
00:44:21,789 --> 00:44:28,789
right. 

782
00:44:29,959 --> 00:44:34,170
So let's take those and apply this to our 

783
00:44:34,170 --> 00:44:37,349
optimal margin 

784
00:44:37,349 --> 00:44:40,689
optimization problem that we had previously. I 

785
00:44:40,689 --> 00:44:43,779
was gonna say one word about this, 

786
00:44:43,779 --> 00:44:46,809
which is - 

787
00:44:46,809 --> 00:44:49,369
was gonna say one word about this 

788
00:44:49,369 --> 00:44:50,880
KTT 

789
00:44:50,880 --> 00:44:52,379
complementary condition is 

790
00:44:52,379 --> 00:44:54,229
that a condition that is a - 

791
00:44:54,229 --> 00:45:00,839
at your solution, you must have that alpha star i times gi of w is equal to 0. 

792
00:45:00,839 --> 00:45:02,680
So 

793
00:45:02,680 --> 00:45:04,390
let's see. 

794
00:45:04,390 --> 00:45:08,789
So the product of two numbers is equal to 0. That means that 

795
00:45:08,789 --> 00:45:09,450


796
00:45:09,450 --> 00:45:12,459
at least one of these things must be equal to 

797
00:45:12,459 --> 00:45:15,989
0. For the product of two things to be equal to 0, well, that's just saying either alpha 

798
00:45:15,989 --> 00:45:17,230
or i 

799
00:45:17,230 --> 00:45:21,269
or gi is equal to 0. 

800
00:45:21,269 --> 00:45:25,759
So what that implies is that the - just Karush-Kuhn-Tucker 

801
00:45:25,759 --> 00:45:31,539


802
00:45:31,539 --> 00:45:38,410


803
00:45:38,410 --> 00:45:42,630
- most people just say KKT, but we wanna show you the right 

804
00:45:42,630 --> 00:45:45,589
spelling 

805
00:45:45,589 --> 00:45:46,890
of their names. So 

806
00:45:46,890 --> 00:45:47,800
KKT 

807
00:45:47,800 --> 00:45:52,119
complementary condition implies that if alpha 

808
00:45:52,119 --> 00:45:55,360
i is not 0, 

809
00:45:55,360 --> 00:46:00,559
that necessarily implies that 

810
00:46:00,559 --> 00:46:03,199
gi of w star 

811
00:46:03,199 --> 00:46:06,239
is equal to 

812
00:46:06,239 --> 00:46:10,959
0. 

813
00:46:10,959 --> 00:46:13,039
And 

814
00:46:13,039 --> 00:46:15,000
usually, 

815
00:46:15,000 --> 00:46:21,149


816
00:46:21,149 --> 00:46:23,629


817
00:46:23,629 --> 00:46:25,999
it turns out - so 

818
00:46:25,999 --> 00:46:28,649
all the KKT condition guarantees is that 

819
00:46:28,649 --> 00:46:32,939
at least one of them is 0. It may actually be the case that both 

820
00:46:32,939 --> 00:46:35,279
alpha and gi are both equal to 0. 

821
00:46:35,279 --> 00:46:39,170
But in practice, when you solve this optimization problem, you find that 

822
00:46:39,170 --> 00:46:42,880
to a large part, alpha i star is not equal to 0 if and only gi of 

823
00:46:42,880 --> 00:46:44,749
w star 0, 0. 

824
00:46:44,749 --> 00:46:50,449
This is not strictly true because it's possible that both of these may be 0. 

825
00:46:50,449 --> 00:46:51,720
But in practice, 

826
00:46:51,720 --> 00:46:55,119
when we - because when we solve problems like these, you're, for the most part, 

827
00:46:55,119 --> 00:46:59,319
usually exactly one of these will be non-0. 

828
00:46:59,319 --> 00:47:03,459
And also, when this holds true, when gi of w star is equal to 0, 

829
00:47:03,459 --> 00:47:05,819
we say that 

830
00:47:05,819 --> 00:47:06,540
gi 

831
00:47:06,540 --> 00:47:13,540
- gi of w, I guess, is an active 

832
00:47:16,279 --> 00:47:17,919
constraint 

833
00:47:17,919 --> 00:47:22,019
because we call a constraint - our constraint was a gi of w must be less than or equal to 

834
00:47:22,019 --> 00:47:23,399
0. 

835
00:47:23,399 --> 00:47:25,589
And so it is equal to 0, then 

836
00:47:25,589 --> 00:47:32,589
we say that that's a constraint that this is an active constraint. 

837
00:47:33,679 --> 00:47:37,410
Once we talk about [inaudible], we come back and [inaudible] and just extend this 

838
00:47:37,410 --> 00:47:42,589
idea a little bit more. 

839
00:47:42,589 --> 00:47:48,349
[Inaudible] board. [Inaudible] 

840
00:47:48,349 --> 00:47:52,159
turn 

841
00:47:52,159 --> 00:47:59,159
to this board in a second, but - 

842
00:48:07,269 --> 00:48:10,899
so let's go back and work out one of the primal and the duo optimization problems for 

843
00:48:10,899 --> 00:48:12,029


844
00:48:12,029 --> 00:48:17,329
our optimal margin classifier for the optimization problem that we worked on just now. As 

845
00:48:17,329 --> 00:48:20,379
a point of notation, 

846
00:48:20,379 --> 00:48:24,669
in whatever I've been writing down so far in deriving the KKT conditions, 

847
00:48:24,669 --> 00:48:28,189


848
00:48:28,189 --> 00:48:32,369
when Lagrange multipliers were alpha i and beta i, 

849
00:48:32,369 --> 00:48:35,600
it turns out that when applied as [inaudible] 

850
00:48:35,600 --> 00:48:37,020
dm, 

851
00:48:37,020 --> 00:48:41,089
turns out we only have one set of Lagrange multipliers alpha i. 

852
00:48:41,089 --> 00:48:43,859
And also, 

853
00:48:43,859 --> 00:48:47,419
as I was working out the KKT conditions, I used w 

854
00:48:47,419 --> 00:48:52,189
to denote the parameters of my primal optimization problem. [Inaudible] I wanted to 

855
00:48:52,189 --> 00:48:53,999
minimize f of w. In my 

856
00:48:53,999 --> 00:48:57,739
very first optimization problem, I had 

857
00:48:57,739 --> 00:48:58,989
that optimization problem [inaudible] finding the parameters 

858
00:48:58,989 --> 00:49:01,670
w. In my svn problem, I'm 

859
00:49:01,670 --> 00:49:04,640
actually gonna have two sets of parameters, w and b. So 

860
00:49:04,640 --> 00:49:08,479
this is just a - keep that 

861
00:49:08,479 --> 00:49:15,479
sort of slight notation change in mind. So 

862
00:49:15,960 --> 00:49:20,289
problem we worked out previously was we want to minimize the normal w squared and just add a 

863
00:49:20,289 --> 00:49:21,149
half there 

864
00:49:21,149 --> 00:49:25,719
by convention because it makes other work - math work a little 

865
00:49:25,719 --> 00:49:26,799
nicer. 

866
00:49:26,799 --> 00:49:32,829
And subject to the constraint that yi x w [inaudible] xi + v must be = greater 

868
00:49:40,949 --> 00:49:46,390
And so let me just take this constraint, and I'll rewrite it as a constraint. It's gi of w, 

869
00:49:46,390 --> 00:49:48,839
b. Again, previously, 

870
00:49:48,839 --> 00:49:52,559
I had gi 

871
00:49:52,559 --> 00:49:56,079
of w, but now I have parameters w and b. So 

872
00:49:56,079 --> 00:50:03,079
gi of w, b defined 

873
00:50:03,369 --> 00:50:09,589
as 1. 

874
00:50:09,589 --> 00:50:15,659
So 

875
00:50:15,659 --> 00:50:19,799
let's look at the implications of this in terms 

876
00:50:19,799 --> 00:50:25,380
of the KKT duo complementary condition again. 

877
00:50:25,380 --> 00:50:29,180
So we have that alpha i is basically equal to 

878
00:50:29,180 --> 00:50:32,309
0. That necessarily implies that 

879
00:50:32,309 --> 00:50:35,149
gi of w, b 

880
00:50:35,149 --> 00:50:42,149
is equal to 0. In other words, this is an active constraint. 

881
00:50:46,849 --> 00:50:53,849
And what does this mean? It means that it actually turns out gi of wb equal to 0 that 

882
00:50:54,460 --> 00:51:00,729
is - that means exactly that the training example xi, yi 

883
00:51:00,729 --> 00:51:07,729
has functional margin 

884
00:51:08,909 --> 00:51:10,289
equal to 1. 

885
00:51:10,289 --> 00:51:11,279
Because 

886
00:51:11,279 --> 00:51:15,049
this constraint was that 

887
00:51:15,049 --> 00:51:18,769
the functional margin of every example has to be greater equal to 

888
00:51:18,769 --> 00:51:22,139
1. And so if this is an active constraint, it just - 

889
00:51:22,139 --> 00:51:24,160
inequality holds that equality. 

890
00:51:24,160 --> 00:51:26,529
That means that my training example i 

891
00:51:26,529 --> 00:51:30,349
must have functional margin equal to exactly 1. And so - actually, yeah, 

892
00:51:30,349 --> 00:51:33,609
right now, I'll 

893
00:51:33,609 --> 00:51:39,689
do this on a different board, I guess. 

894
00:51:39,689 --> 00:51:46,689


895
00:51:57,239 --> 00:51:59,439
So in pictures, 

896
00:51:59,439 --> 00:52:03,650
what that means is that, you have some 

897
00:52:03,650 --> 00:52:10,650
training sets, and you'll 

898
00:52:10,729 --> 00:52:15,189
have some separating hyperplane. 

899
00:52:15,189 --> 00:52:19,469
And so the examples with functional margin equal to 1 

900
00:52:19,469 --> 00:52:23,299
will be exactly those which are - 

901
00:52:23,299 --> 00:52:25,549
so they're closest 

902
00:52:25,549 --> 00:52:27,799
to my separating hyperplane. So 

903
00:52:27,799 --> 00:52:30,669
that's 

904
00:52:30,669 --> 00:52:34,109
my equation. [Inaudible] equal to 0. 

905
00:52:34,109 --> 00:52:39,499
And so in this - in this cartoon example that I've done, it'll be 

906
00:52:39,499 --> 00:52:44,649
exactly 

907
00:52:44,649 --> 00:52:49,699
- these three 

908
00:52:49,699 --> 00:52:52,709
examples that have functional margin 

909
00:52:52,709 --> 00:52:54,399
equal to 1, 

910
00:52:54,399 --> 00:52:58,519
and all of the other examples as being further away than these 

911
00:52:58,519 --> 00:53:05,519
three will have functional margin that is strictly greater than 1. 

912
00:53:05,929 --> 00:53:08,109
And 

913
00:53:08,109 --> 00:53:11,029
the examples with functional margin equal to 1 will usually correspond to the 

914
00:53:11,029 --> 00:53:15,429


915
00:53:15,429 --> 00:53:21,589


916
00:53:21,589 --> 00:53:22,979
ones where 

917
00:53:22,979 --> 00:53:27,049
the corresponding Lagrange multipliers also not equal to 0. And again, it may not 

918
00:53:27,049 --> 00:53:29,009
hold true. It may be the case that 

919
00:53:29,009 --> 00:53:32,679
gi and alpha i equal to 0. But usually, when gi's not - 

920
00:53:32,679 --> 00:53:35,739
is 0, alpha i will be non-0. 

921
00:53:35,739 --> 00:53:39,380
And so the examples of functional margin equal to 1 will be the ones where alpha i is not equal 

922
00:53:39,380 --> 00:53:46,380
to 0. One 

923
00:53:46,920 --> 00:53:48,969
useful property of this is that 

924
00:53:48,969 --> 00:53:52,839
as suggested by this picture and so true in general as well, it 

925
00:53:52,839 --> 00:53:57,069
turns out that we find a solution to this - to the optimization problem, 

926
00:53:57,069 --> 00:54:01,509
you find that relatively few training examples have functional margin equal to 1. In 

927
00:54:01,509 --> 00:54:02,829
this picture I've drawn, 

928
00:54:02,829 --> 00:54:05,859
there are three examples with functional margin equal to 1. There 

929
00:54:05,859 --> 00:54:09,449
are just few examples of this minimum possible distance to your separating hyperplane. 

930
00:54:09,449 --> 00:54:11,479


931
00:54:11,479 --> 00:54:13,779
And these are three - 

932
00:54:13,779 --> 00:54:18,420
these examples of functional margin equal to 1 - they 

933
00:54:18,420 --> 00:54:21,369
are what we're going to call 

934
00:54:21,369 --> 00:54:26,169
the support vectors. And 

935
00:54:26,169 --> 00:54:29,839
this needs the name support vector machine. There'll be these three points with functional margin 

936
00:54:29,839 --> 00:54:30,809
1 

937
00:54:30,809 --> 00:54:35,229
that we're calling support vectors. 

938
00:54:35,229 --> 00:54:36,749
And 

939
00:54:36,749 --> 00:54:40,089
the fact that they're relatively few support vectors also means that 

940
00:54:40,089 --> 00:54:40,919
usually, 

941
00:54:40,919 --> 00:54:43,319
most of the alpha i's are equal to 

942
00:54:43,319 --> 00:54:44,909
0. So with alpha i equal 

943
00:54:44,909 --> 00:54:51,909
to 

944
00:54:52,599 --> 00:54:54,239
0, 

945
00:54:54,239 --> 00:55:01,239
for examples, though, not support vectors. 

946
00:55:03,349 --> 00:55:06,769
Let's go ahead and work out the actual 

947
00:55:06,769 --> 00:55:08,369
optimization problem. 

948
00:55:08,369 --> 00:55:12,169


949
00:55:12,169 --> 00:55:19,169


950
00:55:28,499 --> 00:55:29,369
So 

951
00:55:29,369 --> 00:55:31,779
we have a [inaudible] margin 

952
00:55:31,779 --> 00:55:33,129
optimization problem. 

953
00:55:33,129 --> 00:55:34,009


954
00:55:34,009 --> 00:55:38,599
So there we go and write down the margin, 

955
00:55:38,599 --> 00:55:39,249
and 

956
00:55:39,249 --> 00:55:43,599
because we only have inequality constraints where we really have gi star 

957
00:55:43,599 --> 00:55:47,059
constraints, no hi star constraint. We have 

958
00:55:47,059 --> 00:55:49,739
inequality constraints and no equality constraints, 

959
00:55:49,739 --> 00:55:51,599
I'll only have 

960
00:55:51,599 --> 00:55:52,990
Lagrange multipliers of type 

961
00:55:52,990 --> 00:55:57,489
alpha - no betas in my generalized Lagrange. But 

962
00:55:57,489 --> 00:56:00,519


963
00:56:00,519 --> 00:56:02,579
my Lagrange will be 

964
00:56:02,579 --> 00:56:09,579
one-half w squared minus. 

965
00:56:14,739 --> 00:56:17,579


966
00:56:17,579 --> 00:56:19,829
That's my 

967
00:56:19,829 --> 00:56:26,729
Lagrange. 

968
00:56:26,729 --> 00:56:29,389
And so let's work out what the duo problem is. 

969
00:56:29,389 --> 00:56:33,479
And to do that, I need to figure out what theta d of alpha - and I know again, beta's there 

970
00:56:33,479 --> 00:56:37,239
- so what theta d of alpha 

971
00:56:37,239 --> 00:56:43,829
is min with 

972
00:56:43,829 --> 00:56:48,419
respect to wb of lb alpha. So the duo problem is the maximize theta d as the function of alpha. 

973
00:56:48,419 --> 00:56:54,999
So as to work out what theta d is, and then that'll give us our duo problem. 

974
00:56:54,999 --> 00:56:58,090
So then to work out what this is, what do you need to do? We need to 

975
00:56:58,090 --> 00:57:01,930
take a look at Lagrange and minimize it as a function of lv and b 

976
00:57:01,930 --> 00:57:02,709
so - and what is this? How do you 

977
00:57:02,709 --> 00:57:05,079
minimize Lagrange? So in order to 

978
00:57:05,079 --> 00:57:05,890


979
00:57:05,890 --> 00:57:09,079
minimize the Lagrange as a function of w and b, 

980
00:57:09,079 --> 00:57:11,199
we do the usual thing. We 

981
00:57:11,199 --> 00:57:13,119
take the derivatives of w - 

982
00:57:13,119 --> 00:57:16,819
Lagrange with respect to w and b. And we set that to 0. That's how we 

983
00:57:16,819 --> 00:57:20,209
minimize the Lagrange with respect 

984
00:57:20,209 --> 00:57:25,709
to w and b. So take the derivative with respect to w of the Lagrange. 

985
00:57:25,709 --> 00:57:27,829
And 

986
00:57:27,829 --> 00:57:34,829
I want - I just write down the answer. You know how to do calculus like this. 

987
00:57:38,189 --> 00:57:41,439
So I wanna minimize this function of w, so I take the derivative and set it 

988
00:57:41,439 --> 00:57:42,839
to 0. And 

989
00:57:42,839 --> 00:57:44,950
I get that. And 

990
00:57:44,950 --> 00:57:51,950
then so this implies that w 

991
00:57:55,940 --> 00:57:58,789
must be that. 

992
00:57:58,789 --> 00:58:00,779
And so 

993
00:58:00,779 --> 00:58:05,189
w, therefore, is actually a linear combination of your input feature vectors xi. 

994
00:58:05,189 --> 00:58:06,779
This is 

995
00:58:06,779 --> 00:58:09,570
sum of your various weights given by the alpha i's and times 

996
00:58:09,570 --> 00:58:13,069
the xi's, which are your examples in your training set. 

997
00:58:13,069 --> 00:58:15,939
And this will be useful later. The 

998
00:58:15,939 --> 00:58:22,939
other equation we have is - here, partial derivative of 

999
00:58:23,139 --> 00:58:27,900
Lagrange with respect to p is equal to minus sum 

1000
00:58:27,900 --> 00:58:33,819
of i plus 1 to m [inaudible] for 

1001
00:58:33,819 --> 00:58:35,219
i. 

1002
00:58:35,219 --> 00:58:37,199
And so I'll just set that to equal to 

1003
00:58:37,199 --> 00:58:40,439
0. And so these are my two constraints. 

1004
00:58:40,439 --> 00:58:42,699
And so 

1005
00:58:42,699 --> 00:58:48,489


1006
00:58:48,489 --> 00:58:49,769
[inaudible]. 

1007
00:58:49,769 --> 00:58:53,650
So what I'm going to do is I'm actually going to take these two constraints, 

1008
00:58:53,650 --> 00:58:57,579
and well, I'm going to take whatever I thought to be the value for w. 

1009
00:58:57,579 --> 00:59:00,379
And I'm 

1010
00:59:00,379 --> 00:59:01,309
gonna 

1011
00:59:01,309 --> 00:59:04,070
take what I've worked out to be the 

1012
00:59:04,070 --> 00:59:04,769
value for w, and 

1013
00:59:04,769 --> 00:59:06,979
I'll plug it back in there 

1014
00:59:06,979 --> 00:59:09,169
to figure out what the Lagrange really is 

1015
00:59:09,169 --> 00:59:13,989
when I minimize with respect to w. [Inaudible] and I'll 

1016
00:59:13,989 --> 00:59:20,989
deal with b in a second. 

1017
00:59:27,869 --> 00:59:34,869
So 

1018
00:59:37,629 --> 00:59:41,109
let's see. So my Lagrange is 1/2 

1019
00:59:41,109 --> 00:59:44,479
w transpose w minus. 

1020
00:59:44,479 --> 00:59:51,199


1021
00:59:51,199 --> 00:59:58,079


1022
00:59:58,079 --> 01:00:03,529
So this first term, w transpose w 

1023
01:00:03,529 --> 01:00:05,939
- this becomes 

1024
01:00:05,939 --> 01:00:11,039
sum y equals one to m, alpha i, yi, 

1025
01:00:11,039 --> 01:00:18,039


1026
01:00:20,749 --> 01:00:24,859
xi transpose. This is just putting in the value for w that I worked out previously. 

1027
01:00:24,859 --> 01:00:27,880
But since this is w transpose w - 

1028
01:00:27,880 --> 01:00:32,239
and so when they expand out of this quadratic function, and when I plug in w 

1029
01:00:32,239 --> 01:00:34,469
over there as well, 

1030
01:00:34,469 --> 01:00:36,329
I 

1031
01:00:36,329 --> 01:00:38,759
find 

1032
01:00:38,759 --> 01:00:40,849


1033
01:00:40,849 --> 01:00:42,869


1034
01:00:42,869 --> 01:00:49,539
that 

1035
01:00:49,539 --> 01:00:55,189


1036
01:00:55,189 --> 01:00:58,499
I 

1037
01:00:58,499 --> 01:01:00,150
have 

1038
01:01:00,150 --> 01:01:03,969


1039
01:01:03,969 --> 01:01:08,529


1040
01:01:08,529 --> 01:01:10,979


1041
01:01:10,979 --> 01:01:17,979
that. 

1042
01:01:20,689 --> 01:01:22,469
Oh, 

1043
01:01:22,469 --> 01:01:28,589
where I'm using these angle brackets to denote end product, so this 

1044
01:01:28,589 --> 01:01:35,179
thing here, it just means the end product, xi transpose 

1045
01:01:35,179 --> 01:01:36,259
xj. And 

1046
01:01:36,259 --> 01:01:40,299
the first and second terms are actually the same except for the minus one half. So to 

1047
01:01:40,299 --> 01:01:41,689
simplify to be 

1048
01:01:41,689 --> 01:01:44,989
equal to 

1049
01:01:44,989 --> 01:01:51,259


1050
01:01:51,259 --> 01:01:58,259


1051
01:02:08,239 --> 01:02:12,549
that. 

1052
01:02:12,549 --> 01:02:15,689
So 

1053
01:02:15,689 --> 01:02:18,749
let me go ahead and 

1054
01:02:18,749 --> 01:02:22,239
call this w of alpha. 

1055
01:02:22,239 --> 01:02:23,449


1056
01:02:23,449 --> 01:02:29,469


1057
01:02:29,469 --> 01:02:36,469


1058
01:02:46,779 --> 01:02:51,919
My duo problem is, therefore, the following. I want to maximize w 

1059
01:02:51,919 --> 01:02:55,839
of alpha, which is that [inaudible]. 

1060
01:02:55,839 --> 01:02:58,380
And 

1061
01:02:58,380 --> 01:03:02,429
I want to the - I realize the notation is somewhat 

1062
01:03:02,429 --> 01:03:07,259
unfortunate. I'm using capital W of alpha to denote that formula I wrote down earlier. 

1063
01:03:07,259 --> 01:03:13,529
And then we also had our lowercase w. The original [inaudible] is the primal 

1064
01:03:13,529 --> 01:03:16,349
problem. Lowercase w transpose xi. So 

1065
01:03:16,349 --> 01:03:18,059
uppercase and lowercase w 

1066
01:03:18,059 --> 01:03:20,609
are totally different 

1067
01:03:20,609 --> 01:03:27,160
things, so unfortunately, the notation is standard as well, as far as I know, 

1068
01:03:27,160 --> 01:03:29,929
so. So the duo problem is 

1069
01:03:29,929 --> 01:03:33,119
that subject to the alpha [inaudible] related to 0, 

1070
01:03:33,119 --> 01:03:37,239
and 

1071
01:03:37,239 --> 01:03:39,439
we also have that the 

1072
01:03:39,439 --> 01:03:40,619
sum of i, 

1073
01:03:40,619 --> 01:03:43,819
yi, alpha i is related to 0. 

1074
01:03:43,819 --> 01:03:46,380
That last constraint 

1075
01:03:46,380 --> 01:03:49,589
was the constraint I got from 

1076
01:03:49,589 --> 01:03:52,299
this - the 

1077
01:03:52,299 --> 01:03:55,020
sum of i - sum of i, yi alpha i equals to 0. But that's where 

1078
01:03:55,020 --> 01:03:59,619
that [inaudible] came 

1079
01:03:59,619 --> 01:04:01,839
from. Let 

1080
01:04:01,839 --> 01:04:02,380
me just 

1081
01:04:02,380 --> 01:04:05,619
- I think in previous years that I taught this, 

1082
01:04:05,619 --> 01:04:09,199
where this constraint comes from is just - is slightly confusing. So let 

1083
01:04:09,199 --> 01:04:12,910
me just take two minutes to say what the real interpretation of that is. And if you 

1084
01:04:12,910 --> 01:04:15,769
don't understand it, it's 

1085
01:04:15,769 --> 01:04:18,069
not a big deal, I guess. 

1086
01:04:18,069 --> 01:04:21,789
So when we took the partial derivative of the Lagrange with 

1087
01:04:21,789 --> 01:04:22,669
respect to b, 

1088
01:04:22,669 --> 01:04:28,219
we end up with this constraint that sum of i, yi, alpha i must be equal to 0. 

1089
01:04:28,219 --> 01:04:34,219
The interpretation of that, it turns out, is that if sum of i, yi, alpha i 

1090
01:04:34,219 --> 01:04:37,410
is not equal to 

1091
01:04:37,410 --> 01:04:40,920
0, then 

1092
01:04:40,920 --> 01:04:42,809


1093
01:04:42,809 --> 01:04:49,809


1094
01:04:51,019 --> 01:04:53,210
theta d of wb 

1095
01:04:53,210 --> 01:04:55,659
is - 

1096
01:04:55,659 --> 01:04:59,829
actually, excuse me. 

1097
01:04:59,829 --> 01:05:01,559


1098
01:05:01,559 --> 01:05:02,779


1099
01:05:02,779 --> 01:05:06,259
Then theta d of alpha is equal to 

1100
01:05:06,259 --> 01:05:08,499


1101
01:05:08,499 --> 01:05:13,179
minus infinity for minimizing. 

1102
01:05:13,179 --> 01:05:16,279
So in other words, it turns out my Lagrange is 

1103
01:05:16,279 --> 01:05:20,479
actually a linear function of my parameters b. And so the interpretation of 

1104
01:05:20,479 --> 01:05:22,890
that constraint we worked out previously was that if sum of i or yi, alpha i 

1105
01:05:22,890 --> 01:05:25,820
is not equal to 0, then 

1106
01:05:25,820 --> 01:05:28,889
theta d of alpha is equal to minus infinity. 

1107
01:05:28,889 --> 01:05:32,469
And so if your goal is to 

1108
01:05:32,469 --> 01:05:36,839
maximize as a function of alpha, theta 

1109
01:05:36,839 --> 01:05:38,339
d of alpha, 

1110
01:05:38,339 --> 01:05:44,599
then you've gotta choose values of alpha for which sum of yi alpha is equal to 0. 

1111
01:05:44,599 --> 01:05:50,969
And then when sum of yi alpha is equal to 0, then 

1112
01:05:50,969 --> 01:05:54,189


1113
01:05:54,189 --> 01:05:56,099


1114
01:05:56,099 --> 01:06:00,949
theta d of 

1115
01:06:00,949 --> 01:06:03,999
alpha is equal to w of alpha. 

1116
01:06:03,999 --> 01:06:08,690
And so that's why we ended up deciding to maximize w of alpha subject to 

1117
01:06:08,690 --> 01:06:14,319
that sum of yi alpha is equal to 0. 

1118
01:06:14,319 --> 01:06:18,929
Yeah, the - unfortunately, the fact of that d would be [inaudible] 

1119
01:06:18,929 --> 01:06:21,619
adds just a little bit of extra notation in our 

1120
01:06:21,619 --> 01:06:22,989
derivation of the duo. But 

1121
01:06:22,989 --> 01:06:27,319
by the way, and [inaudible] all the action of the optimization problem is with w 

1122
01:06:27,319 --> 01:06:32,809
because b is just one parameter. 

1123
01:06:32,809 --> 01:06:39,809
So let's check. Are there any questions about this? Okay, cool. 

1124
01:06:47,069 --> 01:06:49,499
So 

1125
01:06:49,499 --> 01:06:53,729
what derived a duo optimization problem - and really, don't worry about this 

1126
01:06:53,729 --> 01:06:56,469
if you're not quite sure where this was. Just think of this as 

1127
01:06:56,469 --> 01:06:59,959
we worked out this constraint, and we worked out, and we took partial derivative with 

1128
01:06:59,959 --> 01:07:01,309
respect to b, 

1129
01:07:01,309 --> 01:07:06,049
that this constraint has the [inaudible] and so I just copied that over here. But so - worked out 

1130
01:07:06,049 --> 01:07:11,129
the duo of the optimization problem, 

1131
01:07:11,129 --> 01:07:16,069
so our approach to finding - to deriving the optimal margin classifier or support vector 

1132
01:07:16,069 --> 01:07:16,679
machine 

1133
01:07:16,679 --> 01:07:19,409
will be that we'll solve 

1134
01:07:19,409 --> 01:07:26,409
along this duo optimization problem for the parameters alpha 

1135
01:07:28,109 --> 01:07:29,169
star. 

1136
01:07:29,169 --> 01:07:30,509
And then 

1137
01:07:30,509 --> 01:07:34,149
if you want, you can then - this is the equation that we worked out on 

1138
01:07:34,149 --> 01:07:36,459
the previous board. We said that 

1139
01:07:36,459 --> 01:07:43,459
w - this [inaudible] alpha - w must be equal to 

1140
01:07:44,369 --> 01:07:46,349
that. 

1141
01:07:46,349 --> 01:07:47,690
And so 

1142
01:07:47,690 --> 01:07:53,309
once you solve for alpha, you can then go back and quickly derive 

1143
01:07:53,309 --> 01:07:57,389
w in parameters to your primal problem. And we worked this out earlier. 

1144
01:07:57,389 --> 01:07:58,209


1145
01:07:58,209 --> 01:08:00,429
And moreover, 

1146
01:08:00,429 --> 01:08:05,409
once you solve alpha and w, you can then focus back into your - once you solve for alpha and w, 

1147
01:08:05,409 --> 01:08:06,819


1148
01:08:06,819 --> 01:08:10,419
it's really easy to solve for v, so 

1149
01:08:10,419 --> 01:08:13,979


1150
01:08:13,979 --> 01:08:15,269
that b gives us the interpretation of [inaudible] 

1151
01:08:15,269 --> 01:08:19,559
training set, and you found the direction for w. So you know where your separating 

1152
01:08:19,559 --> 01:08:22,079
hyperplane's direction is. You know it's got to be 

1153
01:08:22,079 --> 01:08:25,059
one of these things. 

1154
01:08:25,059 --> 01:08:28,499
And you know the orientation and separating hyperplane. You just have to 

1155
01:08:28,499 --> 01:08:31,000
decide where to place 

1156
01:08:31,000 --> 01:08:33,289
this hyperplane. And that's what solving b is. 

1157
01:08:33,289 --> 01:08:36,969
So once you solve for alpha and w, it's really easy to solve b. 

1158
01:08:36,969 --> 01:08:40,409
You can plug alpha and w back into the 

1159
01:08:40,409 --> 01:08:42,779
primal optimization problem 

1160
01:08:42,779 --> 01:08:45,649


1161
01:08:45,649 --> 01:08:49,899


1162
01:08:49,899 --> 01:08:56,899
and solve for b. 

1163
01:09:02,329 --> 01:09:05,140
And I just wrote it down 

1164
01:09:05,140 --> 01:09:12,140
for the sake of completeness, 

1165
01:09:14,859 --> 01:09:19,890


1166
01:09:19,890 --> 01:09:21,299


1167
01:09:21,299 --> 01:09:23,279
but 

1168
01:09:23,279 --> 01:09:27,939


1169
01:09:27,939 --> 01:09:29,609
- and the 

1170
01:09:29,609 --> 01:09:36,199
intuition behind this formula is just that find the worst positive 

1171
01:09:36,199 --> 01:09:37,309
[inaudible] and the 

1172
01:09:37,309 --> 01:09:41,920
worst negative example. Let's say 

1173
01:09:41,920 --> 01:09:44,779
this one and this one - say [inaudible] and [inaudible] the difference between them. And 

1174
01:09:44,779 --> 01:09:45,990
that tells you where you should 

1175
01:09:45,990 --> 01:09:47,969
set the threshold for 

1176
01:09:47,969 --> 01:09:52,509
where to place the separating hyperplane. 

1177
01:09:52,509 --> 01:09:53,410
And then 

1178
01:09:53,410 --> 01:09:54,829
that's the - 

1179
01:09:54,829 --> 01:09:57,749
this is the optimal margin classifier. This is also called a support vector 

1180
01:09:57,749 --> 01:10:02,329
machine. If you do not use one y [inaudible], it's called kernels. And I'll say a few words 

1181
01:10:02,329 --> 01:10:04,320
about that. But I 

1182
01:10:04,320 --> 01:10:08,080
hope the process is clear. It's a duo problem. We're going to solve the duo 

1183
01:10:08,080 --> 01:10:09,789
problem for the alpha i's. 

1184
01:10:09,789 --> 01:10:11,360
That gives us w, and that gives 

1185
01:10:11,360 --> 01:10:13,589
us b. 

1186
01:10:13,589 --> 01:10:16,649
So 

1187
01:10:16,649 --> 01:10:21,709
there's just one more thing I wanna point out as I lead into the next lecture, 

1188
01:10:21,709 --> 01:10:23,070
which is that - I'll just 

1189
01:10:23,070 --> 01:10:28,440
write this out again, 

1190
01:10:28,440 --> 01:10:29,250
I guess - 

1191
01:10:29,250 --> 01:10:31,359
which is that it turns out 

1192
01:10:31,359 --> 01:10:33,530
we can take the entire algorithm, 

1193
01:10:33,530 --> 01:10:37,530
and we can express the entire algorithm in terms of inner products. And here's what I 

1194
01:10:37,530 --> 01:10:40,730
mean by that. So 

1195
01:10:40,730 --> 01:10:42,249
say that the parameters w 

1196
01:10:42,249 --> 01:10:45,469
is the sum of your input examples. 

1197
01:10:45,469 --> 01:10:49,160
And so we need to make a prediction. 

1198
01:10:49,160 --> 01:10:54,659
Someone gives you a new value of x. You want a value of the hypothesis on the value of x. 

1199
01:10:54,659 --> 01:10:58,090
That's given by g of w transpose x plus b, or 

1200
01:10:58,090 --> 01:11:02,869
where g was this threshold function that outputs minus 1 or plus 1. 

1201
01:11:02,869 --> 01:11:06,380
And so you need to compute w transpose x plus b. 

1202
01:11:06,380 --> 01:11:11,300
And that is equal to alpha i, 

1203
01:11:11,300 --> 01:11:18,300
yi. 

1204
01:11:19,739 --> 01:11:24,029
And that can be expressed as a sum of these inner products between 

1205
01:11:24,029 --> 01:11:26,070
your training examples 

1206
01:11:26,070 --> 01:11:33,070
and this new value of x [inaudible] value [inaudible]. And this will 

1207
01:11:33,849 --> 01:11:39,329
lead into our next lecture, which is the idea of kernels. 

1208
01:11:39,329 --> 01:11:39,929
And 

1209
01:11:39,929 --> 01:11:44,209
it turns out that in the source of feature spaces where used to support vector 

1210
01:11:44,209 --> 01:11:45,039
machines - 

1211
01:11:45,039 --> 01:11:52,039
it turns out that sometimes your training examples may be very high-dimensional. It may even be the case 

1212
01:11:53,730 --> 01:11:58,329
that the features that you want to use 

1213
01:11:58,329 --> 01:11:59,579
are 

1214
01:11:59,579 --> 01:12:04,129
inner-dimensional feature vectors. 

1215
01:12:04,129 --> 01:12:11,129
But despite this, it'll turn out that there'll be an interesting representation that 

1216
01:12:11,229 --> 01:12:12,879
you can use 

1217
01:12:12,879 --> 01:12:14,820
that will allow you 

1218
01:12:14,820 --> 01:12:18,719
to compute inner products like these efficiently. 

1219
01:12:18,719 --> 01:12:25,719


1220
01:12:25,780 --> 01:12:29,179
And this holds true only for certain feature spaces. It doesn't hold true for arbitrary sets 

1221
01:12:29,179 --> 01:12:30,479
of features. 

1222
01:12:30,479 --> 01:12:33,609
But we talk about the idea of 

1223
01:12:33,609 --> 01:12:35,320
kernels. In the next lecture, we'll 

1224
01:12:35,320 --> 01:12:37,809
see examples where 

1225
01:12:37,809 --> 01:12:41,249
even though you have extremely high-dimensional feature vectors, you can compute 

1226
01:12:41,249 --> 01:12:45,829
- you may never want to represent xi, x plus [inaudible] inner-dimensional 

1227
01:12:45,829 --> 01:12:48,459
feature vector. You can even store in computer memory. 

1228
01:12:48,459 --> 01:12:51,769
But you will nonetheless be able to compute inner products between different 

1229
01:12:51,769 --> 01:12:53,839
[inaudible] feature vectors very efficiently. 

1230
01:12:53,839 --> 01:12:57,419
And so you can - for example, you can make predictions by making use of these inner 

1231
01:12:57,419 --> 01:12:58,790
products. 

1232
01:12:58,790 --> 01:13:02,129
This is just xi 

1233
01:13:02,129 --> 01:13:03,809
transpose. 

1234
01:13:03,809 --> 01:13:08,570
You will compute these inner products very efficiently and, therefore, make predictions. 

1235
01:13:08,570 --> 01:13:11,109
And this pointed also - the other 

1236
01:13:11,109 --> 01:13:14,599
reason we derive the duo was because 

1237
01:13:14,599 --> 01:13:18,300
on this board, when we worked out what w of alpha is, w of alpha 

1238
01:13:18,300 --> 01:13:23,620
- actually are the same property - w of alpha is again 

1239
01:13:23,620 --> 01:13:26,699
written in terms of these inner products. 

1240
01:13:26,699 --> 01:13:29,040
And so if you 

1241
01:13:29,040 --> 01:13:33,429
actually look at the duo optimization problem and step - for all the steps of the 

1242
01:13:33,429 --> 01:13:33,999
algorithm, 

1243
01:13:33,999 --> 01:13:36,929
you'll find that you actually do everything you want - learn the parameters of 

1244
01:13:36,929 --> 01:13:38,169
alpha. So 

1245
01:13:38,169 --> 01:13:41,989
suppose you do an optimization problem, go into parameters alpha, and you do everything you want 

1246
01:13:41,989 --> 01:13:46,519
without ever needing to represent xi directly. And all you need to do 

1247
01:13:46,519 --> 01:13:53,519
is represent this compute inner products with your feature vectors like these. Well, 

1248
01:13:53,879 --> 01:13:56,400
one last property of 

1249
01:13:56,400 --> 01:13:58,159
this algorithm that's kinda nice is that 

1250
01:13:58,159 --> 01:13:59,900
I said previously 

1251
01:13:59,900 --> 01:14:00,929
that 

1252
01:14:00,929 --> 01:14:07,389
the alpha i's are 0 only for the - are non-0 only for the support vectors, 

1253
01:14:07,389 --> 01:14:08,690
only for the vectors 

1254
01:14:08,690 --> 01:14:10,400
that function y [inaudible] 1. 

1255
01:14:10,400 --> 01:14:13,540
And in practice, there are usually fairly few of them. 

1256
01:14:13,540 --> 01:14:17,360
And so what this means is that if you're representing w this way, 

1257
01:14:17,360 --> 01:14:18,320
then 

1258
01:14:18,320 --> 01:14:22,870
w when represented as a fairly small fraction of training examples 

1259
01:14:22,870 --> 01:14:25,139
because mostly alpha i's is 0 - 

1260
01:14:25,139 --> 01:14:27,180
and so when you're summing up 

1261
01:14:27,180 --> 01:14:28,279
the sum, 

1262
01:14:28,279 --> 01:14:32,359
you need to compute inner products only if the support vectors, which is 

1263
01:14:32,359 --> 01:14:36,139
usually a small fraction of your training set. So that's another nice [inaudible] 

1264
01:14:36,139 --> 01:14:39,320
because [inaudible] alpha is 

1265
01:14:39,320 --> 01:14:40,999
0. And well, 

1266
01:14:40,999 --> 01:14:44,639
much of this will make much more sense when we talk about kernels. [Inaudible] quick 

1267
01:14:44,639 --> 01:14:51,639
questions 

1268
01:14:51,959 --> 01:14:52,920


1269
01:14:52,920 --> 01:14:57,989
before I close? Yeah. Student:It seems that for anything we've done the work, the point file has to be really well 

1270
01:14:57,989 --> 01:14:59,560
behaved, and if any of the points are kinda on the wrong side - Instructor (Andrew Ng):No, oh, yeah, so again, for today's lecture asks you that 

1271
01:14:59,560 --> 01:15:03,889
the data is linearly separable - that you can actually get perfect 

1272
01:15:03,889 --> 01:15:10,889
[inaudible]. I'll fix this in the next lecture as well. But excellent assumption. Yes? Student:So can't we assume that [inaudible] 

1273
01:15:10,909 --> 01:15:13,380
point [inaudible], so [inaudible] 

1274
01:15:13,380 --> 01:15:17,579
have 

1275
01:15:17,579 --> 01:15:19,300
[inaudible]? Instructor (Andrew Ng):Yes, so unless I - says that 

1276
01:15:19,300 --> 01:15:23,429
there are ways to generalize this in multiple classes that I probably won't [inaudible] - 

1277
01:15:23,429 --> 01:15:26,429
but yeah, that's generalization [inaudible]. 

1278
01:15:26,429 --> 01:15:27,870
Okay. Let's close for today, then. 

1279
01:15:27,870 --> 01:15:29,300
We'll talk about kernels in our next lecture.