This subtitle is provided from lecsub.jimdo.com under a Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

What I want to do today is continue our discussion of Naive Bayes, which is the learning algorithm that I started to discuss in the previous lecture, and talk about a couple of different event models in Naive Bayes. Then I'll take a brief digression to talk about neural networks, which is something that I actually won't spend a lot of time on, and then I want to start to talk about support vector machines. The support vector machine is the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people that hold it, and we'll start discussing that today; this will actually take us a few lectures to complete. So let's talk about Naive Bayes.
To recap from the previous lecture, I started off describing spam classification as the most [inaudible] example for Naive Bayes, in which we would create feature vectors like these that correspond to words in a dictionary. And so, based on what words appear in a piece of email, it was represented as a feature vector with ones and zeros in the corresponding places. And Naive Bayes was a generative learning algorithm; by that I mean it's an algorithm in which we model P(x|y), and for Naive Bayes specifically, we modeled it as the product from i = 1 to n of P(x_i|y), and we also model P(y), and then we use Bayes' rule to combine these two together. And so for our predictions, when you're given a new piece of email and you want to tell if it's spam or not spam, you predict arg max over y of P(y|x), which by Bayes' rule is arg max over y of P(x|y) times P(y), okay?
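To make the recap concrete, here is a minimal sketch of that arg max prediction rule for the binary-feature (multivariate Bernoulli) model; it assumes the parameters have already been estimated, and all the names are illustrative:

```python
import numpy as np

def predict_naive_bayes(x, phi_x_given_y1, phi_x_given_y0, phi_y):
    """Predict spam (1) or not-spam (0) for a binary feature vector x.

    phi_x_given_y1[i] = P(x_i = 1 | y = 1), phi_x_given_y0[i] = P(x_i = 1 | y = 0),
    and phi_y = P(y = 1).  We compare P(x|y=1)P(y=1) against P(x|y=0)P(y=0),
    working in log space to avoid underflow.
    """
    x = np.asarray(x)
    log_p1 = np.sum(np.log(np.where(x == 1, phi_x_given_y1, 1 - phi_x_given_y1))) + np.log(phi_y)
    log_p0 = np.sum(np.log(np.where(x == 1, phi_x_given_y0, 1 - phi_x_given_y0))) + np.log(1 - phi_y)
    return 1 if log_p1 > log_p0 else 0
```

Working in log space matters because with a 50,000-word dictionary, the product of that many small probabilities would underflow to zero in floating point.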
So this is Naive Bayes, and just to draw attention to two things: one is that in this model, each of our features was zero or one, indicating whether different words appear, and the length n of the feature vector was the number of words in the dictionary. So it might be, in this version, on the order of 50,000 words, say. What I want to do now is describe two variations on this algorithm. The first one is the simpler one; it's just a generalization to the case where x_i takes on more values. So one thing that's commonly done is to apply Naive Bayes to problems where some of these features x_i take on k values rather than just two values, and in that case you actually build a very similar model where P(x|y) is really the same thing, except that now these are going to be multinomial probabilities rather than Bernoullis, because the x_i's can take on up to k values.
It turns out one situation where this arises very commonly is if you have a feature that's actually continuous-valued, and you choose to discretize it: you take the continuous-valued feature and discretize it into a finite set of k values. And a perfect example, if you remember our very first supervised learning problem, is predicting the price of houses. Suppose you have a classification problem on these houses: based on features of a house, you want to predict whether or not the house will be sold in the next six months, say. That's a classification problem, and once you use Naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do would be to take the continuous-valued living area and just discretize it into a few discrete buckets. And so, depending on whether the living area of the house is less than 500 square feet, or between 1,000 and 1,500 square feet, and so on, or whether it's greater than 2,000 square feet, you choose the value of the corresponding feature x_i to be one, two, three, or four, okay?
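A minimal sketch of this discretization step, assuming NumPy and some hand-chosen bucket edges (the exact edges below are illustrative assumptions, not the lecture's):

```python
import numpy as np

# Illustrative bucket edges in square feet; the lecture hand-picks a few
# buckets, but the precise boundaries here are assumptions for the example.
edges = np.array([500.0, 1000.0, 1500.0])

def discretize_living_area(sqft):
    """Map a continuous living area to a discrete feature value in {1, 2, 3, 4}."""
    # np.digitize returns how many edges are <= sqft, i.e. the bucket index 0..3
    return int(np.digitize(sqft, edges)) + 1
```

The resulting integer can then be used directly as a multinomial feature value in the model above.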
So that was the first variation or generalization of Naive Bayes I wanted to talk about. I should just check: are there questions about this? Okay, cool. And so it turns out that in practice, it's fairly common to use about ten buckets to discretize a continuous-valued feature; I drew four here only to save on writing. The second and, sort of, final variation that I want to talk about for Naive Bayes is a variation that's specific to classifying text documents, or, more generally, to classifying sequences. A text document, like a piece of email, you can think of as a sequence of words, and you can apply the sort of model I'm about to describe to classifying other sequences as well, but let me just focus on text, and here's the idea. In the Naive Bayes model, given a piece of email, we were representing it using this binary vector-valued representation, and one of the things this loses, for instance, is the number of times that different words appear. So, for example, suppose some word appears a lot of times, and you see the word, you know, "buy" a lot of times. Or you see the word "Viagra"; it seems to be a common email example.
You see the word Viagra a ton of times in the email; it is then more likely to be spam than if it appears, I guess, only once, because even once, I guess, is enough. So let me describe a different, what's called an event model, for Naive Bayes that will take into account the number of times a word appears in the email. And to give this previous model a name as well: this particular model for text classification is called the multivariate Bernoulli event model. It's not a great name; don't worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but really, don't worry about what the name means. In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the multinomial event model. Again, there is a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called the multinomial event model; it's just called that.
And here's what we're going to do: given a piece of email, I'm going to represent my email as a feature vector, and so my i-th training example x^(i) will be a feature vector (x_1^(i), x_2^(i), ..., x_{n_i}^(i)), where n_i is equal to the number of words in this email, right? So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector (let me just write this as x subscript j) will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to what word appeared in the j-th position of my email, okay? So, in other words, I'm going to take all the words in my email, and I have a feature vector that just says which word in my dictionary each word in the email was, okay?
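As a sketch of this representation, with a hypothetical miniature dictionary standing in for the 50,000-word one (and, as a simplification not in the lecture, silently skipping out-of-dictionary words):

```python
# A hypothetical miniature dictionary; a real one would have ~50,000 entries.
dictionary = {"buy": 0, "now": 1, "meeting": 2, "tomorrow": 3}

def email_to_features(words):
    """Represent an email as a list of dictionary indices, one per word position.

    The length of the result varies with the email, unlike the fixed-length
    binary vector of the multivariate Bernoulli model.
    """
    return [dictionary[w] for w in words if w in dictionary]
```

Note that repeated words produce repeated indices, which is exactly the information the binary representation threw away.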
So a different definition for n_i now: n_i now varies and is different for every training example, and this x_j is now an index into the dictionary. You know, the components of the feature vector are no longer binary random variables; they're these indices into the dictionary that take on a much larger set of values. And so our generative model for this will be that the joint distribution over x and y is P(x, y) = (product from j = 1 to n of P(x_j|y)) times P(y), where again n is now the length of the email, all right? So the way to think about this formula is you imagine that there was some probability distribution over emails; there's some random process that generates the emails, and that process proceeds as follows: first, y is chosen, first the class label, whether someone is going to send you spam email or not spam emails, is chosen for us.
So first y, the random variable y, the class label of spam or not spam, is generated, and then, having decided whether they send you spam or not spam, someone iterates over all 300 positions of the email, or however many words are going to compose the email, and generates words from some distribution that depends on whether they chose to send you spam or not spam. So if they sent you spam, they'll tend to generate words like, you know, buy, and Viagra, and whatever, at discounts, sale, whatever. And if somebody chose to send you not spam, then they'll send you, sort of, the more normal words you get in an email, okay? So just be careful, right? x_i here has a very different definition from the previous event model, and n has a very different definition from the previous event model. And so the parameters of this model are, let's see:
Phi subscript k given y = 1, which is the probability that, you know, conditioned on someone deciding to send you spam, the next word they choose to email you in the spam email is going to be word k; and similarly, phi subscript k given y = 0, sort of, the same thing for non-spam (well, I'll just write it out, I guess); and phi_y, just the same as before, okay? So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of the parameters will be equal to (and now I'm going to write one of these, you know, big indicator-function things again) a sum over your training set of the indicator of whether that email was spam, times the sum over all the words in that email, where n_i is the number of words in email i in your training set, of the indicator that x_j^(i) equals k:

phi_{k|y=1} = ( sum_{i=1}^{m} 1{y^(i) = 1} sum_{j=1}^{n_i} 1{x_j^(i) = k} ) / ( sum_{i=1}^{m} 1{y^(i) = 1} n_i ), okay?
So the numerator says: sum over all your emails, but take into account only the emails that had class label one, only the emails that were spam, because if y equals zero, then this indicator is zero and the term goes away; and then sum over all the words in your spam email, and count up the number of times you observed the word k in your spam emails. So, in other words, the numerator is: look at all the spam emails in your training set and count up the total number of times the word k appeared in these emails. The denominator then is: sum over i in your training set, and whenever one of your examples is spam, you know, sum up the length of that spam email; so the denominator is the total length of all of the spam emails in your training set. And so the ratio is just: out of all your spam emails, what is the fraction of words in all your spam emails that were word k? And that's your estimate for the probability of the next piece of spam mail generating the word k in any given position, okay?
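The estimate just described (count of word k across all spam emails, divided by the total length of all spam emails) can be sketched as follows; the function name and data layout are illustrative:

```python
import numpy as np

def mle_phi_k_given_spam(emails, labels, vocab_size):
    """Maximum likelihood estimate of phi_{k|y=1} in the multinomial event model.

    emails: list of lists of word indices (one inner list per email)
    labels: list of 0/1 spam labels, one per email
    Returns an array whose k-th entry is
    (occurrences of word k in spam emails) / (total words in spam emails).
    """
    counts = np.zeros(vocab_size)
    total = 0
    for words, y in zip(emails, labels):
        if y == 1:                 # only spam emails contribute to the sums
            for k in words:
                counts[k] += 1
            total += len(words)
    return counts / total
```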
At the end of the previous lecture, I talked about Laplace smoothing, and so when you do that as well, you add one to the numerator and k to the denominator, and this is the Laplace-smoothed estimate of this parameter, okay? And similarly, you can work out the estimates for the other parameters yourself, okay? So it's very similar. Yeah?

Student: I'm sorry. On the right, on the top, I was just wondering what the x_i is, and what the n...

Instructor (Andrew Ng): Right. So in this second event model, the definition for x_i and the definition for n are different, right? So here, well, this is for one example (x, y). So here, n is the number of words in a given email, right? And if it's the i-th email, we subscript it, so this is n subscript i, and so n will be different for different training examples. And here x_i will take, you know, these values from 1 to 50,000, and x_i is essentially the identity of the i-th word in a given piece of email, okay?
So that's why this is a product over all the different words of your email of the probability of the i-th word in your email, conditioned on y. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, no, actually, you know what, I apologize. I just realized that I overloaded the notation, right; I shouldn't have used k here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, I'm sorry, you're absolutely right. Thank you. All right. So in Laplace smoothing, that shouldn't be k. This should be, you know, 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didn't translate it properly. So in Laplace smoothing here, what you add to the denominator is the number of possible values that the random variable x_i can take on. Cool. Raise your hand if this makes sense? Okay. Are there more questions about this?
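The corrected rule just settled on (add one to each count, and add the number of possible values of x_i, e.g. 50,000 for the dictionary, to the denominator) might be sketched like this for a generic multinomial; the function name is illustrative:

```python
import numpy as np

def laplace_smoothed_estimate(observations, num_values):
    """Laplace-smoothed estimate of a multinomial distribution:
    P(x = k) = (count of k + 1) / (total observations + num_values),
    so every outcome gets nonzero probability even if never observed."""
    counts = np.bincount(observations, minlength=num_values)
    return (counts + 1) / (len(observations) + num_values)
```

This is what keeps the spam classifier from assigning probability zero to a dictionary word that happens not to appear in any training email.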
Yeah.

Student: On Laplace smoothing, in the denominator, the plus term is the number of values that y could take?

Instructor (Andrew Ng): Yeah, let's see. So Laplace smoothing is a method to give you, sort of, hopefully, better estimates of a probability distribution over a multinomial. And so, was I using x or y in the previous lecture? In trying to estimate the probability over a multinomial... I think x and y are different. Was it x or y? I think it was x, actually. Well, oh, I see, right, right. I think I was using a different definition for the random variable, because suppose you have a multinomial random variable x which takes on (let's use a different alphabet) L different values. Then the maximum likelihood estimate for the probability of x, P(x = k), will be equal to, right, the number of observations: the maximum likelihood estimate for the probability of x being equal to k will be the number of observations of x = k divided by the total number of observations of x, okay? So that's the maximum likelihood estimate.
And to add Laplace smoothing to this, you, sort of, add one to the numerator, and you add L to the denominator, where L is the number of possible values that x can take on. So, in this case, this is the probability that x equals k, and x can take on 50,000 values if 50,000 is the length of your dictionary (it may be something else), but that's why I add 50,000 to the denominator. Are there other questions? Yeah.

Student: Is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple times, and all the examples make sense, but I don't know what the, like, general formula for it is.

Instructor (Andrew Ng): I see. Yeah, right. So the question is, what's the definition of the maximum likelihood estimate? So actually, in today's lecture and the previous lecture, when I talked about Gaussian Discriminant Analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work these out is to actually write down the likelihood.
350 00:20:29,070 --> 00:20:32,850 So the way to figure out all of these maximum likelihood estimates is to write 351 00:20:32,850 --> 00:20:33,730 down 352 00:20:33,730 --> 00:20:35,190 the likelihood of 353 00:20:35,190 --> 00:20:41,170 the parameters, phi K given Y being 354 00:20:41,170 --> 00:20:43,810 zero, phi 355 00:20:43,810 --> 00:20:44,450 Y, 356 00:20:44,450 --> 00:20:47,090 right? And so given a training set, 357 00:20:47,090 --> 00:20:50,350 the likelihood, I guess, I should be writing log likelihood 358 00:20:50,350 --> 00:20:54,740 will be the log of the product of I equals one to 359 00:20:54,740 --> 00:20:55,910 N, PFXI, YI, 360 00:20:55,910 --> 00:20:57,950 you 361 00:20:57,950 --> 00:21:04,950 know, parameterized by these things, okay? 362 00:21:06,940 --> 00:21:09,450 Where PFXI, YI, right, is given by NI, PFX, YJ given YI. They are 363 00:21:09,450 --> 00:21:12,010 parameterized 364 00:21:12,010 --> 00:21:13,500 by 365 00:21:13,500 --> 00:21:19,299 366 00:21:19,299 --> 00:21:21,520 367 00:21:21,520 --> 00:21:26,930 - well, 368 00:21:26,930 --> 00:21:28,140 369 00:21:28,140 --> 00:21:31,690 I'll just drop the parameters to write this 370 00:21:31,690 --> 00:21:38,690 371 00:21:42,370 --> 00:21:49,200 more simply - oh, I just put it in - times PFYI, okay? 372 00:21:49,200 --> 00:21:50,970 So this is my log likelihood, 373 00:21:50,970 --> 00:21:55,090 and so the way you get the maximum likelihood estimate of the parameters 374 00:21:55,090 --> 00:21:55,809 is you - 375 00:21:55,809 --> 00:21:59,750 so if given a fixed training set, given a set of fixed IYI's, 376 00:21:59,750 --> 00:22:02,180 you maximize this in terms of 377 00:22:02,180 --> 00:22:05,740 these parameters, and then you get the maximum likelihood estimates that I've been writing 378 00:22:05,740 --> 00:22:07,530 out. 
So in a previous section of today's lecture, I wrote out some maximum likelihood estimates for the Gaussian Discriminant Analysis model and for Naive Bayes, and I didn't prove them, but you get to, sort of, play with that yourself in the homework problems as well, for one of these models, and you'll be able to verify that when you maximize the likelihood, maximize the log likelihood, hopefully you do get the same formulas as what I was drawing up on the board; but the way these are derived is by maximizing this, okay? Cool. All right. So that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that for text classification, the multinomial event model, the last Naive Bayes model, almost always does better than the first Naive Bayes model I talked about when you're applying it to the specific case of text classification, and one of the reasons hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document,
whereas the former model doesn't. I should say that, in truth, it actually turns out not to be completely understood why the latter model does better than the former one for text classification, and, sort of, researchers are still debating about it a little bit, but if you ever have a text classification problem, you know, the Naive Bayes classifier is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.

Student: So the second model is still position invariant, right? It doesn't actually care where the words are.

Instructor (Andrew Ng): Yes, all right.

Student: And, I mean, [inaudible] if my model, like, you had exclamation in, does that usually do better if you have enough data?

Instructor (Andrew Ng): Yeah, so the question is about the second model, right? The second model, the multinomial event model, actually doesn't care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing.
426 00:24:26,990 --> 00:24:27,770 So 427 00:24:27,770 --> 00:24:30,929 in natural language processing, there's actually another name; it's called a unigram 428 00:24:30,929 --> 00:24:32,850 model in natural language processing, 429 00:24:32,850 --> 00:24:35,529 and there's some other models like, sort of, 430 00:24:35,529 --> 00:24:39,190 say, higher-order Markov models that take into account some of the ordering 431 00:24:39,190 --> 00:24:44,080 of the words. It turns out that for text classification, the models 432 00:24:44,080 --> 00:24:45,260 like the 433 00:24:45,260 --> 00:24:49,290 bigram models or trigram models, 434 00:24:49,290 --> 00:24:53,360 I believe they do only very slightly better, if at all, 435 00:24:53,360 --> 00:25:00,360 but that's when you're applying them to text classification, okay? All right. 436 00:25:03,550 --> 00:25:07,190 So the next thing I want to talk about is to start again the discussion of 437 00:25:07,190 --> 00:25:11,560 non-linear classifiers. So 438 00:25:11,560 --> 00:25:18,560 it turns out 439 00:25:21,670 --> 00:25:27,590 - well, and so the very first 440 00:25:27,590 --> 00:25:31,220 classification algorithm we talked about was logistic regression, which 441 00:25:31,220 --> 00:25:33,419 had the following form for 442 00:25:33,419 --> 00:25:36,340 its hypothesis, 443 00:25:36,340 --> 00:25:37,520 and 444 00:25:37,520 --> 00:25:39,220 you can think of this as 445 00:25:39,220 --> 00:25:41,190 predicting one when 446 00:25:41,190 --> 00:25:44,420 this estimated probability is greater or equal to 0.5 and 447 00:25:44,420 --> 00:25:46,180 predicting zero, 448 00:25:46,180 --> 00:25:46,839 right, 449 00:25:46,839 --> 00:25:50,050 when this is less than 0.5, 450 00:25:50,050 --> 00:25:52,760 and given a training set, right? 
451 00:25:52,760 --> 00:25:59,760 Logistic regression 452 00:26:02,549 --> 00:26:05,919 will maybe do gradient descent or something or 453 00:26:05,919 --> 00:26:07,440 use Newton's method 454 00:26:07,440 --> 00:26:09,370 to find a straight line that 455 00:26:09,370 --> 00:26:12,130 reasonably separates the positive and negative classes. 456 00:26:12,130 --> 00:26:15,830 But sometimes a data set just can't be separated by a straight line, so is there 457 00:26:15,830 --> 00:26:16,950 an algorithm 458 00:26:16,950 --> 00:26:18,230 that will let you 459 00:26:18,230 --> 00:26:25,230 start to learn these sorts of non-linear division boundaries? 460 00:26:25,540 --> 00:26:29,230 And so how do you go about getting a non-linear classifier? And, by the 461 00:26:29,230 --> 00:26:34,360 way, one cool result is that remember when I said - when we talked 462 00:26:34,360 --> 00:26:37,790 about generative learning algorithms, I said that 463 00:26:37,790 --> 00:26:41,040 if you assume X given Y equals one is 464 00:26:41,040 --> 00:26:43,710 exponential family, 465 00:26:43,710 --> 00:26:45,330 466 00:26:45,330 --> 00:26:46,669 right, with natural parameter 467 00:26:46,669 --> 00:26:50,730 eta one, and if you build a generative learning algorithm using this, right, and 468 00:26:50,730 --> 00:26:55,850 X given Y equals 469 00:26:55,850 --> 00:26:59,710 zero is exponential family 470 00:26:59,710 --> 00:27:01,750 with natural parameter eta zero, 471 00:27:01,750 --> 00:27:05,610 right - I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true, 472 00:27:05,610 --> 00:27:06,210 then 473 00:27:06,210 --> 00:27:08,610 you end up with a logistic posterior. It 474 00:27:08,610 --> 00:27:11,410 actually turns out that a Naive Bayes model 475 00:27:11,410 --> 00:27:12,629 actually falls into this 476 00:27:12,629 --> 00:27:17,440 as well. 
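The decision rule the lecture just described can be sketched in a few lines. This is a minimal sketch only; the particular feature vector and parameters used below are illustrative assumptions, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    # the logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # hypothesis: estimated probability that y = 1 given x
    return sigmoid(theta @ x)

def predict(theta, x):
    # predict 1 iff h(x) >= 0.5, which happens iff theta^T x >= 0
    return 1 if theta @ x >= 0 else 0
```

Note that the 0.5 threshold on the probability and the zero threshold on theta transpose X are the same rule, since the sigmoid crosses 0.5 exactly at z = 0.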
So the Naive Bayes model actually falls into this exponential family as well, and, 477 00:27:17,440 --> 00:27:19,370 therefore, 478 00:27:19,370 --> 00:27:23,000 under the Naive Bayes model, you're actually 479 00:27:23,000 --> 00:27:27,080 using a linear classifier as well, okay? 480 00:27:27,080 --> 00:27:29,020 So the question is 481 00:27:29,020 --> 00:27:32,480 how can you start to get non-linear classifiers? 482 00:27:32,480 --> 00:27:35,649 And I'm going to talk about one method today 483 00:27:35,649 --> 00:27:41,840 which is - and 484 00:27:41,840 --> 00:27:45,300 we started to talk about it very briefly which is 485 00:27:45,300 --> 00:27:46,650 taking a 486 00:27:46,650 --> 00:27:48,990 simpler algorithm like 487 00:27:48,990 --> 00:27:50,720 logistic regression 488 00:27:50,720 --> 00:27:57,720 and using it to build up to more complex non-linear classifiers, okay? So 489 00:27:58,390 --> 00:28:00,440 to motivate this discussion, 490 00:28:00,440 --> 00:28:05,340 I'm going to use the little picture - let's see. So suppose you have features X1, 491 00:28:05,340 --> 00:28:07,340 X2, and X3, 492 00:28:07,340 --> 00:28:10,320 and so by convention, I'm gonna follow 493 00:28:10,320 --> 00:28:13,460 our earlier convention that X0 is set to one, 494 00:28:13,460 --> 00:28:17,010 and so I'm gonna use a little diagram like this 495 00:28:17,010 --> 00:28:18,090 to denote 496 00:28:18,090 --> 00:28:19,690 our 497 00:28:19,690 --> 00:28:24,320 logistic regression unit, okay? 
So 498 00:28:24,320 --> 00:28:28,609 think of a little picture like that, you know, this little circle as denoting a 499 00:28:28,609 --> 00:28:29,419 computational node 500 00:28:29,419 --> 00:28:32,630 that takes as input, you know, several features and then outputs another 501 00:28:32,630 --> 00:28:36,090 number that's H subscript theta of X, given by 502 00:28:36,090 --> 00:28:37,800 a sigmoid function, 503 00:28:37,800 --> 00:28:39,930 and so this little computational unit - 504 00:28:39,930 --> 00:28:43,460 well, will have parameters theta. 505 00:28:43,460 --> 00:28:47,700 Now, in order to get non-linear division boundaries, all we need to do - 506 00:28:47,700 --> 00:28:50,850 well, at least one thing to do is just 507 00:28:50,850 --> 00:28:52,780 come up with 508 00:28:52,780 --> 00:28:54,390 509 00:28:54,390 --> 00:28:57,020 a way to represent hypotheses 510 00:28:57,020 --> 00:29:00,640 that can output non-linear division boundaries, right, 511 00:29:00,640 --> 00:29:02,700 and so 512 00:29:02,700 --> 00:29:07,490 this is - 513 00:29:07,490 --> 00:29:09,750 when you put a bunch of those little pictures 514 00:29:09,750 --> 00:29:11,390 that I drew 515 00:29:11,390 --> 00:29:12,790 on the previous board, 516 00:29:12,790 --> 00:29:14,169 you can then get 517 00:29:14,169 --> 00:29:18,549 what's called 518 00:29:18,549 --> 00:29:20,780 a Neural Network in which you 519 00:29:20,780 --> 00:29:23,710 think of having my features here 520 00:29:23,710 --> 00:29:30,710 and then I would feed them to 521 00:29:31,019 --> 00:29:34,530 say a few of these little 522 00:29:34,530 --> 00:29:38,940 sigmoidal units, 523 00:29:38,940 --> 00:29:43,559 and these together will feed into yet another sigmoidal unit, say, 524 00:29:43,559 --> 00:29:46,270 which will output 525 00:29:46,270 --> 00:29:51,510 my final output H subscript theta of X, okay? 
And 526 00:29:51,510 --> 00:29:54,500 just to give these things names, let me call 527 00:29:54,500 --> 00:29:58,380 the values output by these three intermediate sigmoidal units; let me call them A1, 528 00:29:58,380 --> 00:30:01,360 A2, A3. 529 00:30:01,360 --> 00:30:02,580 And let me just be 530 00:30:02,580 --> 00:30:05,679 completely concrete about what this formula represents, right? So 531 00:30:05,679 --> 00:30:06,570 532 00:30:06,570 --> 00:30:08,309 each of these 533 00:30:08,309 --> 00:30:12,080 units in the middle will have their own associated set of parameters, 534 00:30:12,080 --> 00:30:15,820 and so the value A1 will be computed as 535 00:30:15,820 --> 00:30:17,940 G 536 00:30:17,940 --> 00:30:21,440 of X transpose, 537 00:30:21,440 --> 00:30:27,440 and then some set of parameters, which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta 538 00:30:27,440 --> 00:30:29,090 two, 539 00:30:29,090 --> 00:30:30,980 and A3 will be 540 00:30:30,980 --> 00:30:34,490 G of X transpose, 541 00:30:34,490 --> 00:30:35,700 theta three, 542 00:30:35,700 --> 00:30:42,700 where G is the sigmoid function, all right? So G of Z, 543 00:30:43,640 --> 00:30:49,850 and then, finally, our hypothesis will output 544 00:30:49,850 --> 00:30:52,360 G of 545 00:30:52,360 --> 00:30:53,950 A 546 00:30:53,950 --> 00:30:56,620 transpose 547 00:30:56,620 --> 00:30:59,200 theta four, right? Where, you 548 00:30:59,200 --> 00:31:01,320 know, this A vector 549 00:31:01,320 --> 00:31:04,390 is a vector of A1, 550 00:31:04,390 --> 00:31:05,780 A2, A3. 551 00:31:05,780 --> 00:31:08,220 We can append another one to it up front 552 00:31:08,220 --> 00:31:11,470 if you want, okay? Let 553 00:31:11,470 --> 00:31:14,270 me 554 00:31:14,270 --> 00:31:16,210 just draw up here this - I'm 555 00:31:16,210 --> 00:31:18,470 sorry about the cluttered board. 
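The forward computation just spelled out on the board can be written down directly. This is a sketch in the lecture's notation, assuming three hidden units; the array shapes are my assumptions.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta_hidden, theta4):
    # x: input features, with the convention x0 = 1 already included
    # Theta_hidden: rows are theta_1, theta_2, theta_3, one per hidden unit
    a = sigmoid(Theta_hidden @ x)   # a_j = g(theta_j^T x) for each hidden unit
    return sigmoid(theta4 @ a)      # h_theta(x) = g(theta_4^T a)
```

Each hidden unit is just a logistic regression unit with its own parameter vector, and the output unit is another one stacked on top of the hidden units' outputs.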
556 00:31:18,470 --> 00:31:20,950 And so H subscript theta of X, 557 00:31:20,950 --> 00:31:23,850 this is a function 558 00:31:23,850 --> 00:31:29,300 of all the parameters theta one 559 00:31:29,300 --> 00:31:32,420 through theta four, 560 00:31:32,420 --> 00:31:36,680 and so one way to learn parameters for this model 561 00:31:36,680 --> 00:31:37,650 is to 562 00:31:37,650 --> 00:31:39,910 write down the cost function, 563 00:31:39,910 --> 00:31:41,180 say, J of theta 564 00:31:41,180 --> 00:31:46,040 equals one-half sum from I equals one to 565 00:31:46,040 --> 00:31:49,730 M, YI minus H subscript theta of 566 00:31:49,730 --> 00:31:51,770 XI 567 00:31:51,770 --> 00:31:54,450 squared, say. Okay, 568 00:31:54,450 --> 00:31:56,070 so that's our familiar 569 00:31:56,070 --> 00:31:58,280 quadratic cost function, 570 00:31:58,280 --> 00:31:59,980 and so 571 00:31:59,980 --> 00:32:03,990 one way to learn the parameters of an algorithm like this is to just use gradient 572 00:32:03,990 --> 00:32:04,840 descent 573 00:32:04,840 --> 00:32:08,809 to minimize J of theta as a function of theta, okay? So you use 574 00:32:08,809 --> 00:32:10,940 gradient descent 575 00:32:10,940 --> 00:32:14,610 to minimize this squared error, which 576 00:32:14,610 --> 00:32:18,050 stated differently means you use gradient descent to make the predictions of your neural 577 00:32:18,050 --> 00:32:20,330 network as close as possible 578 00:32:20,330 --> 00:32:27,330 to what you observed as the labels in your training 579 00:32:30,550 --> 00:32:34,520 set, okay? 
So it turns out gradient descent on this neural network has a specific name; the 580 00:32:34,520 --> 00:32:35,059 algorithm 581 00:32:35,059 --> 00:32:38,789 that implements gradient descent is called back propagation, and so if you ever hear that, all that 582 00:32:38,789 --> 00:32:41,240 means is - it just means gradient descent on 583 00:32:41,240 --> 00:32:45,170 a cost function like this or a variation of this on the neural network 584 00:32:45,170 --> 00:32:47,909 that looks like that, 585 00:32:47,909 --> 00:32:50,220 and - 586 00:32:50,220 --> 00:32:54,360 well, this algorithm actually has some advantages and disadvantages, but let me 587 00:32:54,360 --> 00:32:57,830 actually show you. So, let's see. 588 00:32:57,830 --> 00:33:00,729 One of the interesting things about the neural network is that 589 00:33:00,729 --> 00:33:05,430 you can look at what these intermediate nodes are computing, right? So 590 00:33:05,430 --> 00:33:08,270 this neural network has what's called 591 00:33:08,270 --> 00:33:11,610 a hidden layer before you then have the output layer, 592 00:33:11,610 --> 00:33:14,410 and, more generally, you can actually have inputs feed 593 00:33:14,410 --> 00:33:16,279 into these 594 00:33:16,279 --> 00:33:19,690 computation units, feed into more layers of computation units, to even more layers, to 595 00:33:19,690 --> 00:33:23,370 more layers, and then finally you have an output layer at the end. 596 00:33:23,370 --> 00:33:26,970 And one cool thing you can do is look at all of these intermediate units, look at 597 00:33:26,970 --> 00:33:28,280 these 598 00:33:28,280 --> 00:33:30,980 units in what's called a hidden layer of the neural network. Don't 599 00:33:30,980 --> 00:33:32,580 worry about why it's called that. Look 600 00:33:32,580 --> 00:33:34,489 at computations of the hidden unit 601 00:33:34,489 --> 00:33:39,570 and ask what is the hidden unit in the neural network computing? 
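As a concrete, if inefficient, stand-in for back propagation, the same quadratic cost can be minimized with numerically estimated gradients; real back propagation computes exactly these gradients analytically via the chain rule. The network size, learning rate, and data below are illustrative assumptions, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    # hidden activations a = g(W1 x), output h_theta(x) = g(w2^T a)
    return sigmoid(w2 @ sigmoid(W1 @ x))

def cost(W1, w2, X, y):
    # J(theta) = 1/2 * sum_i (y_i - h_theta(x_i))^2
    return 0.5 * sum((yi - forward(xi, W1, w2)) ** 2 for xi, yi in zip(X, y))

def descend(W1, w2, X, y, lr=0.3, eps=1e-5, steps=100):
    # gradient descent on J(theta) using central-difference gradients;
    # back propagation would produce the same update more efficiently
    for _ in range(steps):
        for P in (W1, w2):          # update each parameter array in place
            G = np.zeros_like(P)
            for idx in np.ndindex(P.shape):
                old = P[idx]
                P[idx] = old + eps
                hi = cost(W1, w2, X, y)
                P[idx] = old - eps
                lo = cost(W1, w2, X, y)
                P[idx] = old
                G[idx] = (hi - lo) / (2 * eps)
            P -= lr * G
    return W1, w2
```

Because the cost is non-convex in the parameters, this procedure finds a local optimum, which foreshadows the point made below about neural network training.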
So 602 00:33:39,570 --> 00:33:42,900 to, maybe, get a better sense of what neural networks might be doing, let me show you a 603 00:33:42,900 --> 00:33:44,580 video - I'm gonna switch to the laptop - this is made 604 00:33:44,580 --> 00:33:46,250 by a 605 00:33:46,250 --> 00:33:49,360 friend, Yann LeCun, 606 00:33:49,360 --> 00:33:52,420 who's currently a professor at New York University. 607 00:33:52,420 --> 00:33:56,289 Can I show a video on the laptop? 608 00:33:56,289 --> 00:33:58,600 So let me show you a video 609 00:33:58,600 --> 00:34:00,440 from Yann 610 00:34:00,440 --> 00:34:06,330 LeCun on a neural network that he developed for handwritten digit recognition. 611 00:34:06,330 --> 00:34:10,110 There was one other thing he did in this neural network that I'm not gonna talk about called 612 00:34:10,110 --> 00:34:12,389 a Convolutional Neural Network 613 00:34:12,389 --> 00:34:16,009 that - well, 614 00:34:16,009 --> 00:34:18,809 his system is called LeNet, 615 00:34:18,809 --> 00:34:22,569 and let's see. Would you put 616 00:34:22,569 --> 00:34:23,360 on the 617 00:34:23,360 --> 00:34:30,360 laptop display? Hum, 618 00:34:37,779 --> 00:34:38,789 actually maybe if - 619 00:34:38,789 --> 00:34:42,339 or you can just put on the screen on the side; that would work too 620 00:34:42,339 --> 00:34:49,339 if the big screen isn't working. 621 00:35:12,249 --> 00:35:13,729 Let's see. I'm 622 00:35:13,729 --> 00:35:19,299 just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come 623 00:35:19,299 --> 00:35:24,209 on? Well, let me say a few more things about neural networks. 
624 00:35:24,209 --> 00:35:25,529 So it turns out that 625 00:35:25,529 --> 00:35:29,039 when you write a quadratic cost function like I 626 00:35:29,039 --> 00:35:31,689 wrote down on the chalkboard just now, 627 00:35:31,689 --> 00:35:34,639 it turns out that unlike logistic regression, 628 00:35:34,639 --> 00:35:39,029 that will almost always correspond to a non-convex optimization problem, 629 00:35:39,029 --> 00:35:42,709 and so whereas for logistic regression if you run gradient descent 630 00:35:42,709 --> 00:35:44,280 or Newton's method or whatever, 631 00:35:44,280 --> 00:35:46,000 you converge to the global optimum. 632 00:35:46,000 --> 00:35:50,500 This is not true for neural networks. In general, there are lots of local optima 633 00:35:50,500 --> 00:35:55,209 and, sort of, a much harder optimization problem. 634 00:35:55,209 --> 00:35:56,159 So 635 00:35:56,159 --> 00:35:59,699 neural networks, if you're, sort of, familiar with them, and you're good at making design choices 636 00:35:59,699 --> 00:36:01,599 like what learning rate to use, and 637 00:36:01,599 --> 00:36:04,319 how many hidden units to use, and so on, you can, 638 00:36:04,319 --> 00:36:07,700 sort of, get them to be fairly effective, 639 00:36:07,700 --> 00:36:09,109 and 640 00:36:09,109 --> 00:36:12,910 there's, sort of, often ongoing debates about, you know, is this learning algorithm 641 00:36:12,910 --> 00:36:13,819 better, or is that learning 642 00:36:13,819 --> 00:36:17,869 algorithm better? The vast majority of machine learning researchers today seem to perceive 643 00:36:17,869 --> 00:36:20,180 support vector machines, which is what I'll talk about later, 644 00:36:20,180 --> 00:36:24,349 to be a much more effective off-the-shelf learning algorithm than neural networks. 
645 00:36:24,349 --> 00:36:26,509 This point of view is contested 646 00:36:26,509 --> 00:36:32,039 a bit, but so neural networks are not something that I personally use a lot right 647 00:36:32,039 --> 00:36:35,670 now because there's a hard optimization problem and it doesn't always converge, 648 00:36:35,670 --> 00:36:39,969 and yet it actually, sort of, works. It, sort of, works reasonably 649 00:36:39,969 --> 00:36:41,739 well. It's just 650 00:36:41,739 --> 00:36:43,700 because this is fairly complicated, 651 00:36:43,700 --> 00:36:45,269 it's not an algorithm that 652 00:36:45,269 --> 00:36:52,009 I use commonly or that my friends use all 653 00:36:52,009 --> 00:36:56,019 the time. Oh, cool. But let me just go and show you an example of a neural network, which was for many years, 654 00:36:56,019 --> 00:36:56,819 655 00:36:56,819 --> 00:37:00,400 you know, the most effective learning algorithm before support vector 656 00:37:00,400 --> 00:37:02,029 machines were invented. 657 00:37:02,029 --> 00:37:06,349 So here's Yann LeCun's video, and - 658 00:37:06,349 --> 00:37:09,350 well, there's actually audio on this too, the soundboard. So I'll just tell you 659 00:37:09,350 --> 00:37:11,209 what's happening. 660 00:37:11,209 --> 00:37:11,880 661 00:37:11,880 --> 00:37:14,629 What you're seeing is a trained neural network, 662 00:37:14,629 --> 00:37:18,549 and this display where my mouse pointer is pointing, right, this big three 663 00:37:18,549 --> 00:37:19,429 there 664 00:37:19,429 --> 00:37:21,189 is 665 00:37:21,189 --> 00:37:24,799 the input to the neural network. So you're showing the neural network this image, and it's 666 00:37:24,799 --> 00:37:26,859 trying to recognize what is this. 
667 00:37:26,859 --> 00:37:30,769 The final answer output by the neural network is this number up here, right 668 00:37:30,769 --> 00:37:33,189 below where it says LeNet-5, 669 00:37:33,189 --> 00:37:34,229 and the 670 00:37:34,229 --> 00:37:37,759 neural network correctly recognizes this image as a three, 671 00:37:37,759 --> 00:37:39,890 and if you look to the left of this image, 672 00:37:39,890 --> 00:37:42,309 what's interesting about this is 673 00:37:42,309 --> 00:37:46,640 the display on the left portion of this is actually showing the 674 00:37:46,640 --> 00:37:49,999 intermediate computations of the neural network. In other words, it's showing you 675 00:37:49,999 --> 00:37:51,900 what are the hidden layers of the 676 00:37:51,900 --> 00:37:53,829 neural network computing. 677 00:37:53,829 --> 00:37:57,529 And so, for example, if you look at this one, the third image down from the top, 678 00:37:57,529 --> 00:38:01,569 this seems to be computing, you know, certain edges in the digits, 679 00:38:01,569 --> 00:38:05,630 right? It's just detecting edges on the 680 00:38:05,630 --> 00:38:07,489 right-hand side or the bottom or something 681 00:38:07,489 --> 00:38:09,889 of the input display 682 00:38:09,889 --> 00:38:11,479 of the input image, okay? 683 00:38:11,479 --> 00:38:15,129 So let me just play this video, and you can see 684 00:38:15,129 --> 00:38:17,019 some 685 00:38:17,019 --> 00:38:20,619 of the inputs and outputs of the neural network, and 686 00:38:20,619 --> 00:38:21,529 those 687 00:38:21,529 --> 00:38:26,309 are very different fonts. There's this robustness to 688 00:38:26,309 --> 00:38:30,670 noise. 689 00:38:30,670 --> 00:38:36,929 All 690 00:38:36,929 --> 00:38:37,880 691 00:38:37,880 --> 00:38:41,949 692 00:38:41,949 --> 00:38:43,000 right. Multiple 693 00:38:43,000 --> 00:38:49,089 digits, 694 00:38:49,089 --> 00:38:50,739 695 00:38:50,739 --> 00:38:57,739 that's, kind of, cool. All 696 00:38:59,259 --> 00:39:06,259 right. 
697 00:39:12,140 --> 00:39:19,140 698 00:39:30,779 --> 00:39:35,789 699 00:39:35,789 --> 00:39:38,689 700 00:39:38,689 --> 00:39:40,459 701 00:39:40,459 --> 00:39:41,660 702 00:39:41,660 --> 00:39:48,660 703 00:39:52,359 --> 00:39:54,809 So, 704 00:39:54,809 --> 00:39:58,219 just for fun, let me show you one more video, 705 00:39:58,219 --> 00:40:00,579 which was - let's 706 00:40:00,579 --> 00:40:06,089 see. This is another video from the video series, The Machine That Changed the World, which 707 00:40:06,089 --> 00:40:07,490 was produced by WGBH 708 00:40:07,490 --> 00:40:09,059 Television 709 00:40:09,059 --> 00:40:10,800 in 710 00:40:10,800 --> 00:40:12,599 cooperation with the British Broadcasting Corporation, 711 00:40:12,599 --> 00:40:16,159 and it was aired on PBS a few years ago, I think. 712 00:40:16,159 --> 00:40:17,630 I want to show you a 713 00:40:17,630 --> 00:40:19,589 video describing the NETtalk 714 00:40:19,589 --> 00:40:23,629 Neural Network, which was developed by Terry Sejnowski; he's a 715 00:40:23,629 --> 00:40:25,739 researcher. 716 00:40:25,739 --> 00:40:29,589 And so NETtalk was actually one of the major milestones in the 717 00:40:29,589 --> 00:40:31,749 history of neural networks, 718 00:40:31,749 --> 00:40:33,490 and this specific application 719 00:40:33,490 --> 00:40:36,399 is getting the neural network to read text. 720 00:40:36,399 --> 00:40:38,779 So, in other words, can you show a 721 00:40:38,779 --> 00:40:41,659 piece of English to a computer 722 00:40:41,659 --> 00:40:43,419 and have the computer read, 723 00:40:43,419 --> 00:40:46,410 sort of, verbally produce sounds that correspond 724 00:40:46,410 --> 00:40:48,319 to the 725 00:40:48,319 --> 00:40:50,629 reading of the text. 
726 00:40:50,629 --> 00:40:55,279 And it turns out that in the history of AI and the history of machine learning, 727 00:40:55,279 --> 00:40:56,509 this video 728 00:40:56,509 --> 00:41:00,849 created a lot of excitement about neural networks and about machine learning. Part of the 729 00:41:00,849 --> 00:41:05,429 reason was that Terry Sejnowski had the foresight to choose 730 00:41:05,429 --> 00:41:06,699 to use, 731 00:41:06,699 --> 00:41:08,040 in his video, 732 00:41:08,040 --> 00:41:12,940 a child-like voice talking about visiting your grandmother's house and so on. You'll 733 00:41:12,940 --> 00:41:14,289 see it in a second, 734 00:41:14,289 --> 00:41:16,029 and so 735 00:41:16,029 --> 00:41:19,399 this really created the perception of - created 736 00:41:19,399 --> 00:41:23,129 the impression of the neural network being like a young child learning how to speak, 737 00:41:23,129 --> 00:41:25,260 and talking about going to your grandmother's, and so 738 00:41:25,260 --> 00:41:27,379 on. So this actually 739 00:41:27,379 --> 00:41:29,289 helped generate a lot of excitement 740 00:41:29,289 --> 00:41:32,449 within academia and outside academia on neural networks, sort of, early in the history 741 00:41:32,449 --> 00:41:36,109 of neural networks. I'm just gonna show you the video. 742 00:41:36,109 --> 00:41:37,910 [Begin Video] You're going to hear first 743 00:41:37,910 --> 00:41:41,559 what the network sounds like at the very beginning of the training, and it won't 744 00:41:41,559 --> 00:41:43,589 sound like words, but it'll sound 745 00:41:43,589 --> 00:41:48,009 like attempts that will get better and better with 746 00:41:48,009 --> 00:41:54,239 time. [Computer's voice] The network 747 00:41:54,239 --> 00:41:58,249 takes the letters, say the phrase, grandmother's house, and 748 00:41:58,249 --> 00:42:02,679 makes a random attempt at pronouncing it. [Computer's 749 00:42:02,679 --> 00:42:07,219 voice] 750 00:42:07,219 --> 00:42:09,269 Grandmother's house. 
The 751 00:42:09,269 --> 00:42:12,530 phonetic difference between the guess and the right pronunciation 752 00:42:12,530 --> 00:42:16,539 is sent back through the network. [Computer's 753 00:42:16,539 --> 00:42:18,989 voice] 754 00:42:18,989 --> 00:42:20,329 Grandmother's house. 755 00:42:20,329 --> 00:42:23,920 By adjusting the connection strengths after each attempt, 756 00:42:23,920 --> 00:42:26,689 the net slowly improves. And, 757 00:42:26,689 --> 00:42:30,029 finally, after letting it train overnight, 758 00:42:30,029 --> 00:42:32,700 the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's 759 00:42:32,700 --> 00:42:37,759 house. 760 00:42:37,759 --> 00:42:41,149 Well, because she gives us candy. 761 00:42:41,149 --> 00:42:46,789 Well, and we - NETtalk understands nothing about the language. It is simply associating letters 762 00:42:46,789 --> 00:42:49,809 with sounds. 763 00:42:49,809 --> 00:42:52,359 [End Video] All right. So 764 00:42:52,359 --> 00:42:54,380 at the time this was done, I mean, this is 765 00:42:54,380 --> 00:42:57,379 an amazing piece of work. I should say 766 00:42:57,379 --> 00:42:59,140 today there are other 767 00:42:59,140 --> 00:43:03,389 text to speech systems that work better than what you just saw, 768 00:43:03,389 --> 00:43:08,359 and you'll also appreciate getting candy from your grandmother's house is a little bit 769 00:43:08,359 --> 00:43:11,449 less impressive than talking about the Dow Jones falling 15 points, 770 00:43:11,449 --> 00:43:16,149 and profit taking, whatever. So 771 00:43:16,149 --> 00:43:18,789 but I wanted to show that just because that was another cool, 772 00:43:18,789 --> 00:43:25,789 major landmark in the history of neural networks. Okay. 
773 00:43:27,259 --> 00:43:32,949 So let's switch back to the chalkboard, 774 00:43:32,949 --> 00:43:35,859 and what I want to do next 775 00:43:35,859 --> 00:43:39,479 is 776 00:43:39,479 --> 00:43:42,369 tell you about Support Vector Machines, 777 00:43:42,369 --> 00:43:49,369 okay? That, sort of, wraps up our discussion on neural 778 00:44:08,839 --> 00:44:10,850 networks. So I started off talking about neural networks 779 00:44:10,850 --> 00:44:13,269 by motivating it as a way to get 780 00:44:13,269 --> 00:44:17,920 us to output non-linear classifiers, right? I didn't really prove it, but it turns out 781 00:44:17,920 --> 00:44:18,419 that 782 00:44:18,419 --> 00:44:21,799 you'd be able to come up with non-linear division boundaries using a neural 783 00:44:21,799 --> 00:44:22,479 network 784 00:44:22,479 --> 00:44:27,309 like what I drew on the chalkboard earlier. 785 00:44:27,309 --> 00:44:30,469 Support Vector Machines will be another learning algorithm that will give us a way 786 00:44:30,469 --> 00:44:34,529 to come up with non-linear classifiers. It's a very effective, off-the-shelf 787 00:44:34,529 --> 00:44:36,019 learning algorithm, 788 00:44:36,019 --> 00:44:39,399 but it turns out that in the discussion I'm gonna - in 789 00:44:39,399 --> 00:44:43,379 the progression and development I'm gonna pursue, I'm actually going to start 790 00:44:43,379 --> 00:44:47,799 off by describing yet another class of linear classifiers with linear division 791 00:44:47,799 --> 00:44:49,269 boundaries, 792 00:44:49,269 --> 00:44:54,179 and it'll only be later, sort of, in probably the next lecture or the one after that, 793 00:44:54,179 --> 00:44:55,249 that we'll then 794 00:44:55,249 --> 00:44:58,699 take the support vector machine idea and, sort of, do some clever things to it 795 00:44:58,699 --> 00:45:02,959 to make it work very well to generate non-linear division boundaries as well, okay? 
796 00:45:02,959 --> 00:45:07,859 But we'll actually start by talking about linear classifiers a little bit more. 797 00:45:07,859 --> 00:45:12,619 And to do that, I want to 798 00:45:12,619 --> 00:45:14,559 convey two 799 00:45:14,559 --> 00:45:16,689 intuitions about classification. 800 00:45:16,689 --> 00:45:20,700 One is 801 00:45:20,700 --> 00:45:24,450 you think about logistic regression; we have this logistic function that was outputting 802 00:45:24,450 --> 00:45:28,669 the probability that Y equals one, 803 00:45:28,669 --> 00:45:31,049 and it crosses 804 00:45:31,049 --> 00:45:32,959 this line at zero. 805 00:45:32,959 --> 00:45:37,819 So when you run logistic regression, I want you to think of it as 806 00:45:37,819 --> 00:45:41,179 an algorithm that computes 807 00:45:41,179 --> 00:45:43,239 theta transpose X, 808 00:45:43,239 --> 00:45:46,789 and then it predicts 809 00:45:46,789 --> 00:45:51,170 one, right, 810 00:45:51,170 --> 00:45:56,359 if and only if, theta transpose X is greater than zero, right? IFF stands for if 811 00:45:56,359 --> 00:45:59,539 and only if. It means the same thing as a double implication, 812 00:45:59,539 --> 00:46:04,629 and it predicts zero, 813 00:46:04,629 --> 00:46:11,629 if and only if, theta transpose X is less than zero, okay? So if 814 00:46:12,499 --> 00:46:15,019 it's the case that 815 00:46:15,019 --> 00:46:18,069 theta transpose X is much greater than zero, 816 00:46:18,069 --> 00:46:22,309 the double greater than sign means these are much greater than, all right. 817 00:46:22,309 --> 00:46:25,589 So if theta transpose X is much greater than zero, 818 00:46:25,589 --> 00:46:26,739 then, 819 00:46:26,739 --> 00:46:33,069 you know, think of that as a very confident 820 00:46:33,069 --> 00:46:38,779 prediction 821 00:46:38,779 --> 00:46:40,779 that Y is equal to one, 822 00:46:40,779 --> 00:46:43,459 right? 
If theta transpose X is much greater than zero, then we're 823 00:46:43,459 --> 00:46:47,049 gonna predict one, and moreover we're very confident it's one, and the picture for 824 00:46:47,049 --> 00:46:48,479 that is if theta 825 00:46:48,479 --> 00:46:51,179 transpose X is way out here, then 826 00:46:51,179 --> 00:46:54,329 we're estimating that the probability of Y being equal to one 827 00:46:54,329 --> 00:46:56,089 on the sigmoid function, it will 828 00:46:56,089 --> 00:46:58,699 be very close to one. 829 00:46:58,699 --> 00:47:02,779 And, in the same way, if theta transpose X 830 00:47:02,779 --> 00:47:05,190 is much less than zero, 831 00:47:05,190 --> 00:47:12,190 then we're very confident that 832 00:47:12,809 --> 00:47:15,849 Y is equal to zero. 833 00:47:15,849 --> 00:47:19,599 So 834 00:47:19,599 --> 00:47:23,859 wouldn't it be nice - so when we fit logistic regression or some other 835 00:47:23,859 --> 00:47:25,309 classifier to our training set, 836 00:47:25,309 --> 00:47:29,569 then wouldn't it be nice if, 837 00:47:29,569 --> 00:47:30,430 right, 838 00:47:30,430 --> 00:47:34,349 for all I 839 00:47:34,349 --> 00:47:38,739 such that Y is equal to one. 840 00:47:38,739 --> 00:47:39,699 We have 841 00:47:39,699 --> 00:47:41,279 theta 842 00:47:41,279 --> 00:47:45,009 transpose XI is much greater than zero, 843 00:47:45,009 --> 00:47:47,929 and for all I such that Y is equal 844 00:47:47,929 --> 00:47:49,869 to 845 00:47:49,869 --> 00:47:56,869 zero, 846 00:47:57,589 --> 00:47:59,910 we have theta transpose XI is 847 00:47:59,910 --> 00:48:00,970 much less than zero, 848 00:48:00,970 --> 00:48:05,659 okay? So wouldn't it be nice if this is true? 
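Numerically, this first intuition looks like the following: the farther theta transpose X is from zero, the closer the sigmoid pushes the estimated probability to 0 or 1. The sample values of theta transpose X below are arbitrary.

```python
import math

def sigmoid(z):
    # logistic function: maps theta^T x to an estimated P(y = 1 | x)
    return 1.0 / (1.0 + math.exp(-z))

# theta^T x >> 0  -> very confident prediction that y = 1
# theta^T x << 0  -> very confident prediction that y = 0
for z in (-6.0, -0.5, 0.5, 6.0):
    print(f"theta'x = {z:5.1f}  ->  estimated P(y=1|x) = {sigmoid(z):.4f}")
```

For instance, at theta transpose X = 6 the estimated probability is already above 0.99, while at 0.5 it is barely past the decision threshold.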
That, 849 00:48:05,659 --> 00:48:06,380 850 00:48:06,380 --> 00:48:09,929 essentially, if our training set, we can find parameters theta 851 00:48:09,929 --> 00:48:12,749 so that our learning algorithm not only 852 00:48:12,749 --> 00:48:16,839 makes correct classifications on all the examples in a training set, but further it's, sort 853 00:48:16,839 --> 00:48:19,880 of, is very confident about all of those correct 854 00:48:19,880 --> 00:48:22,299 classifications. This is 855 00:48:22,299 --> 00:48:26,859 the first intuition that I want you to have, and we'll come back to this first intuition 856 00:48:26,859 --> 00:48:28,889 in a second 857 00:48:28,889 --> 00:48:33,209 when we talk about functional margins, okay? 858 00:48:33,209 --> 00:48:40,209 We'll define this later. The second 859 00:48:43,539 --> 00:48:48,119 intuition that I want to 860 00:48:48,119 --> 00:48:55,119 convey, 861 00:48:56,769 --> 00:49:00,209 and it turns out for the rest of today's lecture I'm going to assume 862 00:49:00,209 --> 00:49:04,049 that a training set is linearly separable, okay? So by that I mean for the rest of 863 00:49:04,049 --> 00:49:05,619 864 00:49:05,619 --> 00:49:08,690 today's lecture, I'm going to assume that 865 00:49:08,690 --> 00:49:10,900 there is indeed a straight line that can separate 866 00:49:10,900 --> 00:49:15,030 your training set, and we'll remove this assumption later, but 867 00:49:15,030 --> 00:49:17,089 just to develop the algorithm, let's assume a 868 00:49:17,089 --> 00:49:19,409 linearly separable training set. 
And so there's a sense in which, out of all the straight lines that separate the training set, you know, maybe that straight line isn't such a good one, and that one actually isn't such a great one either, but maybe that line in the middle is a much better linear separator than the others, right? And one reason that, when you and I look at it, this one seems best is because this line is just further from the data, all right? That is, it separates the data with a greater distance between your positive examples, your negative examples, and the decision boundary, okay? And this second intuition - we'll come back to this shortly - about this final line that I drew being, maybe, the best line, this notion of distance from the training examples, is the second intuition I want to convey, and we'll formalize it later when we talk about geometric margins of our classifiers, okay?
So in order to describe the support vector machine, unfortunately, I'm gonna have to pull a notation change, and, sort of, unfortunately, it was impossible to do logistic regression, and support vector machines, and all the other algorithms using one completely consistent notation, and so I'm actually gonna change notation slightly for linear classifiers, and that will actually make it much easier for us later today and in next week's lectures to talk about the support vector machine. The notation that I'm gonna use for the rest of today and for most of next week will be that my labels Y, instead of being zero and one, will be minus one and plus one, and in the development of the support vector machine we will have H, the hypothesis, output values that are either plus one or minus one, and so we'll let G of Z be equal to one if Z is greater than or equal to zero, and minus one otherwise, right? So just rather than zero and one, we change everything to plus one and minus one.
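As a tiny sketch of the new convention (Python, with hypothetical input values), the threshold function now outputs plus one or minus one rather than one or zero:

```python
def g(z):
    """SVM label convention: +1 if z >= 0, -1 otherwise."""
    return 1 if z >= 0 else -1

# Note that g(0) = +1, by the "greater than or equal to zero" clause.
labels = [g(2.7), g(0.0), g(-0.3)]
```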
And, finally, whereas previously I wrote H subscript theta of X equals G of theta transpose X, and we had the convention that X0 is equal to one, right - and so X was in RN plus one - I'm gonna drop this convention of letting X0 equal one and letting X be in RN plus one, and instead I'm going to parameterize my linear classifier as H subscript W, B of X equals G of W transpose X plus B, okay? And so B now just plays the role of theta zero, and W now plays the role of the rest of the parameters, theta one through theta N, okay? So just by separating out the intercept term B, rather than lumping it together, it'll make it easier for us to develop support vector machines. So - yes. Student:[Off mic]. Instructor (Andrew Ng): Oh, yes. Right, yes. So W is - right. So W is a vector in RN, and X is now a vector in RN rather than RN plus one, and lowercase B is a real number. Okay. Now, let's formalize the notions of functional margin and geometric margin.
Let me make a definition. I'm going to say that the functional margin of the hyperplane W, B with respect to a specific training example (XI, YI) - WRT stands for with respect to - the functional margin of a hyperplane W, B with respect to a certain training example (XI, YI) is defined as gamma hat I equals YI times (W transpose XI plus B), okay? And so a set of parameters W, B defines a classifier - it, sort of, defines a linear separating boundary - and so when I say hyperplane, I just mean the decision boundary that's defined by the parameters W, B. You know what, if you're confused by the hyperplane term, just ignore it. The functional margin of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And the interpretation of this is that if YI is equal to one, then for the example to have a large functional margin, you want W transpose XI plus B to be large, right?
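The definition on the board can be sketched directly in code - a minimal Python/NumPy version, with made-up W, B, and example values just to illustrate:

```python
import numpy as np

def functional_margin(w, b, x, y):
    """gamma_hat_i = y_i * (w^T x_i + b), with labels y in {-1, +1}."""
    return y * (np.dot(w, x) + b)

w, b = np.array([1.0, 2.0]), -1.0
# Correctly classified positive example: w^T x + b = 1 + 4 - 1 = 4, margin 4.
pos = functional_margin(w, b, np.array([1.0, 2.0]), +1)
# Correctly classified negative example: w^T x + b = -1 - 2 - 1 = -4, margin 4.
neg = functional_margin(w, b, np.array([-1.0, -1.0]), -1)
```

In both cases a correct, confident classification gives a large positive margin, which is the point of folding YI into the definition.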
And if YI is equal to minus one, then in order for the functional margin to be large - we, sort of, want the functional margins to be large, but in order for the functional margin to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay? So this captures the intuition that we had earlier about functional margins - the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small - and this, sort of, collapses the two cases into one statement: that we'd like the functional margin to be large. And notice also that so long as YI times (W transpose XI plus B) - so long as this is greater than zero - that means we classified it correctly, okay?
And one more definition: I'm going to say that the functional margin of a hyperplane with respect to an entire training set - I'm going to define gamma hat to be equal to the min, over all your training examples, of gamma hat I, right? So if you have a training set, if you have more than one training example, I'm going to define the functional margin with respect to the entire training set as the worst case of all of the functional margins over the entire training set. And so for now we have this first intuition of saying that we would like the functional margin to be large, and for our purposes, for now, let's just say we would like the worst-case functional margin to be large, okay? And we'll change this a little bit later as well. Now, it turns out that there's one little problem with this intuition, which will, sort of, trip us up later, which is that it actually turns out to be very easy to make the functional margin large, all right? So, for example, say I have a classifier with parameters W and B.
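The worst-case definition is a one-liner over the training set - a sketch in Python/NumPy with a toy training set (all values invented for illustration):

```python
import numpy as np

def functional_margin_set(w, b, X, y):
    """gamma_hat = min_i y_i (w^T x_i + b): the worst case over the whole set."""
    return np.min(y * (X @ w + b))

w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0], [-1.0, 5.0], [0.5, -3.0]])
y = np.array([+1, -1, +1])
# Per-example margins are 2, 1, and 0.5, so the set's functional margin is 0.5.
worst = functional_margin_set(w, b, X, y)
```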
If I take W and multiply it by two and take B and multiply it by two, then if you refer to the definition of the functional margin - I guess that was, what, gamma hat I equals YI times (W transpose XI plus B) - if I double W and B, then I can easily double my functional margin. So this goal of making the functional margin large, in and of itself, isn't so useful, because it's easy to make the functional margin arbitrarily large just by scaling up the parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that the norm - the L2 norm - of the parameter W is equal to one, and we'll come back to this in a second. All right. And then so - okay. Now, let's talk about - let's see how much time we have, 15 minutes. Well, see, I'm trying to decide how much to try to do in the last 15 minutes. Okay. So let's talk about the geometric margin, and so the geometric margin of a training example - [inaudible], right?
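The scaling problem is easy to see numerically - a quick sketch with invented values:

```python
import numpy as np

w, b = np.array([1.0, -1.0]), 0.5
x, y = np.array([2.0, 1.0]), +1

gamma_hat = y * (w @ x + b)               # w^T x + b = 2 - 1 + 0.5 = 1.5
gamma_hat_2x = y * ((2 * w) @ x + 2 * b)  # doubling (w, b) doubles the margin: 3.0
```

Nothing about the classifier changed - the decision boundary is identical - yet the functional margin doubled, which is why a normalization condition is needed.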
So the decision boundary of my classifier is going to be given by the plane W transpose X plus B equals zero, okay? Right, and these are the X1, X2 axes, say, and we're going to draw relatively few training examples here - I'm drawing deliberately few training examples so that I can add things to this, okay? And so, assuming we classify an example correctly, I'm going to define the geometric margin as just the geometric distance between the training example (XI, YI) and the separating line - the separating hyperplane, okay? That's what I'm going to define the geometric margin to be. And so I'm gonna do some algebra fairly quickly; in case it doesn't make sense, read through the lecture notes more carefully for the details.
Sort of by standard geometry, the normal - in other words, the vector that's 90 degrees to the separating hyperplane - is going to be given by W divided by the norm of W; that's just how planes in high dimensions work. If some of this is unfamiliar, take a look at the lecture notes on the website. And so let's say this distance is gamma I, okay? And I'm going to use the convention that I'll put a hat on top when I'm referring to functional margins, and no hat on top for geometric margins. So let's say the geometric margin, in this example, is gamma I. That means that this point here, right, is going to be XI minus gamma I times W over the norm of W, okay? Because W over the norm of W is the unit vector - the length-one vector - that is normal to the separating hyperplane, and so we subtract gamma I times that unit vector from this point XI - where this point here is XI.
So XI minus, you know, this little vector here is going to be this point that I've drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I times W over the norm of W, okay? And so because this heavy point is on the separating hyperplane, right, this point must satisfy W transpose times that point plus B equals zero, right? Because all points X on the separating hyperplane satisfy the equation W transpose X plus B equals zero, and this point is on the separating hyperplane; therefore it must satisfy W transpose this point - oh, excuse me - plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, I'm, sort of, being slightly fast with this geometry. So if you're not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes.
And so what I'm going to do is I'll just take this equation and solve for gamma, right? So this equation I just wrote down - solve this equation for gamma, or gamma I, and you find that - well, why don't I just do it? You have W transpose XI plus B equals gamma I times W transpose W over the norm of W; that's just equal to gamma I times the norm of W, because W transpose W is the norm of W squared, and, therefore, gamma I is just (W over the norm of W) transpose XI, plus B over the norm of W, okay? And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyperplane defined by the parameters W and B can be computed by this formula, okay? So the last thing I want to do is actually take into account the sign of the - the correct classification of the training example.
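That algebra can be checked numerically - a sketch in Python/NumPy with made-up W, B, and X, which also verifies that the foot of the perpendicular really lands on the hyperplane:

```python
import numpy as np

def geometric_distance(w, b, x):
    """Distance from x to the hyperplane w^T x + b = 0:
       (w/||w||)^T x + b/||w||  (positive on the side w points toward)."""
    return (w @ x + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0   # ||w|| = 5
x = np.array([3.0, 4.0])            # w^T x + b = 9 + 16 - 5 = 20
d = geometric_distance(w, b, x)     # 20 / 5 = 4.0

# The point x - d * w/||w|| should lie on the hyperplane, i.e. satisfy w^T p + b = 0.
foot = x - d * w / np.linalg.norm(w)
residual = w @ foot + b
```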
So I've been assuming that we've been classifying the example correctly. So, more generally, I'll define the geometric margin of a training example to be gamma I equals YI times that thing on top - YI times ((W over the norm of W) transpose XI, plus B over the norm of W), okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so, as before, you know, this says that we would like the geometric margin to be large, and all that means is that so long as we're classifying the example correctly, we would ideally hope for the example to be as far as possible from the separating hyperplane - so long as it's on the right side of the separating hyperplane - and that's what multiplying YI into this does. And so a couple of easy facts: one is, if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily; and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Let's see, okay.
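Both of those easy facts are one-line checks - a sketch with invented W, B, and example values:

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """gamma_i = y_i * ((w/||w||)^T x_i + b/||w||)."""
    return y * (w @ x + b) / np.linalg.norm(w)

w, b = np.array([0.0, 2.0]), -2.0   # ||w|| = 2
x, y = np.array([1.0, 3.0]), +1

geo = geometric_margin(w, b, x, y)  # (6 - 2) / 2 = 2.0
fun = y * (w @ x + b)               # functional margin: 4.0
# geometric margin = functional margin / ||w||
```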
And so one final definition: so far I've defined the geometric margin with respect to a single training example, and so, as before, I'll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right? And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and let me just write that down. The maximum margin classifier poses the following optimization problem: it says choose gamma, W, and B so as to maximize the geometric margin, subject to YI times (W transpose XI plus B) being at least gamma for every training example - well, this is just one way to write it. There are several ways to write this, and one of the things we'll do next time is - actually, I'm trying to figure out if I can do this in five minutes. I'm guessing this could be difficult.
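Written out (including the norm-of-W-equals-one normalization he imposes in a moment), the board's optimization problem is:

```latex
\begin{aligned}
\max_{\gamma,\,w,\,b} \quad & \gamma \\
\text{s.t.} \quad & y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \gamma, \quad i = 1,\dots,m, \\
& \|w\| = 1 .
\end{aligned}
```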
Well, so this maximum margin classifier is a maximization problem over parameters gamma, W, and B, and, for now, it turns out that the geometric margin doesn't change depending on the norm of W, right? Because in the definition of the geometric margin, notice that we're dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesn't change the geometric margin. This will actually be important, and we'll come back to this later. Notice that you can take the parameters W, B and impose any normalization constant on them, or you can change W and B by any scaling factor - replace them by ten W and ten B, whatever - and it does not change the geometric margin, okay?
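That scale invariance - the counterpart of the functional margin's scaling problem from earlier - can be checked with a quick sketch (invented values again):

```python
import numpy as np

w, b = np.array([1.0, 2.0]), -3.0
x, y = np.array([4.0, 1.0]), +1

def geo_margin(w, b):
    """Geometric margin of (x, y) under parameters (w, b)."""
    return y * (w @ x + b) / np.linalg.norm(w)

g_original = geo_margin(w, b)
g_rescaled = geo_margin(10 * w, 10 * b)  # the factor of 10 cancels in the division
```

Unlike the functional margin, rescaling (W, B) leaves the geometric margin untouched, which is what lets us impose a normalization like the norm of W equals one for free.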
And so in this first formulation, I'm just gonna impose a constraint and say that the norm of W is one, so that the functional and the geometric margins will be the same, and then we'll say: maximize the geometric margin - you maximize gamma subject to every training example having functional margin at least gamma, and this is a geometric margin because when the norm of W is equal to one, the functional and the geometric margins are identical, okay? So this is the maximum margin classifier, and it turns out that if you do this, it'll run, you know, maybe about comparably to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to change this algorithm to let it work in infinite-dimensional feature spaces and come up with very efficient non-linear classifiers. So there's a ways to go before we turn this into a support vector machine, but this is the first step. So are there questions about this? Yeah. Student:[Off mic].
Instructor (Andrew Ng): For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything I've said is for a fixed training set, so you can't change the X's, and you can't change the Y's. Are there other questions? Okay. So, all right. Next week we will take this, and we'll talk about optimization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms. And just a final reminder again: the next discussion session will be on Matlab and Octave, so show up for that if you want to see a tutorial. Okay. See you guys in the next class.