This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today is talk about a different type of learning algorithm, and, in particular, start to talk about generative learning algorithms and the specific algorithm called Gaussian Discriminant Analysis. We'll take a slight digression to talk about Gaussians, and I'll briefly discuss generative versus discriminative learning algorithms, and then hopefully wrap up today's lecture with a discussion of Naive Bayes and Laplace smoothing.

So just to motivate our discussion of generative learning algorithms, right, so by way of contrast, the sort of classification algorithms we've been talking about, I think of as algorithms that do this. You're given a training set, and if you run an algorithm like logistic regression on that training set, the way I think of logistic regression is that it's trying to look at the data and find a straight line to divide the crosses and O's, right? So it's, sort of, trying to find a straight line, let me just make the data a bit noisier, trying to find a straight line that separates out the positive and the negative classes as well as possible. And, in fact, I'll show it on the laptop; maybe just use the screens or the small monitors for this. In fact, you can see there's the data set with logistic regression, and so I've initialized the parameters randomly, and so logistic regression is, kind of, outputting, it's the, kind of, hypothesis at iteration zero that's the straight line shown in the bottom right. And so after one iteration of gradient descent, the straight line changes a bit. After two iterations, three, four, until logistic regression converges and has found the straight line that, more or less, separates the positive and negative class, okay? So you can think of this as logistic regression, sort of, searching for a line that separates the positive and the negative classes.
What I want to do today is talk about an algorithm that does something slightly different, and to motivate us, let's use our old example of trying to classify tumors as malignant cancer or benign cancer, right? So a patient comes in and they have a cancer; you want to know if it's a malignant, or harmful, cancer, or if it's a benign, meaning a harmless, cancer. So rather than trying to find the straight line to separate the two classes, here's something else we could do. We can go through our training set and look at all the cases of malignant cancers, you know, look through our training set for all the positive examples of malignant cancers, and we can then build a model for what malignant cancers look like. Then we'll go through our training set again and take out all of the examples of benign cancers, and then we'll build a model for what benign cancers look like, okay? And then when you need to classify a new example, when you have a new patient and you want to decide whether this cancer is malignant or benign, you take your new cancer, and you match it to your model of malignant cancers, and you match it to your model of benign cancers, and you see which model it matches better, and depending on which model it matches better, you then predict whether the new cancer is malignant or benign, okay?

So what I just described, this class of methods where you build a separate model for malignant cancers and a separate model for benign cancers, is called a generative learning algorithm, and let me just, kind of, formalize this.

So the models that we've been talking about previously were actually all discriminative learning algorithms, and stated more formally, a discriminative learning algorithm is one that either learns P of Y given X directly, or even learns a hypothesis that outputs values 0, 1 directly, okay? So logistic regression is an example of a discriminative learning algorithm. In contrast, a generative learning algorithm models P of X given Y.
That is, the probability of the features given the class label, and, as a technical detail, it also models P of Y, but that's a less important thing. The interpretation of this is that a generative model builds a probabilistic model for what the features look like, conditioned on the class label, okay? In other words, conditioned on whether a cancer is malignant or benign, it models a probability distribution over what the features of the cancer look like. Then having built this model, having built a model for P of X given Y and P of Y, then by Bayes' rule, obviously, you can compute P of Y = 1 conditioned on X. This is just P of X given Y = 1 times P of Y = 1, divided by P of X, and, if necessary, you can calculate the denominator as P of X = P of X given Y = 1 times P of Y = 1 plus P of X given Y = 0 times P of Y = 0, right?

And so by modeling P of X given Y and modeling P of Y, you can actually use Bayes' rule to get back to P of Y given X, but a generative learning algorithm starts by modeling P of X given Y, rather than P of Y given X, okay? We'll talk about some of the tradeoffs, and why this may be a better or worse idea than a discriminative model, a bit later.
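As a minimal sketch of that Bayes' rule step (the function and variable names here are mine, not from the lecture), assuming the two class-conditional density values and the class prior are already available:

```python
def posterior_y1(px_given_y1, px_given_y0, p_y1):
    """P(y=1 | x) by Bayes' rule, given p(x|y=1), p(x|y=0), and the prior P(y=1)."""
    # Numerator: p(x | y=1) * P(y=1)
    num = px_given_y1 * p_y1
    # Denominator: p(x) = p(x|y=1) P(y=1) + p(x|y=0) P(y=0)
    den = num + px_given_y0 * (1.0 - p_y1)
    return num / den
```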
Let's go through a specific example of a generative learning algorithm, and for this specific motivating example, I'm going to assume that your input features X are in R^n and are continuous values, okay? And under this assumption, let me describe to you a specific algorithm called Gaussian Discriminant Analysis, and the, I guess, core assumption is that in the Gaussian discriminant analysis model we're going to assume that P of X given Y is Gaussian, okay? So actually, just raise your hand: how many of you have seen a multivariate Gaussian before, not a 1D Gaussian, but the higher-dimensional one? Okay, cool, like maybe half of you, two-thirds of you.

So let me just say a few words about Gaussians, and for those of you that have seen it before, it'll be a refresher. So we say that a random variable Z is distributed Gaussian, multivariate Gaussian, and write Z ~ N(mu, Sigma), with the script N for normal, with parameters mean mu and covariance matrix Sigma, if Z has density p(Z) = 1 / ((2 pi)^(n/2) |Sigma|^(1/2)) * exp(-1/2 (Z - mu)^T Sigma^(-1) (Z - mu)), okay? That's the formula for the density; it's the generalization of the one-dimensional Gaussian, the familiar bell-shaped curve, to a higher-dimensional, vector-valued random variable Z. Don't worry too much about this formula for the density. You rarely end up needing to use it, but the two key quantities are: this vector mu is the mean of the Gaussian, and this matrix Sigma is the covariance matrix. And so Sigma will be equal to, right, the definition of the covariance of a vector-valued random variable, which is E[(Z - mu)(Z - mu) transpose], okay? And, actually, if this doesn't look familiar to you, you might re-watch the discussion section that the TAs held last Friday, or the one that they'll be holding later this week, on, sort of, a recap of probability, okay?

So a multivariate Gaussian is parameterized by a mean and a covariance, and let me just, can I have the laptop displayed, please? I'll just go ahead and actually show you, you know, graphically, the effects of varying the parameters of a Gaussian. So what I have up here is the density of a zero-mean Gaussian with covariance matrix equal to the identity. The covariance matrix is shown in the upper right-hand corner of the slide, and there's the familiar bell-shaped curve in two dimensions.
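As a small illustration, here is one way to evaluate that density numerically, using SciPy's multivariate_normal (my choice of tool, not something from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Zero-mean 2D Gaussian with identity covariance, like the first plot shown.
mu = np.zeros(2)
Sigma = np.eye(2)
rv = multivariate_normal(mean=mu, cov=Sigma)

print(rv.pdf([0.0, 0.0]))   # density at the mean, the peak of the bell

# Density over a grid, e.g. for a surface or contour plot.
xs, ys = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
density = rv.pdf(np.dstack([xs, ys]))   # shape (100, 100)
```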
And so if I shrink the covariance matrix, instead of the covariance being the identity, if I shrink the covariance matrix, then the Gaussian becomes more peaked, and if I widen the covariance, say to twice the identity, then the density becomes more spread out, okay? Then back to the standard normal with identity covariance. If I increase the off-diagonals of the covariance matrix, right, if I make the variables correlated, the Gaussian becomes flattened out along this X = Y direction, and if I increase it even further, then my two variables, Z1 and Z2 on the horizontal axes, become even more correlated. I'll just show the same thing in contours. The standard normal distribution has contours that are actually circles; because of the aspect ratio these look like ellipses, but they should actually be circles. And if you increase the off-diagonals of the Gaussian covariance matrix, then it becomes ellipses aligned along the, sort of, 45 degree angle in this example. This is the same thing. Here's an example of a Gaussian density with negative covariances, so now the correlation goes the other way, and that's an even stronger negative covariance, and the same thing in contours: this is a Gaussian with negative entries on the off-diagonals, and then even larger entries on the off-diagonals, okay? And the other parameter for the Gaussian is the mean parameter, so starting with mu equal to zero and then changing the mean parameter, the location of the Gaussian just moves around, okay?
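Here is a small sketch (my own, using matplotlib and SciPy) of the contour plots just described, varying the covariance while keeping the mean at zero:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([xs, ys])

covariances = [
    np.eye(2),                      # standard normal: circular contours
    np.array([[1.0, 0.8],
              [0.8, 1.0]]),         # positive off-diagonals: ellipses along the 45-degree line
    np.array([[1.0, -0.8],
              [-0.8, 1.0]]),        # negative off-diagonals: correlation goes the other way
]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, Sigma in zip(axes, covariances):
    density = multivariate_normal(mean=np.zeros(2), cov=Sigma).pdf(grid)
    ax.contour(xs, ys, density)
    ax.set_aspect("equal")          # so circles actually look like circles
plt.show()
```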
All right. So that was a quick primer on what Gaussians look like, and here's, as a roadmap or as a picture to keep in mind when we describe the Gaussian discriminant analysis algorithm, this is what we're going to do. Here's the training set, and in the Gaussian discriminant analysis algorithm, what I'm going to do is look at the positive examples, say the crosses, and just looking at only the positive examples, I'm gonna fit a Gaussian distribution to the positive examples, and so maybe I end up with a Gaussian distribution like that, okay? So there's P of X given Y = 1. And then I'll look at the negative examples, the O's in this figure, and I'll fit a Gaussian to that, and maybe I get a Gaussian centered over there; that's the contour of my second Gaussian. And together, we'll say how later, together these two Gaussian densities will define a separator for these two classes, okay? And it'll turn out that the separator will be a little bit different from what logistic regression gives you. If you run logistic regression, you actually get the decision boundary shown by the green line, whereas Gaussian discriminant analysis gives you the blue line, okay? Switch back to chalkboard, please.

All right. Here's the Gaussian discriminant analysis model. We're going to model P of Y as a Bernoulli random variable as usual, parameterized by a parameter phi; you've seen this before. We model P of X given Y = 0 as a Gaussian. Oh, you know what? Yeah, yes, excuse me. I thought this looked strange. This should be a Sigma; there should be a determinant of Sigma to the one-half in the denominator there. It's no big deal; I was missing the determinant of Sigma to the one-half on a previous board, excuse me. Okay, and so I model P of X given Y = 0 as a Gaussian with mean mu0 and covariance Sigma, and, similarly, P of X given Y = 1 as a Gaussian with mean mu1 and the same covariance Sigma, okay?
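Written out cleanly, the model on the board is the standard GDA setup:

```latex
\begin{aligned}
y &\sim \mathrm{Bernoulli}(\phi) \\
x \mid y = 0 &\sim \mathcal{N}(\mu_0, \Sigma) \\
x \mid y = 1 &\sim \mathcal{N}(\mu_1, \Sigma) \\
p(x \mid y) &= \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_y)^{\top} \Sigma^{-1} (x - \mu_y) \right)
\end{aligned}
```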
And so the parameters of this model are phi, mu0, mu1, and Sigma, and so I can now write down the likelihood of the parameters, or, excuse me, actually, the log likelihood of the parameters, as the log of that, right? So, in other words, if I'm given the training set, then I can write down the log likelihood of the parameters as the log of the product over the training examples of the probabilities P of XI, YI, right? And this is just equal to that, where each of these terms, P of XI given YI or P of YI, is then given by one of these three equations on top, okay?

And I just want to contrast this again with discriminative learning algorithms, right? So to give this a name, I guess, this is sometimes actually called the joint data likelihood, the joint likelihood, and let me just contrast this with what we had previously when we were talking about logistic regression, where I said the log likelihood of the parameter theta was the log of the product from I = 1 to M of P of YI given XI, parameterized by theta, right? So back when we were fitting logistic regression models or generalized linear models, we were always modeling P of YI given XI, parameterized by theta, and that was the conditional likelihood, okay, in which we're modeling P of YI given XI, whereas, now, in generative learning algorithms, we're going to look at the joint likelihood, which is P of XI, YI, okay? So let's see.
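For reference, the two likelihoods being contrasted, written out:

```latex
\begin{aligned}
\text{joint (generative):}\quad
\ell(\phi, \mu_0, \mu_1, \Sigma)
  &= \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma\big)
   = \log \prod_{i=1}^{m} p\big(x^{(i)} \mid y^{(i)}\big)\, p\big(y^{(i)}\big) \\
\text{conditional (discriminative):}\quad
\ell(\theta)
  &= \log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)
\end{aligned}
```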
So given the training set, and using the Gaussian discriminant analysis model, to fit the parameters of the model we'll do maximum likelihood estimation as usual, and so you maximize your log likelihood L with respect to the parameters phi, mu0, mu1, Sigma, and if you find the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate of phi is actually no surprise, and I'm writing this down mainly as practice for indicator notation, all right? So the maximum likelihood estimate for phi would be the sum over I of YI, divided by M, or, written alternatively, the sum over all your training examples of the indicator of YI = 1, divided by M, okay? In other words, the maximum likelihood estimate for the Bernoulli parameter phi is just the fraction of training examples with label one, with Y equal to 1.

The maximum likelihood estimate for mu0 is this, okay? You should stare at this for a second and see if it makes sense. Actually, I'll just write the one for mu1 on the next board while you do that. Okay? So what this is: the denominator is the sum over your training set of the indicator of YI = 0. So for every training example for which YI = 0, this will increment the count by one, all right? So the denominator is just the number of examples with label zero, all right? And then the numerator will be, let's see, the sum from I = 1 to M, and every time YI is equal to 0, this will be a one, and otherwise, this thing will be zero, and so this indicator function means that you're including only the terms for which YI is equal to zero, because for all the times where YI is equal to one, this term in the sum will be equal to zero, and then you multiply that by XI, and so the numerator is really the sum of the XI's corresponding to examples where the class labels were zero, okay?
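Here is a minimal NumPy sketch of those maximum likelihood estimates (variable names are mine; the shared-covariance formula is the one from the lecture notes, which the lecture defers to):

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA: X is (m, n), y is (m,) with values 0/1."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m = X.shape[0]
    phi = np.mean(y == 1)                       # fraction of examples with label 1
    mu0 = X[y == 0].mean(axis=0)                # average of the negative examples
    mu1 = X[y == 1].mean(axis=0)                # average of the positive examples
    # Shared covariance: average outer product of (x - mu_{y}) over all examples.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma
```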
Raise your hand if this makes sense. Okay, cool. So just to say this fancifully, this just means: look through your training set, find all the examples for which Y = 0, and take the average of the value of X for all the examples for which Y = 0. So take all your negative training examples and average the values of X, and that's mu0, okay? If this notation is still a little bit cryptic, if you're still not sure why this equation translates into what I just said, do go home and stare at it for a while until it just makes sense. This is, sort of, no surprise. It just says, to estimate the mean for the negative examples, take all your negative examples and average them. So no surprise, but this is useful practice for indicator notation. [Inaudible] derive the maximum likelihood estimate for Sigma. I won't do that; you can read that in the notes yourself.

And so having fit the parameters phi, mu0, mu1, and Sigma to your data, well, you now need to make a prediction. You know, when you're given a new value of X, when you're given a new cancer, you need to predict whether it's malignant or benign. Your prediction is then going to be, let's say, the most likely value of Y given X. I should write a semicolon and the parameters there, but I'll just leave that out. So this is the arg max over Y of P of Y given X, which, by Bayes' rule, is the arg max over Y of P of X given Y times P of Y, divided by P of X, all right? And that is, in turn, just the arg max over Y of P of X given Y times P of Y, because the denominator P of X doesn't depend on Y. And if P of Y is uniform, in other words, if each of your classes is equally likely, so if P of Y takes the same value for all values of Y, then this is just the arg max over Y of P of X given Y, okay? That happens sometimes, maybe not very often, so usually you end up using this formula, where you compute P of X given Y and P of Y using your model, okay?
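A sketch of that prediction rule (my own code; it assumes fitted parameters like those returned by the fit_gda sketch above, which is a hypothetical helper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_gda(x, phi, mu0, mu1, Sigma):
    """Return arg max over y of p(x|y) P(y), the most likely class for a new point x."""
    score0 = multivariate_normal(mean=mu0, cov=Sigma).pdf(x) * (1.0 - phi)
    score1 = multivariate_normal(mean=mu1, cov=Sigma).pdf(x) * phi
    # p(x) in the denominator is the same for both classes, so it can be dropped.
    return int(score1 > score0)
```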
Student: Can you give us arg max? Instructor (Andrew Ng): Oh, let's see. So if you take, actually let me. So arg max means the value of Y that maximizes this. Student: Oh, okay. Instructor (Andrew Ng): So just as an example, the min over X of (X - 5) squared is 0, because by choosing X equals 5, you can get this to be zero, and the arg min over X of (X - 5) squared is equal to 5, because 5 is the value of X that minimizes it, okay? Cool. Thanks for asking that.

Instructor (Andrew Ng): Okay. Actually, any other questions about this? Yeah? Student: Why is it distributed uniformly? Why isn't [inaudible]? Instructor (Andrew Ng): Oh, I see. By uniform I meant, I was being loose here. I meant if P of Y = 0 is equal to P of Y = 1, or if Y has the uniform distribution over the set 0 and 1. Student: Oh. Instructor (Andrew Ng): I just meant, yeah, if P of Y = 0 equals P of Y = 1. That's all I mean, see? Anything else? All right. Okay.

So it turns out Gaussian discriminant analysis has an interesting relationship to logistic regression. Let me illustrate that. So let's say you have a training set; actually, let me just go ahead and draw a 1D training set, and that will kind of work, yes, okay. So let's say we have a training set comprising a few negative and a few positive examples, and let's say I run Gaussian discriminant analysis. So I'll fit a Gaussian density to each of these two classes, to my positive and to my negative training examples, and so maybe my positive examples, the X's, are fit with a Gaussian like this, and my negative examples I will fit with a Gaussian that looks like that, okay?

Now, I hope you can see this [inaudible]. Now, let's vary along the X axis, and what I want to do is I'll overlay on top of this plot, I'm going to plot P of Y = 1 given X, for a variety of values of X, okay? So I actually realize what I should have done.
I'm gonna call the X's the negative examples, and I'm gonna call the O's the positive examples; it just makes this part come out better. So let's take a value of X that's fairly small. Let's say X is this value here on the horizontal axis. Then what's the probability of Y being equal to one conditioned on X? Well, the way you calculate that is you write P of Y = 1 given X, and then you plug in all these formulas as usual, right? It's P of X given Y = 1, which is your Gaussian density, times P of Y = 1, you know, which is essentially, this is just going to be equal to phi, and then divided by, right, P of X, and then this shows you how you can calculate this. By using these two Gaussians and my phi in P of Y, I can actually compute what P of Y = 1 given X is, and in this case, if X is this small, clearly it belongs to the left Gaussian. It's very unlikely to belong to the positive class, and so this will be very small; it'll be very close to zero, say, okay? And then we can increment the value of X a bit and study a different value of X, and plot what P of Y = 1 given X is, and, again, it'll be pretty small. Let's use a point like that, right? At this point, the two Gaussian densities have equal value, and if I ask, if X is this value, right, shown by the arrow, what's the probability of Y being equal to one for that value of X? Well, you really can't tell, so maybe it's about 0.5, okay? And if you fill in a bunch more points, you get a curve like that, and then you can keep going. Let's say for a point like that, you can ask what's the probability of Y being one? Well, if X is that far out, then clearly it belongs to this rightmost Gaussian, and so the probability of Y being one would be very high; it would be almost one, okay? And so you can repeat this exercise for a bunch of points.
All right, compute P of Y equals one given X for a bunch of points, and if you connect up these points, you find that the curve you get, plotted, takes the form of a sigmoid function, okay? So, in other words, when you make the assumptions of the Gaussian discriminant analysis model, that P of X given Y is Gaussian, when you go back and compute what P of Y given X is, you actually get back exactly the same sigmoid functional form that we were using in logistic regression, okay? But it turns out the key difference is that Gaussian discriminant analysis will end up choosing a different position and a different steepness of the sigmoid than would logistic regression. Is there a question? Student: I'm just wondering, the Gaussian of P of Y [inaudible] you do? Instructor (Andrew Ng): No, let's see. The Gaussian, so this Gaussian is P of X given Y = 1, and this Gaussian is P of X given Y = 0; does that make sense? Anything else? Student: Okay. Instructor (Andrew Ng): Yeah? Student: When you were drawing all the dots, how did you decide what P of Y given X was? Instructor (Andrew Ng): What, say that again. Student: I'm sorry. Could you go over how you figured out where to draw each dot? Instructor (Andrew Ng): Let's see, okay. So the computation is as follows, right? The steps are: I have the training set, and so given my training set, I'm going to fit a Gaussian discriminant analysis model to it, and what that means is I'll build a model for P of X given Y = 1, I'll build a model for P of X given Y = 0, and I'll also fit a Bernoulli distribution to P of Y, okay? So, in other words, given my training set, I'll fit P of X given Y and P of Y to my data, and now I've chosen my parameters phi, mu0, mu1, and Sigma, okay? Then this is the process I went through to plot all these dots, right? It's just: I pick a point on the X axis, and then I compute P of Y given X for that value of X, and P of Y = 1 conditioned on X will be some value between zero and one.
It'll be some real number, and whatever that real number is, I then plot it on the vertical axis, okay? And the way I compute P of Y = 1 conditioned on X is I use these quantities. I use P of X given Y and P of Y, and, sort of, plug them into Bayes' rule, and that allows me to compute P of Y given X from these three quantities; does that make sense? Student: Yeah. Instructor (Andrew Ng): Was there something more? Student: And how did you model P of X; is that... Instructor (Andrew Ng): Oh, okay. Yeah, so, well, I've got this right here. So P of X can be written as, right, P of X given Y = 0 times P of Y = 0, plus P of X given Y = 1 times P of Y = 1, right? And so each of these terms, P of X given Y and P of Y, these are terms I can get directly out of my Gaussian discriminant analysis model. Each of these terms is something that my model gives me directly, so that's plugged in as the denominator, and by doing that, that's how I compute P of Y = 1 given X; make sense? Student: Thank you. Instructor (Andrew Ng): Okay. Cool.
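A minimal 1D sketch of that dot-plotting procedure (my own code, with made-up parameter values standing in for the fitted ones); the resulting curve has the sigmoid shape described above:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Fitted 1D GDA parameters (hypothetical values, just for the picture).
phi, mu0, mu1, sigma = 0.5, -1.5, 1.5, 1.0

xs = np.linspace(-5, 5, 400)
p_x_y1 = norm(mu1, sigma).pdf(xs)
p_x_y0 = norm(mu0, sigma).pdf(xs)
# Bayes' rule at every x; the result traces out a sigmoid-shaped curve.
p_y1_given_x = p_x_y1 * phi / (p_x_y1 * phi + p_x_y0 * (1 - phi))

plt.plot(xs, p_y1_given_x)
plt.xlabel("x")
plt.ylabel("P(y = 1 | x)")
plt.show()
```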
So let's talk a little bit about the advantages and disadvantages of using a generative learning algorithm, okay? So in the particular case of Gaussian discriminant analysis, we assume that X conditioned on Y is Gaussian, and the argument I showed on the previous chalkboard, I didn't prove it formally, but you can actually go back and prove it yourself, is that if you assume X given Y is Gaussian, then that implies that when you plot Y given X, you find, well, let me just write: logistic posterior, okay?

And the argument I showed just now, which I didn't prove, you can go home and prove it yourself, is that if you assume X given Y is Gaussian, then that implies that the posterior distribution, the form of P of Y = 1 given X, is going to be a logistic function, and it turns out this implication in the opposite direction does not hold true, okay? In particular, it actually turns out, this is actually, kind of, cool. It turns out that if you assume X given Y = 1 is Poisson with parameter lambda 1, and X given Y = 0 is Poisson with parameter lambda 0, it turns out if you assume this, then that also implies that P of Y given X is logistic, okay? So there are lots of assumptions on X given Y that will lead to P of Y given X being logistic, and, therefore, the assumption that X given Y is Gaussian is a stronger assumption than the assumption that Y given X is logistic, okay? Because this implies this, right? That means that this is a stronger assumption than this, because the logistic posterior holds whenever X given Y is Gaussian, but not vice versa.
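For reference, here is how the "go home and prove it yourself" algebra works out under the shared-covariance GDA model (a standard derivation, not worked on the board):

```latex
p(y=1 \mid x)
  = \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)}
  = \frac{1}{1 + \exp\!\big(-(\theta^{\top} x + \theta_0)\big)},
\qquad
\theta = \Sigma^{-1}(\mu_1 - \mu_0),
\quad
\theta_0 = -\tfrac{1}{2}\mu_1^{\top}\Sigma^{-1}\mu_1
          + \tfrac{1}{2}\mu_0^{\top}\Sigma^{-1}\mu_0
          + \log\frac{\phi}{1-\phi}.
```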
And so this leads to some of the tradeoffs between Gaussian discriminant analysis and logistic regression, right? Gaussian discriminant analysis makes a much stronger assumption, that X given Y is Gaussian, and so when this assumption is true, when this assumption approximately holds, if you plot the data and X given Y is, indeed, approximately Gaussian, then if you make this assumption explicit to the algorithm, the algorithm will do better, because it's as if the algorithm is making use of more information about the data. The algorithm knows that the data is Gaussian, right? And so if the Gaussian assumption, you know, holds or roughly holds, then Gaussian discriminant analysis may do better than logistic regression. If, conversely, you're actually not sure what X given Y is, then logistic regression, the discriminative algorithm, may do better, and, in particular, say you use logistic regression, and maybe you thought beforehand that the data was Gaussian, but it turns out the data was actually Poisson, right? Then logistic regression will still do perfectly fine, because if the data were actually Poisson, then P of Y = 1 given X will be logistic, and it'll do perfectly fine, but if you assumed it was Gaussian, then the algorithm may go off and do something that's not as good, okay?

So it turns out that, right, so it's slightly different. It turns out the real advantage of generative learning algorithms is often that they require less data, and, in particular, data is never really exactly Gaussian, right? Because data is only ever approximately Gaussian; it's never exactly Gaussian. And it turns out generative learning algorithms often do surprisingly well even when these modeling assumptions are not met. But the other tradeoff is that, by making stronger assumptions about the data, Gaussian discriminant analysis often needs less data in order to fit, like, an okay model, even if there's less training data. Whereas, in contrast, logistic regression, by making fewer assumptions, is more robust to your modeling assumptions, because you're making a weaker assumption, you're making fewer assumptions, but sometimes it takes a slightly larger training set to fit than Gaussian discriminant analysis. Question? Student: In order to meet any assumption about the number [inaudible], plus here we assume that P of Y = 1 equals [inaudible] the number of [inaudible]. Is that true when the number of samples is marginal? Instructor (Andrew Ng): Okay. So let's see.
So there's a question of, is this true, what was that? Let me translate that differently. So the modeling assumptions are made independently of the size of your training set, right? So, like, in least squares regression, well, in all of these models, I'm assuming that these are random variables drawn from some distribution, and then, finally, I'm given a single training set and use that to estimate the parameters of the distribution, right? Student: So what's the probability of Y = 1? Instructor (Andrew Ng): Probability of Y = 1? Student: Yeah, you used the... Instructor (Andrew Ng): Sort of, this, like, goes back to the philosophy of maximum likelihood estimation, right? I'm assuming that P of Y is equal to phi to the Y, times 1 minus phi to the 1 minus Y. So I'm assuming that there's some true value of phi generating all my data, and then, well, when I write this, I guess, maybe what I should write isn't, so when I write this, I guess there are really two values of phi. One is the true underlying value of phi that gets used to generate the data, and then there's the maximum likelihood estimate of the value of phi, and so when I was writing those formulas earlier, those formulas for phi, and mu0, and mu1 were really the maximum likelihood estimates for phi, mu0, and mu1, and that's different from the true underlying values of phi, mu0, and mu1, but... Student: [Off mic]. Instructor (Andrew Ng): Yeah, right. So the maximum likelihood estimate comes from the data, and there's some, sort of, true underlying value of phi that I'm trying to estimate, and my maximum likelihood estimate is my attempt to estimate the true value, but, you know, by notational convention, we often just write it like that as well, without bothering to distinguish between the maximum likelihood value and the true underlying value that I'm assuming is out there, and that I'm only hoping to estimate.
Actually, yeah, so for questions like these about maximum likelihood and so on, I hope to use the Friday discussion section as a good time to ask questions about, sort of, probabilistic definitions like these as well. Are there any other questions? No? Great. Okay.

So, great. Oh, it turns out, just to mention one more thing that's, kind of, cool. I said that if X given Y is Poisson, you also get a logistic posterior. It actually turns out there's a more general version of this. If you assume X given Y = 1 is exponential family with natural parameter eta 1, and you assume X given Y = 0 is exponential family with natural parameter eta 0, then this implies that P of Y = 1 given X is also logistic, okay? And that's, kind of, cool. It means that X given Y could be, I don't know, some strange thing. It could be gamma, because we've seen Gaussian, right, next to the, I don't know, gamma, exponential. They could actually be beta; I'm just rattling off my mental list of exponential family distributions. It could be any one of those things, so as long as you use the same exponential family distribution for the two classes, with different natural parameters, the posterior P of Y = 1 given X will be logistic, and so this shows the robustness of logistic regression to the choice of modeling assumptions, because it could be that the data was actually, you know, gamma distributed, and the posterior still turns out to be logistic. So it's the robustness of logistic regression to modeling assumptions.

And, I think, early on I promised two justifications for where I pulled the logistic function out of the hat, right? So one was the exponential family derivation we went through last time, and this is, sort of, the second one: that all of these modeling assumptions also lead to the logistic function. Yeah? Student: [Off mic]. Instructor (Andrew Ng): Oh, if P of Y = 1 given X is logistic, does that imply this? No. That is also not true, right?
Yeah, so this exponential family distribution implies that P of Y = 1 given X is logistic, but the reverse implication is also not true. There are actually all sorts of really bizarre distributions for X that would give rise to a logistic function, okay? Okay. So that was our first generative learning algorithm. Maybe I'll talk about the second generative learning algorithm, and the motivating example, actually this is called the Naive Bayes algorithm, and the motivating example that I'm gonna use will be spam classification. All right. So let's say that you want to build a spam classifier to take your incoming stream of email and decide if it's spam or not.

So let's see. Y will be 0 or 1, with 1 being spam email and 0 being non-spam, and the first decision we need to make is, given a piece of email, how do you represent a piece of email using a feature vector X, right? So email is just a piece of text, right? Email is like a list of words or a list of ASCII characters. So how can I represent email as a feature vector X? We'll use a couple of different representations, but the one I'll use today is: we will construct the vector X as follows. I'm gonna go through my dictionary and, sort of, make a listing of all the words in my dictionary, okay? So the first word is "a". The second word in my dictionary is "aardvark", then "aardwolf", okay? You know, and somewhere along the way you see the word "buy", as in the spam email telling you to buy stuff. And, depending on how you collect your list of words, you know, you won't find CS229, right, the course number, in a dictionary, but if you collect a list of words from other emails you've gotten, you may have this in your list somewhere as well, and then the last word in my dictionary is "zymurgy", which pertains to the technological chemistry that deals with the fermentation process in brewing. So say I get a piece of email, and what I'll do is I'll then scan through this list of words, and wherever a certain word appears in my email, I'll put a 1 there.
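Here is a minimal sketch of that feature construction (the tiny vocabulary and the helper name are made up for illustration):

```python
def email_to_feature_vector(email_text, vocabulary):
    """Binary bag-of-words vector: x_j = 1 if the j-th dictionary word appears in the email."""
    words_in_email = set(email_text.lower().split())
    return [1 if word in words_in_email else 0 for word in vocabulary]

# Example with a tiny made-up vocabulary.
vocab = ["a", "aardvark", "buy", "cs229", "zymurgy"]
print(email_to_feature_vector("Buy now and get a discount", vocab))   # [1, 0, 1, 0, 0]
```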
So if a particular 823 00:46:41,249 --> 00:46:44,319 email has the word aid then that's 1. 824 00:46:44,319 --> 00:46:46,560 You know, my email doesn't have the words ausworth 825 00:46:46,560 --> 00:46:49,209 or aardvark, so it gets zeros. And again, 826 00:46:49,209 --> 00:46:52,939 a piece of email, they want me to buy something, CS229 doesn't occur, and so on, okay? 827 00:46:52,939 --> 00:46:56,910 So 828 00:46:56,910 --> 00:46:59,289 this would be 829 00:46:59,289 --> 00:47:01,939 one way of creating a feature vector 830 00:47:01,939 --> 00:47:03,619 to represent a 831 00:47:03,619 --> 00:47:05,739 piece of email. 832 00:47:05,739 --> 00:47:09,900 Now, let's throw 833 00:47:09,900 --> 00:47:16,900 the generative model out for this. Actually, 834 00:47:18,059 --> 00:47:19,289 let's use 835 00:47:19,289 --> 00:47:20,849 this. 836 00:47:20,849 --> 00:47:24,400 In other words, I want to model P of X given Y. The 837 00:47:24,400 --> 00:47:27,959 given Y = 0 or Y = 1, all right? 838 00:47:27,959 --> 00:47:31,390 And my feature vectors are going to be 0, 1 839 00:47:31,390 --> 00:47:35,709 to the N. It's going to be these split vectors, binary value vectors. They're N 840 00:47:35,709 --> 00:47:36,400 dimensional. 841 00:47:36,400 --> 00:47:38,130 Where N 842 00:47:38,130 --> 00:47:38,830 may 843 00:47:38,830 --> 00:47:41,950 be on the order of, say, 50,000, if you have 50,000 844 00:47:41,950 --> 00:47:44,089 words in your dictionary, 845 00:47:44,089 --> 00:47:46,110 which is not atypical. So 846 00:47:46,110 --> 00:47:47,459 values from - I don't 847 00:47:47,459 --> 00:47:48,449 know, 848 00:47:48,449 --> 00:47:52,219 mid-thousands to tens of thousands is very typical 849 00:47:52,219 --> 00:47:53,939 for problems like 850 00:47:53,939 --> 00:47:56,620 these. And, therefore, 851 00:47:56,620 --> 00:48:00,340 there two to the 50,000 possible values for X, right? So two to 852 00:48:00,340 --> 00:48:01,500 50,000 853 00:48:01,500 --> 00:48:02,729 possible bit vectors 854 00:48:02,729 --> 00:48:03,819 of length 855 00:48:03,819 --> 00:48:05,199 856 00:48:05,199 --> 00:48:07,239 50,000, and so 857 00:48:07,239 --> 00:48:08,829 one way to model this is 858 00:48:08,829 --> 00:48:10,749 the multinomial distribution, 859 00:48:10,749 --> 00:48:13,900 but because there are two to the 50,000 possible values for X, 860 00:48:13,900 --> 00:48:20,339 I would need two to the 50,000, but maybe -1 parameters, 861 00:48:20,339 --> 00:48:21,059 right? Because you have 862 00:48:21,059 --> 00:48:23,699 this sum to 1, right? So 863 00:48:23,699 --> 00:48:27,089 -1. And this is clearly way too many parameters 864 00:48:27,089 --> 00:48:28,479 to model 865 00:48:28,479 --> 00:48:30,269 using the multinomial distribution 866 00:48:30,269 --> 00:48:34,439 over all two to 50,000 possibilities. 867 00:48:34,439 --> 00:48:35,839 So 868 00:48:35,839 --> 00:48:42,289 in a Naive Bayes algorithm, we're 869 00:48:42,289 --> 00:48:46,579 going to make a very strong assumption on P of X given Y, 870 00:48:46,579 --> 00:48:49,369 and, in particular, I'm going to assume - let 871 00:48:49,369 --> 00:48:53,609 me just say what it's called; then I'll write out what it means. I'm going to assume that the 872 00:48:53,609 --> 00:48:54,630 XI's 873 00:48:54,630 --> 00:49:01,630 are conditionally independent 874 00:49:06,489 --> 00:49:07,950 given Y, okay? 875 00:49:07,950 --> 00:49:11,209 Let me say what this means. 
876 00:49:11,209 --> 00:49:18,209 So I have that P of X1, X2, up to X50,000, 877 00:49:18,729 --> 00:49:20,559 right, given the 878 00:49:20,559 --> 00:49:25,910 Y. By the key rule of probability, this is P of X1 given Y 879 00:49:25,910 --> 00:49:27,759 times P of X2 880 00:49:27,759 --> 00:49:30,369 given 881 00:49:30,369 --> 00:49:32,469 882 00:49:32,469 --> 00:49:33,989 Y, 883 00:49:33,989 --> 00:49:39,409 X1 884 00:49:39,409 --> 00:49:42,919 times PF - I'll just put dot, dot, dot. I'll just write 1, 1 × dot, dot, dot up to, you know, well - 885 00:49:42,919 --> 00:49:46,369 whatever. You get the idea, up to P of X50,000, okay? 886 00:49:46,369 --> 00:49:49,529 So this is the chain were of probability. This always holds. I've not 887 00:49:49,529 --> 00:49:52,849 made any assumption yet, and now, we're 888 00:49:52,849 --> 00:49:54,529 gonna 889 00:49:54,529 --> 00:49:58,089 meet what's called the Naive Bayes assumption, or this assumption that X 890 00:49:58,089 --> 00:50:01,869 defies a conditionally independent given Y. Going 891 00:50:01,869 --> 00:50:03,709 to assume that - 892 00:50:03,709 --> 00:50:06,619 well, nothing changes for the first term, 893 00:50:06,619 --> 00:50:11,719 but I'm gonna assume that P of X3 given Y, X1 is equal to P of X2 given the Y. I'm gonna assume that that term's equal to P of X3 894 00:50:11,719 --> 00:50:13,359 given 895 00:50:13,359 --> 00:50:16,979 the 896 00:50:16,979 --> 00:50:17,760 Y, 897 00:50:17,760 --> 00:50:20,099 and so on, up 898 00:50:20,099 --> 00:50:24,489 to P of X50,000 given Y, okay? 899 00:50:24,489 --> 00:50:28,849 Or just written more compactly, 900 00:50:28,849 --> 00:50:32,119 901 00:50:32,119 --> 00:50:35,989 means assume that P of X1, P of X50,000 given Y is 902 00:50:35,989 --> 00:50:41,150 the product from I = 1 to 50,000 or P of XI 903 00:50:41,150 --> 00:50:44,170 given the Y, 904 00:50:44,170 --> 00:50:45,299 okay? 905 00:50:45,299 --> 00:50:49,129 And stating informally what this means is that I'm, sort of, assuming that - 906 00:50:49,129 --> 00:50:53,009 so unless you know the cost label Y, so long as you know whether this is spam or not 907 00:50:53,009 --> 00:50:54,959 spam, 908 00:50:54,959 --> 00:50:58,579 then knowing whether the word A appears in email 909 00:50:58,579 --> 00:50:59,809 does not affect 910 00:50:59,809 --> 00:51:01,359 the probability 911 00:51:01,359 --> 00:51:02,849 of whether the word 912 00:51:02,849 --> 00:51:06,829 Ausworth appears in the email, all right? 913 00:51:06,829 --> 00:51:10,699 And, in other words, there's assuming - once you know whether an email is spam 914 00:51:10,699 --> 00:51:12,619 or not spam, 915 00:51:12,619 --> 00:51:15,519 then knowing whether other words appear in the email won't help 916 00:51:15,519 --> 00:51:19,049 you predict whether any other word appears in the email, okay? 917 00:51:19,049 --> 00:51:20,289 And, 918 00:51:20,289 --> 00:51:23,489 obviously, this assumption is false, right? This 919 00:51:23,489 --> 00:51:25,979 assumption can't possibly be 920 00:51:25,979 --> 00:51:28,190 true. I mean, if you see the word 921 00:51:28,190 --> 00:51:31,410 - I don't know, CS229 in an email, you're much more likely to see my name in the email, or 922 00:51:31,410 --> 00:51:37,050 the TA's names, or whatever. 
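To keep the algebra straight on the page, here is the factorization just derived, written with n = 50,000 features: the chain rule first (which always holds, with no assumptions), and then the Naive Bayes conditional-independence assumption:

```latex
% Chain rule of probability (always true, no assumptions yet):
\begin{align*}
p(x_1, \dots, x_{50000} \mid y)
  &= p(x_1 \mid y)\, p(x_2 \mid y, x_1)\, p(x_3 \mid y, x_1, x_2)
     \cdots p(x_{50000} \mid y, x_1, \dots, x_{49999}) \\
% Naive Bayes assumption: the x_i are conditionally independent given y,
% so each factor drops its dependence on the earlier x's:
  &= p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_{50000} \mid y)
   = \prod_{i=1}^{50000} p(x_i \mid y).
\end{align*}
```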
So this assumption is normally just false 923 00:51:37,050 --> 00:51:37,469 924 00:51:37,469 --> 00:51:39,349 under English, right, 925 00:51:39,349 --> 00:51:41,239 for normal written English, 926 00:51:41,239 --> 00:51:42,519 but it 927 00:51:42,519 --> 00:51:44,269 turns out that despite 928 00:51:44,269 --> 00:51:46,599 this assumption being, sort of, 929 00:51:46,599 --> 00:51:48,180 false in the literal sense, 930 00:51:48,180 --> 00:51:50,829 the Naive Bayes algorithm is, sort of, 931 00:51:50,829 --> 00:51:53,469 an extremely effective 932 00:51:53,469 --> 00:51:57,869 algorithm for classifying text documents into spam or not spam, for 933 00:51:57,869 --> 00:52:01,519 classifying your emails into different emails for your automatic view, for 934 00:52:01,519 --> 00:52:03,589 looking at web pages and classifying 935 00:52:03,589 --> 00:52:06,290 whether this webpage is trying to sell something or whatever. It 936 00:52:06,290 --> 00:52:07,959 turns out, this assumption 937 00:52:07,959 --> 00:52:11,819 works very well for classifying text documents and for other applications too that I'll 938 00:52:11,819 --> 00:52:16,419 talk a bit about later. 939 00:52:16,419 --> 00:52:20,689 As a digression that'll make sense only to some of you. 940 00:52:20,689 --> 00:52:23,019 Let me just say that 941 00:52:23,019 --> 00:52:27,959 if you're familiar with Bayesian X world, say 942 00:52:27,959 --> 00:52:29,839 943 00:52:29,839 --> 00:52:33,999 graphical models, the Bayesian network associated with this model looks like this, and you're assuming 944 00:52:33,999 --> 00:52:34,509 that 945 00:52:34,509 --> 00:52:36,769 this is random variable Y 946 00:52:36,769 --> 00:52:38,779 that then generates X1, X2, through 947 00:52:38,779 --> 00:52:41,089 X50,000, okay? If you've not seen the 948 00:52:41,089 --> 00:52:43,069 Bayes Net before, if 949 00:52:43,069 --> 00:52:47,549 you don't know your graphical model, just ignore this. It's not important to our purposes, but 950 00:52:47,549 --> 00:52:54,549 if you've seen it before, that's what it will look like. Okay. 951 00:52:58,019 --> 00:53:03,349 So 952 00:53:03,349 --> 00:53:10,349 953 00:53:11,660 --> 00:53:16,239 the parameters of the model 954 00:53:16,239 --> 00:53:18,069 are as follows 955 00:53:18,069 --> 00:53:21,659 with phi FI given Y = 1, which 956 00:53:21,659 --> 00:53:23,189 957 00:53:23,189 --> 00:53:28,019 is probably FX = 1 or XI = 1 958 00:53:28,019 --> 00:53:28,839 given Y = 1, 959 00:53:28,839 --> 00:53:35,529 phi I 960 00:53:35,529 --> 00:53:39,959 given Y = 0, and phi Y, okay? 961 00:53:39,959 --> 00:53:43,459 So these are the parameters of the model, 962 00:53:43,459 --> 00:53:45,419 963 00:53:45,419 --> 00:53:47,249 and, therefore, 964 00:53:47,249 --> 00:53:50,139 to fit the parameters of the model, you 965 00:53:50,139 --> 00:53:57,139 can write down the joint likelihood, right, 966 00:54:01,849 --> 00:54:07,329 is 967 00:54:07,329 --> 00:54:14,329 equal to, as usual, okay? 968 00:54:18,670 --> 00:54:20,779 So given the training sets, 969 00:54:20,779 --> 00:54:27,779 you can write down the joint 970 00:54:28,979 --> 00:54:32,579 likelihood of the parameters, and 971 00:54:32,579 --> 00:54:39,579 then when 972 00:54:43,729 --> 00:54:44,640 you 973 00:54:44,640 --> 00:54:46,869 do maximum likelihood estimation, 974 00:54:46,869 --> 00:54:50,860 you find that the maximum likelihood estimate of the parameters are 975 00:54:50,860 --> 00:54:53,239 - they're really, pretty much, what you'd expect. 
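For reference, the parameters, the joint likelihood, and the resulting maximum-likelihood estimates just alluded to can be written as follows; this is a sketch in the course's usual notation, with m training examples (x^(i), y^(i)):

```latex
% Parameters of the Naive Bayes model:
%   phi_{i|y=1} = p(x_i = 1 | y = 1),  phi_{i|y=0} = p(x_i = 1 | y = 0),  phi_y = p(y = 1).
\begin{align*}
\mathcal{L}(\phi_y, \phi_{j|y=1}, \phi_{j|y=0})
  &= \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
   = \prod_{i=1}^{m} p\big(x^{(i)} \mid y^{(i)}\big)\, p\big(y^{(i)}\big), \\
\phi_{j|y=1}
  &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}
          {\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}, \qquad
\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}.
\end{align*}
```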
976 00:54:53,239 --> 00:54:55,630 Maximum likelihood estimate for phi J given Y = 1 is 977 00:54:55,630 --> 00:54:57,880 sum from I = 1 to 978 00:54:57,880 --> 00:55:01,759 M, 979 00:55:01,759 --> 00:55:03,459 indicator 980 00:55:03,459 --> 00:55:07,129 XIJ = 981 00:55:07,129 --> 00:55:14,129 1, YI = 1, okay? 982 00:55:14,699 --> 00:55:18,479 983 00:55:18,479 --> 00:55:19,359 984 00:55:19,359 --> 00:55:20,719 And this is just a, 985 00:55:20,719 --> 00:55:22,829 I guess, stated more simply, 986 00:55:22,829 --> 00:55:24,969 the numerator just says, Run for 987 00:55:24,969 --> 00:55:27,979 your entire training set, some [inaudible] examples, 988 00:55:27,979 --> 00:55:30,919 and count up the number of times you saw word Jay 989 00:55:30,919 --> 00:55:32,549 in a piece of email 990 00:55:32,549 --> 00:55:34,649 for which the label Y was equal to 1. So, in other words, look 991 00:55:34,649 --> 00:55:36,800 through all your spam emails 992 00:55:36,800 --> 00:55:39,320 and count the number of emails in which the word 993 00:55:39,320 --> 00:55:40,559 Jay appeared out of 994 00:55:40,559 --> 00:55:42,420 all your spam emails, 995 00:55:42,420 --> 00:55:44,710 and the denominator is, you know, 996 00:55:44,710 --> 00:55:45,989 sum from I = 1 to M, 997 00:55:45,989 --> 00:55:47,949 the number of spam. The 998 00:55:47,949 --> 00:55:51,439 denominator is just the number of spam emails you got. 999 00:55:51,439 --> 00:55:54,329 And so this ratio is 1000 00:55:54,329 --> 00:55:57,769 in all your spam emails in your training set, 1001 00:55:57,769 --> 00:56:00,399 what fraction of these emails 1002 00:56:00,399 --> 00:56:03,039 did the word Jay 1003 00:56:03,039 --> 00:56:03,659 appear in - 1004 00:56:03,659 --> 00:56:06,640 did the, Jay you wrote in your dictionary appear in? 1005 00:56:06,640 --> 00:56:10,239 And that's the maximum likelihood estimate 1006 00:56:10,239 --> 00:56:13,819 for the probability of seeing the word Jay conditions on the piece of email being spam, okay? And similar to your 1007 00:56:13,819 --> 00:56:17,499 1008 00:56:17,499 --> 00:56:20,499 1009 00:56:20,499 --> 00:56:23,160 maximum likelihood estimate for phi 1010 00:56:23,160 --> 00:56:24,999 Y 1011 00:56:24,999 --> 00:56:28,880 is pretty much what you'd expect, right? 1012 00:56:28,880 --> 00:56:35,880 Okay? 1013 00:56:41,409 --> 00:56:43,389 And so 1014 00:56:43,389 --> 00:56:47,669 having estimated all these parameters, 1015 00:56:47,669 --> 00:56:50,960 when you're given a new piece of email that you want to classify, 1016 00:56:50,960 --> 00:56:53,430 you can then compute P of Y given X 1017 00:56:53,430 --> 00:56:56,079 using Bayes rule, right? 1018 00:56:56,079 --> 00:56:57,580 Same as before because 1019 00:56:57,580 --> 00:57:04,229 together these parameters gives you a model for P of X given Y and for P of Y, 1020 00:57:04,229 --> 00:57:07,849 and by using Bayes rule, given these two terms, you can compute 1021 00:57:07,849 --> 00:57:10,709 P of X given Y, and 1022 00:57:10,709 --> 00:57:15,650 there's your spam classifier, okay? 1023 00:57:15,650 --> 00:57:18,999 Turns out we need one more elaboration to this idea, but let me check if there are 1024 00:57:18,999 --> 00:57:25,109 questions about this so far. 1025 00:57:25,109 --> 00:57:29,269 Student:So does this model depend 1026 00:57:29,269 --> 00:57:29,659 on 1027 00:57:29,659 --> 00:57:34,380 the number of inputs? 
Instructor (Andrew Ng):What do 1028 00:57:34,380 --> 00:57:35,149 you 1029 00:57:35,149 --> 00:57:38,169 mean, number of inputs, the number of features? Student:No, number of samples. Instructor (Andrew Ng):Well, N is the number of training examples, so this 1030 00:57:38,169 --> 00:57:45,169 given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? So other questions, does it make 1031 00:57:48,959 --> 00:57:52,629 sense? Or M is the number of training examples, so when you have M training examples, you plug them 1032 00:57:52,629 --> 00:57:53,999 into this formula, 1033 00:57:53,999 --> 00:58:00,999 and that's how you compute the maximum likelihood estimates. Student:Is training examples you mean M is the 1034 00:58:01,289 --> 00:58:04,089 number of emails? Instructor (Andrew Ng):Yeah, right. So, right. 1035 00:58:04,089 --> 00:58:07,130 So it's, kind of, your training set. I would go through all the email I've gotten 1036 00:58:07,130 --> 00:58:08,869 in the last two months 1037 00:58:08,869 --> 00:58:11,129 and label them as spam or not spam, 1038 00:58:11,129 --> 00:58:14,450 and so you have - I don't 1039 00:58:14,450 --> 00:58:16,809 know, like, a few hundred emails 1040 00:58:16,809 --> 00:58:18,630 labeled as spam or not spam, 1041 00:58:18,630 --> 00:58:25,630 and that will comprise your training sets for X1 and Y1 through XM, 1042 00:58:28,049 --> 00:58:28,620 YM, 1043 00:58:28,620 --> 00:58:32,820 where X is one of those vectors representing which words appeared in the email and Y 1044 00:58:32,820 --> 00:58:39,820 is 0, 1 depending on whether they equal spam or not spam, okay? Student:So you are saying that this model depends on the number 1045 00:58:41,309 --> 00:58:42,019 1046 00:58:42,019 --> 00:58:49,019 of examples, but the last model doesn't depend on the models, but your phi is the 1047 00:58:49,829 --> 00:58:53,630 same for either one. Instructor (Andrew Ng):They're 1048 00:58:53,630 --> 00:58:58,019 different things, right? There's the model which is 1049 00:58:58,019 --> 00:59:00,659 - the modeling assumptions aren't made very well. 1050 00:59:00,659 --> 00:59:04,069 I'm assuming that - I'm making the Naive Bayes assumption. 1051 00:59:04,069 --> 00:59:08,900 So the probabilistic model is an assumption on the joint distribution 1052 00:59:08,900 --> 00:59:10,319 of X and Y. 1053 00:59:10,319 --> 00:59:12,390 That's what the model is, 1054 00:59:12,390 --> 00:59:16,650 and then I'm given a fixed number of training examples. I'm given M training examples, and 1055 00:59:16,650 --> 00:59:20,309 then it's, like, after I'm given the training sets, I'll then go in to write the maximum 1056 00:59:20,309 --> 00:59:22,299 likelihood estimate of the parameters, right? So that's, sort 1057 00:59:22,299 --> 00:59:24,529 of, 1058 00:59:24,529 --> 00:59:31,529 maybe we should take that offline for - yeah, ask a question? Student:Then how would you do this, like, 1059 00:59:38,389 --> 00:59:41,079 if this [inaudible] didn't work? Instructor (Andrew Ng):Say that again. Student:How would you do it, say, like the 50,000 words - Instructor (Andrew Ng):Oh, okay. How to do this with the 50,000 words, yeah. So 1060 00:59:41,079 --> 00:59:42,360 it turns out 1061 00:59:42,360 --> 00:59:45,859 this is, sort of, a very practical question, really. How do I count this list of 1062 00:59:45,859 --> 00:59:49,629 words? 
One common way to do this is to actually 1063 00:59:49,629 --> 00:59:52,899 find some way to count a list of words, like go through all your emails, go through 1064 00:59:52,899 --> 00:59:53,909 all the - 1065 00:59:53,909 --> 00:59:56,930 in practice, one common way to count a list of words 1066 00:59:56,930 --> 01:00:00,339 is to just take all the words that appear in your training set. That's one fairly common way 1067 01:00:00,339 --> 01:00:02,190 to do it, 1068 01:00:02,190 --> 01:00:06,130 or if that turns out to be too many words, you can take all words that appear 1069 01:00:06,130 --> 01:00:07,030 at least 1070 01:00:07,030 --> 01:00:08,080 three times 1071 01:00:08,080 --> 01:00:09,349 in your training set. So 1072 01:00:09,349 --> 01:00:10,659 words that 1073 01:00:10,659 --> 01:00:14,749 you didn't even see three times in the emails you got in the last 1074 01:00:14,749 --> 01:00:17,210 two months, you discard. So those are - I 1075 01:00:17,210 --> 01:00:20,259 was talking about going through a dictionary, which is a nice way of thinking about it, but in 1076 01:00:20,259 --> 01:00:22,049 practice, you might go through 1077 01:00:22,049 --> 01:00:26,409 your training set and then just take the union of all the words that appear in 1078 01:00:26,409 --> 01:00:29,759 it. In some of the tests I've even, by the way, said select these features, but this is one 1079 01:00:29,759 --> 01:00:32,309 way to think about 1080 01:00:32,309 --> 01:00:34,139 creating your feature vector, 1081 01:00:34,139 --> 01:00:41,139 right, as zero and one values, okay? Moving on, yeah. Okay. Ask a question? Student:I'm getting, kind of, confused on how you compute all those parameters. Instructor (Andrew Ng):On 1082 01:00:49,209 --> 01:00:51,609 how I came up with the parameters? 1083 01:00:51,609 --> 01:00:53,929 Student:Correct. Instructor (Andrew Ng):Let's see. 1084 01:00:53,929 --> 01:00:58,249 So in Naive Bayes, what I need to do - the question was how did I come up with the parameters, right? 1085 01:00:58,249 --> 01:00:59,279 In Naive Bayes, 1086 01:00:59,279 --> 01:01:01,499 I need to build a model 1087 01:01:01,499 --> 01:01:03,549 for P of X given Y and for 1088 01:01:03,549 --> 01:01:05,359 P of Y, 1089 01:01:05,359 --> 01:01:09,749 right? So this is, I mean, in generous of learning algorithms, I need to come up with 1090 01:01:09,749 --> 01:01:11,079 models for these. 1091 01:01:11,079 --> 01:01:15,359 So how'd I model P of Y? Well, I just those to model it using a Bernoulli 1092 01:01:15,359 --> 01:01:16,389 distribution, 1093 01:01:16,389 --> 01:01:17,150 and so 1094 01:01:17,150 --> 01:01:19,709 P of Y will be 1095 01:01:19,709 --> 01:01:22,109 parameterized by that, all right? Student:Okay. 1096 01:01:22,109 --> 01:01:26,519 Instructor (Andrew Ng):And then how'd I model P of X given Y? Well, 1097 01:01:26,519 --> 01:01:28,529 let's keep changing bullets. 1098 01:01:28,529 --> 01:01:32,729 My model for P of X given Y under the Naive Bayes assumption, I assume 1099 01:01:32,729 --> 01:01:34,539 that P of X given Y 1100 01:01:34,539 --> 01:01:37,549 is the product of these probabilities, 1101 01:01:37,549 --> 01:01:40,900 and so I'm going to need parameters to tell me 1102 01:01:40,900 --> 01:01:43,509 what's the probability of each word occurring, 1103 01:01:43,509 --> 01:01:46,309 you know, of each word occurring or not occurring, 1104 01:01:46,309 --> 01:01:53,309 conditions on the email being spam or not spam email, okay? Student:How is that 1105 01:01:54,639 --> 01:01:58,829 Bernoulli? 
Instructor (Andrew Ng):Oh, because X is either zero or one, right? By the way I defined the feature 1106 01:01:58,829 --> 01:02:00,799 vectors, XI 1107 01:02:00,799 --> 01:02:05,499 is either one or zero, depending on whether words I appear as in the email, 1108 01:02:05,499 --> 01:02:08,859 right? So by the way I define the 1109 01:02:08,859 --> 01:02:11,419 feature vectors, XI - 1110 01:02:11,419 --> 01:02:16,639 the XI is always zero or one. So that by definition, if XI, you know, is either zero or 1111 01:02:16,639 --> 01:02:19,969 one, then it has to be a Bernoulli distribution, right? 1112 01:02:19,969 --> 01:02:22,419 If XI would continue as then 1113 01:02:22,419 --> 01:02:24,819 you might model this as Gaussian and say you end up 1114 01:02:24,819 --> 01:02:27,299 like we did in Gaussian discriminant analysis. It's 1115 01:02:27,299 --> 01:02:30,719 just that the way I constructed my features for email, XI is always binary 1116 01:02:30,719 --> 01:02:32,129 value, and so you end up 1117 01:02:32,129 --> 01:02:32,990 with a 1118 01:02:32,990 --> 01:02:36,059 Bernoulli here, okay? All right. I 1119 01:02:36,059 --> 01:02:40,389 should move on. So 1120 01:02:40,389 --> 01:02:42,389 it turns out that 1121 01:02:42,389 --> 01:02:44,339 this idea 1122 01:02:44,339 --> 01:02:46,369 almost works. 1123 01:02:46,369 --> 01:02:47,719 Now, here's the problem. 1124 01:02:47,719 --> 01:02:49,959 So let's say you 1125 01:02:49,959 --> 01:02:54,439 complete this class and you start to do, maybe do the class project, and you 1126 01:02:54,439 --> 01:02:56,760 keep working on your class project for a bit, and it 1127 01:02:56,760 --> 01:03:01,439 becomes really good, and you want to submit your class project to a conference, right? So, 1128 01:03:01,439 --> 01:03:03,479 you know, around - I don't know, 1129 01:03:03,479 --> 01:03:08,039 June every year is the conference deadline for the next conference. 1130 01:03:08,039 --> 01:03:10,889 It's just the name of the conference; it's an acronym. 1131 01:03:10,889 --> 01:03:12,359 And so maybe 1132 01:03:12,359 --> 01:03:16,080 you send your project partners or senior friends even, and say, Hey, let's 1133 01:03:16,080 --> 01:03:20,069 work on a project and submit it to the NIPS conference. And so you're getting these emails 1134 01:03:20,069 --> 01:03:21,489 with the word NIPS in them, 1135 01:03:21,489 --> 01:03:24,329 which you've probably never seen before, 1136 01:03:24,329 --> 01:03:28,179 and so a 1137 01:03:28,179 --> 01:03:32,519 piece of email comes from your project partner, and so you 1138 01:03:32,519 --> 01:03:35,639 go, Let's send a paper to the NIPS conference. 1139 01:03:35,639 --> 01:03:37,310 And then your stamp classifier 1140 01:03:37,310 --> 01:03:38,759 will say 1141 01:03:38,759 --> 01:03:39,930 P of X - 1142 01:03:39,930 --> 01:03:44,630 let's say NIPS is the 30,000th word in your dictionary, okay? 1143 01:03:44,630 --> 01:03:46,669 So X30,000 given 1144 01:03:46,669 --> 01:03:50,189 the 1, given 1145 01:03:50,189 --> 01:03:51,619 Y = 1146 01:03:51,619 --> 01:03:52,799 1 1147 01:03:52,799 --> 01:03:54,029 will be equal to 0. 1148 01:03:54,029 --> 01:03:57,479 That's the maximum likelihood of this, right? 
Because you've never seen the word NIPS before in 1149 01:03:57,479 --> 01:04:01,199 your training set, so maximum likelihood of the parameter is that probably have seen the word 1150 01:04:01,199 --> 01:04:02,459 NIPS is zero, 1151 01:04:02,459 --> 01:04:06,789 and, similarly, 1152 01:04:06,789 --> 01:04:08,379 you 1153 01:04:08,379 --> 01:04:12,459 know, in, I guess, non-spam mail, the chance of seeing the word NIPS is also 1154 01:04:12,459 --> 01:04:14,729 estimated 1155 01:04:14,729 --> 01:04:21,209 as zero. 1156 01:04:21,209 --> 01:04:28,209 So 1157 01:04:34,789 --> 01:04:40,709 when your spam classifier goes to compute P of Y = 1 given X, it will 1158 01:04:40,709 --> 01:04:44,529 compute this 1159 01:04:44,529 --> 01:04:47,209 right here P of Y 1160 01:04:47,209 --> 01:04:54,209 over - well, 1161 01:04:56,179 --> 01:05:03,179 all 1162 01:05:05,089 --> 01:05:09,509 right. 1163 01:05:09,509 --> 01:05:11,329 And so 1164 01:05:11,329 --> 01:05:15,759 you look at that terms, say, this will be product from I = 1165 01:05:15,759 --> 01:05:18,379 1 to 50,000, 1166 01:05:18,379 --> 01:05:22,359 P of XI given Y, 1167 01:05:22,359 --> 01:05:26,139 and one of those probabilities will be equal to 1168 01:05:26,139 --> 01:05:28,169 1169 01:05:28,169 --> 01:05:32,339 zero because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a 1170 01:05:32,339 --> 01:05:36,699 zero in this product, and so the numerator is zero, 1171 01:05:36,699 --> 01:05:40,249 and in the same way, it turns out the denominator will also be zero, and so you end 1172 01:05:40,249 --> 01:05:42,719 up with - 1173 01:05:42,719 --> 01:05:45,839 actually all of these terms end up being zero. So you end up with P of Y = 1 1174 01:05:45,839 --> 01:05:48,779 given X is 0 over 0 + 0, okay, which is 1175 01:05:48,779 --> 01:05:51,410 undefined. And the 1176 01:05:51,410 --> 01:05:53,959 problem with this is that it's 1177 01:05:53,959 --> 01:05:57,240 just statistically a bad idea 1178 01:05:57,240 --> 01:06:00,209 to say that P of X30,000 1179 01:06:00,209 --> 01:06:02,950 given Y is 1180 01:06:02,950 --> 01:06:03,490 0, 1181 01:06:03,490 --> 01:06:06,270 right? Just because you haven't seen the word NIPS in your last 1182 01:06:06,270 --> 01:06:10,420 two months worth of email, it's also statistically not sound to say that, 1183 01:06:10,420 --> 01:06:16,319 therefore, the chance of ever seeing this word is zero, right? 1184 01:06:16,319 --> 01:06:19,170 And so 1185 01:06:19,170 --> 01:06:20,479 1186 01:06:20,479 --> 01:06:22,719 1187 01:06:22,719 --> 01:06:25,819 is this idea that just because you haven't seen something 1188 01:06:25,819 --> 01:06:30,400 before, that may mean that that event is unlikely, but it doesn't mean that 1189 01:06:30,400 --> 01:06:33,920 it's impossible, and just saying that if you've never seen the word NIPS before, 1190 01:06:33,920 --> 01:06:40,759 then it is impossible to ever see the word NIPS in future emails; the chance of that is just zero. 
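As a toy numerical illustration of this failure mode, with entirely made-up estimates for a four-word vocabulary, the posterior literally evaluates to 0/0 once one word has a zero estimate in both classes:

```python
# An unsmoothed estimate assigns probability 0 to a word never seen in
# training, and the Bayes-rule posterior collapses to 0/0.
# Hypothetical per-word estimates phi_{j|y}; the last word ("nips") never
# appeared in either class, so both of its estimates are 0.
phi_given_spam     = [0.30, 0.10, 0.05, 0.0]
phi_given_not_spam = [0.05, 0.20, 0.10, 0.0]
phi_y = 0.4  # made-up estimate of p(y = 1)

def class_likelihood(x, phi):
    """p(x | y) under Naive Bayes: a product of Bernoulli terms."""
    p = 1.0
    for xi, phij in zip(x, phi):
        p *= phij if xi == 1 else (1.0 - phij)
    return p

x = [1, 0, 0, 1]  # the new email contains word 0 and the unseen word 3
num = class_likelihood(x, phi_given_spam) * phi_y
den = num + class_likelihood(x, phi_given_not_spam) * (1 - phi_y)
print(num, den)                            # both 0.0
print(num / den if den else float("nan"))  # 0/0 -> undefined (nan)
```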
1191 01:06:40,759 --> 01:06:47,759 So we're gonna fix this, 1192 01:06:48,339 --> 01:06:50,079 and 1193 01:06:50,079 --> 01:06:54,119 to motivate the fix I'll talk about 1194 01:06:54,119 --> 01:06:58,939 - the example we're gonna use is let's say that you've been following the Stanford basketball 1195 01:06:58,939 --> 01:07:04,099 team for all of their away games, and been, sort of, tracking their wins and losses 1196 01:07:04,099 --> 01:07:08,169 to gather statistics, and, maybe - I don't know, form a betting pool about 1197 01:07:08,169 --> 01:07:11,059 whether they're likely to win or lose the next game, okay? 1198 01:07:11,059 --> 01:07:12,630 So 1199 01:07:12,630 --> 01:07:15,409 these are some of the statistics. 1200 01:07:15,409 --> 01:07:17,589 1201 01:07:17,589 --> 01:07:19,429 1202 01:07:19,429 --> 01:07:24,819 So on, I guess, the 8th of February 1203 01:07:24,819 --> 01:07:29,299 last season they played Washington State, and they 1204 01:07:29,299 --> 01:07:31,779 did not win. 1205 01:07:31,779 --> 01:07:36,239 On 1206 01:07:36,239 --> 01:07:38,209 the 11th of February, 1207 01:07:38,209 --> 01:07:42,349 they play Washington, 22nd 1208 01:07:42,349 --> 01:07:47,259 they played USC, 1209 01:07:47,259 --> 01:07:49,769 1210 01:07:49,769 --> 01:07:54,929 played UCLA, 1211 01:07:54,929 --> 01:07:57,630 played USC again, 1212 01:07:57,630 --> 01:08:03,739 1213 01:08:03,739 --> 01:08:05,819 and now you want to estimate 1214 01:08:05,819 --> 01:08:09,880 what's the chance that they'll win or lose against Louisville, right? 1215 01:08:09,880 --> 01:08:11,169 So 1216 01:08:11,169 --> 01:08:14,909 find the four guys last year or five times and they weren't good in their away games, but it 1217 01:08:14,909 --> 01:08:17,009 seems awfully harsh to say that - so it 1218 01:08:17,009 --> 01:08:24,009 seems awfully harsh to say there's zero chance that they'll 1219 01:08:37,079 --> 01:08:40,269 win in the last - in the 5th game. So here's the idea behind Laplace smoothing 1220 01:08:40,269 --> 01:08:42,629 which is 1221 01:08:42,629 --> 01:08:44,819 that we're estimate 1222 01:08:44,819 --> 01:08:48,219 the probably of Y being equal to one, right? 1223 01:08:48,219 --> 01:08:53,189 Normally, the maximum likelihood [inaudible] is the 1224 01:08:53,189 --> 01:08:56,059 number of ones 1225 01:08:56,059 --> 01:08:57,659 divided by 1226 01:08:57,659 --> 01:09:00,759 the number of zeros 1227 01:09:00,759 --> 01:09:05,350 plus the number of ones, okay? I 1228 01:09:05,350 --> 01:09:07,869 hope this informal notation makes sense, right? Knowing 1229 01:09:07,869 --> 01:09:11,349 the maximum likelihood estimate for, sort of, a win or loss for Bernoulli random 1230 01:09:11,349 --> 01:09:12,219 variable 1231 01:09:12,219 --> 01:09:14,629 is 1232 01:09:14,629 --> 01:09:16,969 just the number of ones you saw 1233 01:09:16,969 --> 01:09:20,859 divided by the total number of examples. So it's the number of zeros you saw plus the number of ones you saw. 
So in 1234 01:09:20,859 --> 01:09:22,859 the Laplace 1235 01:09:22,859 --> 01:09:24,170 Smoothing 1236 01:09:24,170 --> 01:09:25,799 we're going to 1237 01:09:25,799 --> 01:09:29,549 just take each of these terms, the number of ones and, sort of, add one to that, the number 1238 01:09:29,549 --> 01:09:31,830 of zeros and add one to that, the 1239 01:09:31,830 --> 01:09:34,129 number of ones and add one to that, 1240 01:09:34,129 --> 01:09:37,109 and so in our example, 1241 01:09:37,109 --> 01:09:38,299 instead of estimating 1242 01:09:38,299 --> 01:09:42,890 the probability of winning the next game to be 0 ÷ 1243 01:09:42,890 --> 01:09:45,210 5 + 0, 1244 01:09:45,210 --> 01:09:50,319 we'll add one to all of these counts, and so we say that the chance of 1245 01:09:50,319 --> 01:09:51,850 their 1246 01:09:51,850 --> 01:09:53,799 winning the next game is 1/7th, 1247 01:09:53,799 --> 01:09:55,179 okay? Which is 1248 01:09:55,179 --> 01:09:59,310 that having seen them lose, you know, five away games in a row, we aren't terribly - 1249 01:09:59,310 --> 01:10:02,510 we don't think it's terribly likely they'll win the next game, but at 1250 01:10:02,510 --> 01:10:05,979 least we're not saying it's impossible. 1251 01:10:05,979 --> 01:10:09,949 As a historical side note, the Laplace actually came up with the method. 1252 01:10:09,949 --> 01:10:14,320 It's called the Laplace smoothing after him. 1253 01:10:14,320 --> 01:10:18,399 1254 01:10:18,399 --> 01:10:21,919 When he was trying to estimate the probability that the sun will rise tomorrow, and his rationale 1255 01:10:21,919 --> 01:10:25,239 was in a lot of days now, we've seen the sun rise, 1256 01:10:25,239 --> 01:10:28,489 but that doesn't mean we can be absolutely certain the sun will rise tomorrow. 1257 01:10:28,489 --> 01:10:31,420 He was using this to estimate the probability that the sun will rise tomorrow. This is, kind of, 1258 01:10:31,420 --> 01:10:36,449 cool. So, 1259 01:10:36,449 --> 01:10:38,549 and more generally, 1260 01:10:38,549 --> 01:10:41,570 1261 01:10:41,570 --> 01:10:43,969 1262 01:10:43,969 --> 01:10:45,429 1263 01:10:45,429 --> 01:10:46,449 1264 01:10:46,449 --> 01:10:49,409 if Y 1265 01:10:49,409 --> 01:10:52,840 takes on 1266 01:10:52,840 --> 01:10:55,150 K possible of values, 1267 01:10:55,150 --> 01:10:59,479 if you're trying to estimate the parameter of the multinomial, then you estimate P of Y = 1. 1268 01:10:59,479 --> 01:11:01,729 Let's 1269 01:11:01,729 --> 01:11:05,900 see. 1270 01:11:05,900 --> 01:11:11,380 So the maximum likelihood estimate will be Sum from J = 1 to M, 1271 01:11:11,380 --> 01:11:16,499 indicator YI = J ÷ M, 1272 01:11:16,499 --> 01:11:19,379 right? 1273 01:11:19,379 --> 01:11:21,629 That's the maximum likelihood estimate 1274 01:11:21,629 --> 01:11:24,369 of a multinomial probability of Y 1275 01:11:24,369 --> 01:11:26,929 being equal to - oh, 1276 01:11:26,929 --> 01:11:29,229 excuse me, Y = J. All right. 1277 01:11:29,229 --> 01:11:33,110 That's the maximum likelihood estimate for the probability of Y = J, 1278 01:11:33,110 --> 01:11:36,890 and so when you apply Laplace smoothing to that, 1279 01:11:36,890 --> 01:11:39,639 you add one to the numerator, and 1280 01:11:39,639 --> 01:11:41,889 add K to the denominator, 1281 01:11:41,889 --> 01:11:48,889 if Y can take up K possible values, okay? 1282 01:12:06,469 --> 01:12:08,579 So for Naive Bayes, 1283 01:12:08,579 --> 01:12:14,349 what that gives us is - 1284 01:12:14,349 --> 01:12:21,349 shoot. 1285 01:12:38,639 --> 01:12:44,320 Right? 
So that was the maximum likelihood estimate, and what you 1286 01:12:44,320 --> 01:12:47,949 end up doing is adding one to the numerator and adding two to the denominator, 1287 01:12:47,949 --> 01:12:52,030 and this solves the problem of the zero probabilities, and when your friend sends 1288 01:12:52,030 --> 01:12:57,489 you email about the NIPS conference, 1289 01:12:57,489 --> 01:13:00,799 your spam filter will still be able to 1290 01:13:00,799 --> 01:13:07,799 make a meaningful prediction, all right? Okay. 1291 01:13:12,030 --> 01:13:14,829 Shoot. Any questions about this? Yeah? Student:So that's what doesn't makes sense because, for instance, if you take the 1292 01:13:14,829 --> 01:13:16,239 games on the right, it's liberal assumptions that the probability 1293 01:13:16,239 --> 01:13:23,239 of 1294 01:13:24,650 --> 01:13:27,960 winning is very close to zero, so, I mean, the prediction should 1295 01:13:27,960 --> 01:13:29,789 be equal to PF, 0. Instructor (Andrew Ng):Right. 1296 01:13:29,789 --> 01:13:35,959 I would say that 1297 01:13:35,959 --> 01:13:38,139 in this case the prediction 1298 01:13:38,139 --> 01:13:41,170 is 1/7th, right? We don't have a lot of - if you see somebody lose five games 1299 01:13:41,170 --> 01:13:44,289 in a row, you may not have a lot of faith in them, 1300 01:13:44,289 --> 01:13:48,520 but as an extreme example, suppose you saw them lose one game, 1301 01:13:48,520 --> 01:13:51,500 right? It's just not reasonable to say that the chances of winning the next game 1302 01:13:51,500 --> 01:13:52,860 is zero, but 1303 01:13:52,860 --> 01:13:58,900 that's what maximum likelihood 1304 01:13:58,900 --> 01:14:01,150 estimate 1305 01:14:01,150 --> 01:14:05,099 will say. Student:Yes. Instructor (Andrew Ng):And - 1306 01:14:05,099 --> 01:14:08,139 Student:In such a case anywhere the learning algorithm [inaudible] or - Instructor (Andrew Ng):So some questions of, you 1307 01:14:08,139 --> 01:14:11,309 know, given just five training examples, what's a reasonable estimate for the chance of 1308 01:14:11,309 --> 01:14:13,170 winning the next game, 1309 01:14:13,170 --> 01:14:14,119 and 1310 01:14:14,119 --> 01:14:18,460 1/7th is, I think, is actually pretty reasonable. It's less than 1/5th for instance. 1311 01:14:18,460 --> 01:14:21,119 We're saying the chances of winning the next game is less 1312 01:14:21,119 --> 01:14:25,069 than 1/5th. It turns out, under a certain set of assumptions I won't go into - under a certain set of 1313 01:14:25,069 --> 01:14:27,859 Bayesian assumptions about the prior and posterior, 1314 01:14:27,859 --> 01:14:31,280 this Laplace smoothing actually gives the optimal estimate, 1315 01:14:31,280 --> 01:14:33,459 in a certain sense I won't go into 1316 01:14:33,459 --> 01:14:37,009 of what's the chance of winning the next game, and so under a certain assumption 1317 01:14:37,009 --> 01:14:38,079 about the 1318 01:14:38,079 --> 01:14:40,679 Bayesian prior on the parameter. 1319 01:14:40,679 --> 01:14:44,260 So I don't know. It actually seems like a pretty reasonable assumption to me. 1320 01:14:44,260 --> 01:14:46,940 Although, I should say, it actually 1321 01:14:46,940 --> 01:14:50,059 turned out - 1322 01:14:50,059 --> 01:14:53,530 No, I'm just being mean. We actually are a pretty good basketball team, but I chose a 1323 01:14:53,530 --> 01:14:54,660 losing streak 1324 01:14:54,660 --> 01:14:58,769 because it's funnier that way. 1325 01:14:58,769 --> 01:15:05,769 Let's see. Shoot. 
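Putting the pieces together, here is a compact sketch, in Python, of the whole Bernoulli Naive Bayes classifier with the Laplace-smoothed estimates just described (add one to each numerator, two to each denominator); the function names and the tiny data set are illustrative assumptions:

```python
import math

def train_naive_bayes(X, y):
    """X: list of 0/1 feature vectors; y: list of 0/1 labels (both classes present).
    Returns (phi_y, phi_given_1, phi_given_0), with Laplace smoothing applied
    to the per-word estimates: add 1 to each count, 2 to each class total."""
    m, n = len(X), len(X[0])
    count_y1 = sum(y)
    count_y0 = m - count_y1
    phi_y = count_y1 / m  # plain maximum-likelihood estimate for p(y = 1)
    phi_given_1 = [(sum(X[i][j] for i in range(m) if y[i] == 1) + 1) / (count_y1 + 2)
                   for j in range(n)]
    phi_given_0 = [(sum(X[i][j] for i in range(m) if y[i] == 0) + 1) / (count_y0 + 2)
                   for j in range(n)]
    return phi_y, phi_given_1, phi_given_0

def predict_spam_probability(x, phi_y, phi_given_1, phi_given_0):
    """p(y = 1 | x) by Bayes' rule, computed in log space for stability."""
    log_p1, log_p0 = math.log(phi_y), math.log(1 - phi_y)
    for xi, p1, p0 in zip(x, phi_given_1, phi_given_0):
        log_p1 += math.log(p1 if xi else 1 - p1)
        log_p0 += math.log(p0 if xi else 1 - p0)
    mx = max(log_p1, log_p0)
    e1, e0 = math.exp(log_p1 - mx), math.exp(log_p0 - mx)
    return e1 / (e1 + e0)

# Tiny made-up training set: 3 words in the vocabulary, 2 spam and 2 non-spam emails.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
params = train_naive_bayes(X, y)
print(predict_spam_probability([1, 0, 0], *params))  # 0.75
```

Because of the smoothing, no per-word estimate is ever exactly zero or one, so the logarithms in the prediction step are always defined, even for words that never appeared in one of the two classes in the training set.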
Does someone want to - are there any other questions about 1326 01:15:07,489 --> 01:15:09,320 1327 01:15:09,320 --> 01:15:13,380 this? No? Okay. So there's more that I want to say about Naive Bayes, but 1328 01:15:13,380 --> 01:15:15,059 1329 01:15:15,059 --> 01:15:15,860 we'll do that in the next lecture. So let's wrap up here.