This presentation is delivered by the Stanford Center for Professional Development.

So welcome back, and what I want to do today is continue our discussion of the EM Algorithm. In particular, I want to talk about the EM formulation that we derived in the previous lecture and apply it to the mixture of Gaussians model, apply it to a different model — a mixture of naive Bayes model — and then the last part of today's lecture will be on the factor analysis algorithm, which will also use EM. And as part of that, we'll actually take a brief digression to talk a little bit about sort of useful properties of Gaussian distributions.

So just to recap where we are: in the previous lecture, I started to talk about unsupervised learning, which is the class of machine learning problems where you're given an unlabeled training set comprising m examples, right? And so the fact that there are no labels — that's what makes this an unsupervised learning problem.

So one problem that I talked about last time was: what if you're given a data set that looks like this, and you want to model the density P of X from which you think the data had been drawn? And so with a data set like this, maybe you think it was generated by a mixture of two Gaussians, and we started to talk about an algorithm for fitting a mixture of Gaussians model, all right? And so we said that we would model the density of X — P of X — as the sum over Z of P of X given Z, times P of Z, where this latent random variable, meaning this hidden random variable, Z indicates which of the two Gaussian distributions each of your data points came from. And so we have, you know, Z was multinomial with parameter phi, and X conditioned on Z coming from the j-th Gaussian was given by a Gaussian with mean mu_j and covariance sigma_j, all right?
So, like I said at the beginning of the previous lecture, I first talked about a very specific algorithm that I sort of pulled out of the air for fitting the parameters of this model — the parameters phi, mu and sigma — but then in the second half of the previous lecture I talked about what's called the EM Algorithm, in which our goal is maximum likelihood estimation of the parameters. So we want to maximize, in terms of theta, you know, the sort of usual log likelihood, parameterized by theta. And because we have a latent random variable Z, this is really maximizing, in terms of theta, the sum over i of the log of the sum over z_i of P of x_i, z_i, parameterized by theta. Okay?

So, using Jensen's inequality, last time we worked out the EM Algorithm, in which in the E step we would choose these probability distributions Q_i to be the posterior on Z given X, parameterized by theta, and in the M step we would set theta to be the value that maximizes this. Okay? So these are the updates we worked out last time.

And the cartoon that I drew was that you have this log likelihood function l of theta that's often hard to maximize, and what the E step does is choose these probability distributions Q_i. In the cartoon, what that corresponded to was finding a lower bound for the log likelihood, where the horizontal axis is theta. And then in the M step you maximize that lower bound, right? So maybe you were here previously, and so you jump to the new point, the new maximum of this lower bound. Okay? And so this little curve here, right — this lower bound function here — that's really the right-hand side of that argmax. Okay? So this whole thing is inside the argmax.
If you view this thing as a function of theta, this function of theta is a lower bound for the log likelihood of theta, and so in the M step we maximize this lower bound, and that corresponds to jumping to the new maximum of the lower bound.

So it turns out that in the EM Algorithm — so why do we even bother with the EM Algorithm? It turns out that very often, and this will be true for all the examples we see today, it turns out that very often in the EM Algorithm maximizing in the M step — so performing the maximization in the M step — will be tractable and can often be done analytically, in closed form. Whereas if you were trying to maximize this objective directly — if you take this formula on the right, this maximum likelihood objective, set its derivatives to zero and try to solve — you'll find that you're unable to obtain a solution in closed form to this maximization. Okay?

And so, to give you an example of that: you remember our discussion on exponential family models, right? It turns out that if X and Z are jointly, I guess, in the exponential family — so if P of X, Z parameterized by theta is an exponential family distribution, which turns out to be true for the mixture of Gaussians distribution — then it turns out that the M step here will be tractable, and the E step will also be tractable, and so you can do each of these steps very easily. Whereas performing — trying to perform — this original maximum likelihood estimation problem, on this one, right, will be computationally very difficult. You're going to set the derivatives to zero and try to solve for that, and analytically you won't be able to find an analytic solution to this. Okay?

So what I want to do in a second is actually take this view of the EM Algorithm and apply it to the mixture of Gaussians model.
I want to take these E steps and M steps and work them out for the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the EM Algorithm. It turns out there's one other way of thinking about the EM Algorithm, which is the following: I can define an optimization objective J of theta, Q — I define it to be this. This is just the thing inside the argmax in the M step. Okay?

And so what we proved using Jensen's inequality is that the log likelihood of theta is greater than or equal to J of theta, Q. So in other words, we proved last time that for any value of theta and Q, the log likelihood upper-bounds J of theta and Q. And so, just to relate this back to, sort of, yet more things that you already know, you can also think of this as coordinate ascent, right? Remember our discussion a while back on the coordinate ascent optimization algorithm. So we can show — and I won't actually show this here, so just take my word for it and work it out at home if you want — that EM is just coordinate ascent on the function J. So in the E step you maximize with respect to Q, and then in the M step you maximize with respect to theta. Okay?

So this is another view of the EM Algorithm that shows why it has to converge, for example: you can view it in terms of J of theta, Q having to monotonically increase on every iteration. Okay?

So what I want to do next is actually take this general EM machinery that we worked up and apply it to a mixture of Gaussians model. Before I do that, let me just check if there are questions about the EM Algorithm as a whole. Okay, cool. So let's go ahead and work on the mixture of Gaussians EM, all right? MOG — that's my abbreviation for mixture of Gaussians. So in the E step, recall, we choose those Q distributions, right?
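As a minimal sketch of that alternating maximization — with hypothetical `e_step`, `m_step`, and `log_likelihood` callables standing in for the model-specific computations worked out next — the overall loop might look like this:

```python
import numpy as np

def run_em(X, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iters=200):
    """Generic EM loop, viewed as coordinate ascent on J(theta, Q).
    The E step maximizes J over Q by setting Q_i to the posterior p(z | x_i; theta);
    the M step maximizes J over theta with Q held fixed."""
    prev_ll = -np.inf
    for _ in range(max_iters):
        Q = e_step(X, theta)            # E step: choose the Q distributions
        theta = m_step(X, Q)            # M step: maximize the lower bound over theta
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:          # monotone increase of the objective gives a stopping rule
            break
        prev_ll = ll
    return theta
```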
In particular, I want to work out — so Q is the probability distribution over the latent random variable Z, and in the E step I'm going to figure out how to compute — what is Q_i of z_i equals j. And you can think of this as my writing P of z_i equals j, right, under the Q distribution; that's what this notation means.

And so the EM Algorithm tells us that, let's see, Q_i of j is the posterior probability of Z taking the value j, given x_i and all your parameters. And so, well, the way you compute this is by Bayes' rule, right? So that is going to be equal to P of x_i given z_i equals j, times P of z_i equals j, divided by — the corresponding sum over all the values z_i can take — right? That's all by Bayes' rule. And so this you know, because x_i given z_i equals j — this was a Gaussian with mean mu_j and covariance sigma_j, and so to compute this first term you plug in the formula for the Gaussian density there, with parameters mu_j and sigma_j; and this you'd know because Z was multinomial, right, with parameters given by phi, and so the probability of z_i being j is just phi_j, and so you can substitute these terms in. Similarly, you do the same thing for the denominator, and that's how you work out what Q is. Okay? And so, as in the previous lecture, this value — the probability that z_i equals j under the Q distribution — is what I denoted as w_ij. So that would be the E step; as a concrete sketch of it, the computation looks roughly like the code below.
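Here the parameters are assumed to be stored as a length-k array `phi` of mixing proportions, a k-by-n array `mu` of means, and a length-k list `Sigma` of covariance matrices — names chosen just for this sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_e_step(X, phi, mu, Sigma):
    """E step for the mixture of Gaussians: w[i, j] = Q_i(z_i = j)
    = p(x_i | z_i = j; mu_j, Sigma_j) * phi_j / sum_l p(x_i | z_i = l) * phi_l  (Bayes' rule)."""
    m = X.shape[0]
    k = len(phi)
    w = np.zeros((m, k))
    for j in range(k):
        # numerator: Gaussian density for cluster j times the multinomial prior phi_j
        w[:, j] = multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) * phi[j]
    w /= w.sum(axis=1, keepdims=True)   # denominator: sum over all values z_i could take
    return w
```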
And then in the M step we maximize with respect to all of our parameters. This — well, I seem to be writing the same formula down a lot today. All right. And just so we're completely concrete about how you do that: if you do that — plugging in the quantities that you know — it becomes this, let's see. Right.

And so, so that we're completely concrete about what the M step is doing: in the M step, that was, I guess, Q_i of z_i being j, and in the summation, the sum over j is the sum over all the possible values of z_i, and then this thing here is my Gaussian density. Sorry, guys — this thing, well, this first term here, right, is my P of x_i given z_i, and that's P of z_i. Okay?

And so to maximize this with respect to — say you want to maximize this with respect to all of your parameters phi, mu and sigma. So to maximize with respect to the parameter mu, say, you would take the derivative with respect to mu and set that to zero, and if you actually do that computation you would get, for instance, that — that becomes your update to mu_j. Okay? Just so — the exact equation is unimportant; all of these equations are written down in the lecture notes. I'm writing these down just to be completely concrete about what the M step means. And so: write down that formula, plug in the densities you know, take the derivative, set it to zero, solve for mu_j, and in the same way you set the derivatives equal to zero and solve for your updates for your other parameters phi and sigma as well. Okay?

Well, let me just point out one little tricky bit of this that you haven't seen before — most of you probably already know it, but I'll just mention it — which is that since phi here is a multinomial distribution, when you take this formula and you maximize it with respect to phi, you actually have an additional constraint, right? That — let's see — the sum over j of phi_j must be equal to one.
All right? So, again, in the M step I want to take this thing and maximize it with respect to all the parameters, and when you maximize this with respect to the parameters phi_j, you need to respect the constraint that the sum over j of phi_j must be equal to one. And so, well, you already know how to do constrained maximization, right? We saw the method of Lagrange multipliers — and the generalized Lagrangian — when we talked about support vector machines. And so, to actually perform the maximization in terms of phi_j, you construct the Lagrangian, which is — all right? — that's the equation from above, which I'll denote with the dot dot dot, plus a multiplier times that, where this is sort of the Lagrange multiplier and this is your optimization objective. And so, to actually solve for the parameters phi_j, you set the derivatives of this Lagrangian to zero and solve. Okay? And if you then work through the math, you get the appropriate value to update the phi_j's to — which I won't do here, but all the full derivations are in the lecture notes.

Okay. And so if you actually perform all these computations, you can also verify that — so, I just wrote down a bunch of formulas for the EM Algorithm. At the beginning of the last lecture, for the mixture of Gaussians model, I said: for EM, here's the formula for computing the w_ij's, and here's a formula for computing the mu's, and so on, and this derivation is where all of those formulas actually come from. Okay? Questions about this? Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): Oh, I see. So it turns out that, yes, there's also a constraint that the phi_j's must be greater than zero. It turns out that if you want, you could actually write down the generalized Lagrangian incorporating all of these constraints as well, and you can solve subject to [inaudible] these constraints. It turns out that in this particular derivation — actually, it turns out that very often, when we find maximum likelihood estimates for multinomial distribution probabilities —
it turns out that if you ignore these constraints and you just maximize the formula, you luckily end up with values that actually are greater than or equal to zero. And so, since even ignoring those constraints you end up with parameters that are greater than or equal to zero, that shows that this must be the correct solution, because adding the constraint won't change anything. So this constraint, it turns out, is inactive — if you ignore it and just do what I wrote down, you actually get the right answer. Okay? Great.

So let me just very quickly talk about one more example of a mixture model. And the perfect example for this is: imagine you want to do text clustering, right? So someone gives you a large set of documents and you want to cluster them together into cohesive topics. I think I mentioned the news website news.google.com; that's one application of text clustering, where you might want to look at all of the news stories about today — all the news stories written by everyone, written by all the online news websites, about whatever happened yesterday — and there will be many, many different stories on the same thing, right? And by running a text-clustering algorithm you can group related documents together. Okay?

So how do you apply the EM Algorithm to text clustering? I want to do this to illustrate an example in which you run the EM Algorithm on discrete-valued inputs — where the training examples x_i are discrete values. So what I want to talk about specifically is the mixture of naive Bayes model, and depending on how much you remember about naive Bayes: I talked about two event models — one was the multivariate Bernoulli event model, one was the multinomial event model. Today I'm going to use the multivariate Bernoulli event model. If you don't remember what those terms mean anymore, don't worry about it; I think the equations will still make sense. But in this setting we're given a training set x_1 through x_m.
So we're given m text documents, where each x_i is in {0, 1} to the n. So each of our training examples is an n-dimensional bit vector, right? And this was the representation where x_i subscript j indicates whether word j appears in document i, right?

And let's say that we're going to model z_i — our latent random variable, meaning our hidden random variable — and z_i will take on the two values zero and one. So this means I'm just going to find two clusters, and you can generalize to however many clusters you want.

So in the mixture of naive Bayes model, we assume that z_i is distributed Bernoulli with some parameter phi, so there's some probability of each document coming from cluster one or from cluster zero. And we assume that for the probability of x_i given z_i, right, we'll make the same naive Bayes assumption as we did before. Okay? And more specifically — well, excuse me — right, okay. And so the probability of x_i subscript j being one, given z_i equals, say, zero, will be given by a parameter phi subscript j given z equals zero. So if you take this chalkboard, and if you take all instances of the letter Z and replace them with Y, then you end up with exactly the same equations as I'd written down for naive Bayes a long time ago. Okay?
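Just to make the generative story concrete, here's a small sketch of sampling documents from this model, with two clusters as above; the array names (`phi`, `phi_word`) are made up for this sketch:

```python
import numpy as np

def sample_mixture_naive_bayes(m, phi, phi_word, rng=None):
    """Sample m documents from the mixture of naive Bayes model:
    z_i ~ Bernoulli(phi), then x_ij ~ Bernoulli(phi_word[z_i, j]) independently for each word j.
    phi_word is a (2, n) array; row z holds P(word j appears | cluster z)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.binomial(1, phi, size=m)        # latent cluster label for each document
    X = rng.binomial(1, phi_word[z])        # row i is drawn using cluster z_i's word probabilities
    return X, z
```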
And I'm not actually going to work out the mechanics of deriving the EM Algorithm, but it turns out that if you take this joint distribution of X and Z, and if you work out what the equations are for the EM Algorithm for maximum likelihood estimation of the parameters, you find that in the E step you compute, you know, let's say these parameters — these weights w_i — which are going to be equal to your posterior distribution of z_i being equal to one, conditioned on x_i, parameterized by your phi's and given your parameters; and then the M step. Okay? And that's the equation you get in the M step.

I mean, again, the equations themselves aren't too important; they just sort of convey — I'll give you a second to finish writing, I guess. And when you're done or finished writing, take a look at these equations and see if they make intuitive sense to you — why these three equations, sort of, sound like they might be the right thing to do. Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): Say that again.

Student: Y —

Instructor (Andrew Ng): Oh, yes, thank you. Right. Sorry — just, everywhere I wrote Y I meant Z. Yeah?

Student: [Inaudible] in the first place?

Instructor (Andrew Ng): No. So — what is it? Normally you initialize the phi's to be something else, say randomly. So, just like in naive Bayes, we saw zero probabilities as a bad thing, and for the same reason you try to avoid zero probabilities here, yeah. Okay?

And so just the intuition behind these equations is: in the E step, the w_i's — you're going to take your best guess for whether the document came from cluster one or cluster zero, all right? This is very similar to the intuitions behind the EM Algorithm that we talked about in a previous lecture. So in the E step we're going to compute these weights that tell us: do I think this document came from cluster one or cluster zero.
And then in the M step, I'm going to say — this numerator is the sum over all the elements of my training set of — so, informally, right, w_i is one if I think the document came from cluster one, and so this will essentially sum up all the times I saw word j in documents that I think are in cluster one. And these are sort of weighted by the actual probability that I think it came from cluster one. And then I'll divide by — again, if all of these were ones and zeros, then I'd be dividing by the actual number of documents I had in cluster one. So if all the w_i's were either ones or zeros, then this would be exactly the fraction of documents that I saw in cluster one in which I also saw word j. Okay? But in the EM Algorithm you don't make a hard assignment decision about whether this is in cluster one or in cluster zero; you instead represent your uncertainty about cluster membership with the parameters w_i. Okay?

It actually turns out that when we actually implement this particular model, it turns out that, by the nature of this computation, all the values of the w_i's will be very close to either one or zero, so they'll be numerically almost indistinguishable from ones and zeros. This is a property of naive Bayes: if you actually compute this probability over all those documents, you find that w_i is either, say, 0.0001 or 0.999 — it'll be amazingly close to either zero or one — and so the M step — and so this is pretty much guessing whether each document is in cluster one or cluster zero, and then using formulas that are very similar to maximum likelihood estimation for naive Bayes. Okay?

Cool. And if some of these equations don't look that familiar to you anymore, sort of, go back and take another look at what you saw in naive Bayes, and hopefully you can see the links there as well. Questions about this before I move on? Right, okay.
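Putting those intuitions into code, one EM iteration for this model might look roughly like the sketch below; `w` holds the E-step weights and `phi_word` the per-cluster word probabilities (names made up for this sketch), and it leaves out the smoothing you might add to avoid the zero probabilities mentioned above.

```python
import numpy as np

def mixture_nb_em_step(X, phi, phi_word):
    """One EM iteration for the mixture of naive Bayes model.
    X: (m, n) binary matrix; phi: scalar P(z = 1); phi_word: (2, n) word probabilities."""
    m, n = X.shape
    # E step: w_i = P(z_i = 1 | x_i; parameters) by Bayes' rule under the naive Bayes likelihood.
    log_lik = np.zeros((m, 2))
    for z in (0, 1):
        p = phi_word[z]
        log_lik[:, z] = X @ np.log(p) + (1 - X) @ np.log(1 - p)
    log_lik[:, 0] += np.log(1 - phi)
    log_lik[:, 1] += np.log(phi)
    log_lik -= log_lik.max(axis=1, keepdims=True)     # subtract the max for numerical stability
    post = np.exp(log_lik)
    w = post[:, 1] / post.sum(axis=1)
    # M step: weighted counts, very similar to maximum likelihood estimation for naive Bayes.
    phi_new = w.mean()
    phi_word_new = np.vstack([
        ((1 - w)[:, None] * X).sum(axis=0) / (1 - w).sum(),   # P(word j appears | cluster 0)
        (w[:, None] * X).sum(axis=0) / w.sum(),               # P(word j appears | cluster 1)
    ])
    return w, phi_new, phi_word_new
```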
Of course, the way I got these equations was by churning through the machinery of the EM Algorithm, right? I didn't just write these out of thin air. The way you do this is by writing down the E step and the M step for this model, and then, for the M step, setting derivatives equal to zero and solving; that's how you get the M step and the E step.

So the last thing I want to do today is talk about the factor analysis model, and the reason I want to do this is, sort of, two reasons. One is that factor analysis is kind of a useful model — it's not as widely used as mixtures of Gaussians and mixtures of naive Bayes, maybe, but it's sort of useful. But the other reason I want to derive this model is that there are a few steps in the math that are more generally useful. In particular — this being factor analysis — this will be an example in which we do EM where the latent random variable, the hidden random variable Z, is going to be continuous-valued. And so some of the math we'll see in deriving factor analysis will be a little bit different from what you saw before. It turns out the full derivation of EM for factor analysis is sort of extremely long and complicated, and so I won't inflict that on you in lecture today, but I will still be writing more equations than you'll see me do in other lectures, because there are, sort of, just a few steps in the factor analysis derivation that I'll physically illustrate.

So let's actually [inaudible] the model, and it's really in contrast to the mixture of Gaussians model, all right? So for the mixture of Gaussians model, which was our first model, we had — well, I actually motivated it by drawing a data set like this, right? Say one of you has a data set that looks like this, right? So this was a problem where n equals two — the data is two-dimensional — and you have, I don't know, maybe 50 or 100 training examples, whatever, right?
And I said: maybe you're given an unlabeled training set like this, and maybe you want to model this as a mixture of two Gaussians. Okay? And so a mixture of Gaussians model tends to be applicable where m is larger, and often much larger, than n — where the number of training examples you have is at least as large as, and is usually much larger than, the dimension of the data.

What I want to do is talk about a different problem, where I want you to imagine what happens if either the dimension of your data is roughly equal to the number of examples you have, or maybe the dimension of your data is even much larger than the number of training examples you have. Okay? So how do you model such very high-dimensional data? You'll see this sometimes, right? If you run a plant or something — you run a factory — maybe you have a thousand measurements all through your plant, but you only have, say, 20 days of data. So you can have 1,000-dimensional data, but only 20 examples of it. So, given data that has this property — where, as in the beginning, we're given a training set of m examples — well, what can you do to try to model the density of X?

So one thing you can do is try to model it just as a single Gaussian, right? So rather than a mixture of Gaussians, you just model it as a single Gaussian and say X is distributed Gaussian with mean mu and covariance sigma, where sigma is going to be an n-by-n matrix. And so if you work out the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate for the mean is just the empirical mean of your training set, right — so that makes sense — and the maximum likelihood estimate of the covariance matrix sigma will be this: one over m, times the sum over i of x_i minus mu, times x_i minus mu transpose, all right?
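A quick sketch of those two maximum likelihood formulas in code:

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood fit of a single Gaussian to the rows of X (m examples, n dimensions):
    mu is the empirical mean and Sigma = (1/m) * sum_i (x_i - mu)(x_i - mu)^T."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / m
    # Note: Sigma has rank at most m - 1 after centering, so it is singular whenever m <= n.
    return mu, Sigma
```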
But it turns out that in this regime, where the data is much higher-dimensional — excuse me, where the data's dimension is much larger than the number of training examples you have — if you compute the maximum likelihood estimate of the covariance matrix sigma, you find that this matrix is singular. Okay? By singular, I mean that it doesn't have full rank, or it has zero eigenvalues — I hope one of those terms makes sense. And another way of saying it is that the matrix sigma will be non-invertible.

And just in pictures, one concrete example is if n equals m equals two — if you have two-dimensional data and you have two examples. So I'd have two training examples in two dimensions — this is x_1 and x_2; this is my unlabeled data. If you fit a Gaussian to this data set, you find that — well, you remember I used to draw contours of Gaussians as ellipses, right? So these are examples of different contours of Gaussians. You find that the maximum likelihood estimate Gaussian for this corresponds to a Gaussian where the contours are sort of infinitely thin and infinitely long in that direction. Okay? So the contours will sort of be infinitely thin, right, and stretch infinitely long in that direction.

And another way of saying it is that if you actually plug in the formula for the density of the Gaussian, which is this, you won't actually get a nice answer, because the matrix sigma is non-invertible, so sigma inverse is not defined, and the determinant term is zero. So you'd have one over zero, times e to the minus of something involving a non-invertible matrix — so, not a good model.

So can we do better, right? So, given this sort of data, how do you model P of X? Well, one thing you could do is constrain sigma to be diagonal. So your covariance matrix is — okay? — in other words, you constrain sigma to be this matrix, all right, with zeroes on the off-diagonals.
I hope this makes sense: these zeroes I've written down here denote that everything off the diagonal of this matrix is zero. So the maximum likelihood estimate of the parameters will be pretty much what you'd expect, right? And in pictures, what this means is that you're modeling the distribution with Gaussians whose contours are axis-aligned. So that's one example of a Gaussian where the covariance is diagonal, and here's another example, and so here's a third example — whereas often you'll have seen me use examples of Gaussians where the covariance matrix has non-zero off-diagonal entries. Okay?

And, I don't know, you could do this to model P of X, but this isn't very nice, because you've now thrown away all the correlations between the different variables. So — the axes being X1 and X2, right — you've thrown away — you're failing to capture any of the correlations or the relationships between any pair of variables in your data. Yeah?

Student: Is it — could you say again, what does that do, for the diagonal?

Instructor (Andrew Ng): Say again?

Student: The covariance matrix being diagonal — what does that do, again? I didn't quite understand what the examples mean.

Instructor (Andrew Ng): Okay. So these are the contours of the Gaussian density that I'm drawing, right? So let's see — suppose the covariance matrix is diagonal; then you can ask, what is P of X, parameterized by mu and sigma, right, if sigma is diagonal? And so this will be some Gaussian bump, right? So — oh, boy, my drawing's really bad, but in 2-D the density of a Gaussian is like this bump-shaped thing, right? So this is the density of the Gaussian — wow, and this is a really bad drawing — where those are your axes X1 and X2, and the height of this is P of X, and so those figures over there are the contours of the density of the Gaussian. So those are the contours of this shape.

Student: No, I don't mean the contour.
What's special about these types? What makes them different from a general covariance matrix?

Instructor (Andrew Ng): Oh, I see. Oh, okay, sorry. They're axis-aligned, so — these, let's see — I'm not drawing a contour like this, right? Because the main axes of that one are not aligned with the X1 and X2 axes; that would correspond to a Gaussian where the off-diagonals are non-zero, right? Cool. Okay.

So you could do this, and this sort of works. It turns out that with just two training examples you can learn a non-singular covariance matrix, but you've thrown away all of the correlation in the data, so this is not a great model.

It turns out you can do something — well, actually, we'll come back and use this property later. But it turns out you can do something even more restrictive, which is: you can constrain sigma to be equal to sigma squared times the identity matrix. So in other words, you can constrain it to be a diagonal matrix where, moreover, all the diagonal entries must be the same, and so the cartoon for that is that you're constraining the contours of your Gaussian density to be circular. Okay? This is a sort of even harsher constraint to place on your model.

So either of these versions — diagonal sigma, or sigma being the, sort of, constant-value diagonal — are already very strong assumptions, all right? So if you have enough data, maybe you want to model just a little bit of the correlation between your different variables. So the factor analysis model is one way to attempt to do that. So here's the idea. So this is how the factor analysis model models your data. We're going to assume that there is a latent random variable, okay — which just means a hidden random variable — Z.
So Z is distributed Gaussian with mean zero and covariance the identity, where Z will be a d-dimensional vector now, and d will be chosen so that it is lower than the dimension of your X's. Okay?

And now I'm going to assume that X is given by — well, let me write this. Each x_i is distributed — actually, sorry. We're going to assume that, conditioned on the value of Z, X is given by another Gaussian, with mean given by mu plus lambda Z and covariance given by the matrix psi. So, just to say the second line in an equivalent form: equivalently, I'm going to model X as mu plus lambda Z, plus a noise term epsilon, where epsilon is Gaussian with mean zero and covariance psi.

And so the parameters of this model are going to be a vector mu, which is n-dimensional; a matrix lambda, which is n by d; and a covariance matrix psi, which is n by n. And I'm going to impose an additional constraint on psi — the constraint that psi is diagonal. Okay?

So that was the formal definition — let me actually, sort of, give a couple of examples to make this more concrete. So let's do a kind of example: suppose Z is one-dimensional and X is two-dimensional, and let's see what this model — let's see a, sort of, specific instance of the factor analysis model, and what this gives us in terms of a model over P of X, all right? So let's see.
For this model, let me assume that lambda is the vector (2, 1) and that Psi, which has to be a diagonal matrix, remember, is some particular diagonal matrix. Okay?

So Z is one-dimensional, so let me just draw a typical sample for Z, all right? So I draw each ZI from a Gaussian, and that's a typical sample for what Z might look like, and I'm gonna call these Z1, Z2, Z3 and so on. If this really were a typical sample the order of the Z's would be jumbled up, but I'm just ordering them like this to make the example easier. So, yes, that's a typical sample of the random variable Z from a Gaussian distribution with mean zero and variance one.

And in this example let me just set mu equal to zero, just so it's easier to talk about.

So lambda times Z, right? We'll take each of these numbers and multiply them by lambda. And you find that all of the values for lambda times Z will lie on a straight line, all right? So, for example, this one here would be one, two, three, four, five, six, seven, I guess. So if this was Z7 then this one here would be lambda times Z7, and that's now a point in R2, because lambda is a two by one matrix. And so what I've drawn here is a typical sample for lambda times Z.

And the final step is what a typical sample for X looks like. Well, X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance given by Psi, right? And so the last step, to draw a typical sample for the random variable X, is to take these points, which are really the same as mu plus lambda Z because mu is zero in this example, and around each point I'm going to place an axis-aligned ellipse.
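As a quick illustration, not part of the lecture, here is a small sketch of sampling from this one-dimensional-Z, two-dimensional-X instance of the model. The particular value of Psi is made up, since the board value isn't recoverable from the transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, m = 1, 2, 100                 # latent dim, observed dim, number of samples
mu = np.zeros(n)                    # mu = 0 in this example
Lam = np.array([[2.0], [1.0]])      # lambda = (2, 1)^T, an n-by-d matrix
Psi = np.diag([1.0, 0.7])           # some diagonal Psi (value assumed for illustration)

Z = rng.standard_normal((m, d))                       # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(n), Psi, m)    # epsilon ~ N(0, Psi)
X = mu + Z @ Lam.T + eps                              # x = mu + lambda z + epsilon

# The points lambda*z lie exactly on a line through the origin;
# X is that line plus axis-aligned Gaussian "fuzz" around each point.
```

Plotting X would reproduce the picture described on the board: points scattered in axis-aligned blobs around a line through the origin.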
Or, in other words, I'm going to create a Gaussian distribution centered on this point, and what I've drawn here corresponds to one of the contours of my density for epsilon, right? So you can imagine placing a little Gaussian bump here. And I'll draw an example from this little Gaussian, and let's say I get that point there; I do the same here, and so on. So I draw a bunch of examples from these Gaussians, and the orange points I drew will comprise a typical sample for what the distribution of X is under this model. Okay? Yeah?

Student: Would you add, like, the mean?

Instructor (Andrew Ng): Oh, say that again.

Student: Do you add the mean into that?

Instructor (Andrew Ng): Oh, yes, you do. And in this example I set mu to zero just to make it easier. If mu were something else you'd take the whole picture and sort of shift it to whatever the value of mu is. Yeah?

Student: [Inaudible] horizontal line right there, which was Z. For the X's, what does the Y-axis correspond to?

Instructor (Andrew Ng): Oh, so Z is one-dimensional, and here I'm plotting a typical sample for Z, so this is like zero. So this is just the Z axis, right? Z is one-dimensional data, so this line here is like a plot of a typical sample of values for Z. Okay? Yeah?

Student: You have two axes, right? And the axes pertain to the samples.

Instructor (Andrew Ng): Oh, yes, right.

Student: So you're sort of projecting them onto that?

Instructor (Andrew Ng): Let's not talk about projections yet, but, yeah, right. So these points, that's like X1 and that's X2 and so on, right? The orange points are what I see. In reality, all you ever get to see are the X's, but, just like in the mixture of Gaussians model, I tell a story about how I imagine the data came about: there was a random variable Z that led to the generation of the X's from two Gaussians.
In the same way I'm telling a story here: all the algorithm actually sees are the orange points, but we're going to tell a story about how the data came about, and that story is what comprises the factor analysis model. Okay?

So one way to see the intuition of this model, just informally, not formally, is that you can think of the factor analysis model as modeling the data as coming, more or less, from a lower-dimensional subspace. The data X lies approximately on a one-dimensional line, plus a little bit of random noise, so the X's aren't exactly on that one-dimensional line. That's one informal way of thinking about factor analysis.

We're not doing great on time. Well, let's do this. Let me just do one more quick example, which is: in this example, let's say Z is in R2 and X is in R3, right? So in this example your data Z now lies in 2-D, and let me draw this on a sheet of paper. Okay? So let's say the axes of my sheet of paper are the Z1 and Z2 axes, and here is a typical sample of points Z, right? We'll then take the samples Z, well, actually, let me draw this here as well. All right, so this is a typical sample for Z drawn on the Z1 and Z2 axes, and I guess the origin would be here, so they're centered around zero. And then we'll map those to mu plus lambda Z, and what that means is: imagine the free space of this classroom is R3. We'll take this sample of Z's and map it to a position in free space; so we'll take this sheet of paper and move it to some position and some orientation in 3-D space.
And the last step is you have X equals mu plus lambda Z plus epsilon, and so you take this set of points, which lie in some plane in our 3-D space, and add a little bit of noise to each of them, where the noise comes from axis-aligned Gaussians. Okay? So you end up with a data set that's sort of like a fat pancake, a little bit of fuzz off your pancake.

So that's the model; let's actually talk about how to fit the parameters of the model. Okay?

In order to describe how to fit the model, I first need to write down some properties of Gaussians in a slightly different way. In particular, let's say I have a vector X, and I'm going to use this notation to denote partitioned vectors: X1, X2, where if X1 is, say, an r-dimensional vector, then X2 is an s-dimensional vector, and X is an r plus s dimensional vector. Okay? So I'm going to use this notation to denote taking a vector and partitioning it into two halves: the first r elements followed by the last s elements.

So let's say you have X coming from a Gaussian distribution with mean mu and covariance sigma, where mu is itself a partitioned vector, so we break mu up into two pieces, mu1 and mu2, and the covariance matrix sigma is now a partitioned matrix. Okay? What this means is that you take the covariance matrix sigma and break it up into four blocks, right? And the dimensions are: there will be r elements here, s elements here, and r elements here. So, for example, sigma 1,2 will be an r by s matrix; it's r elements tall and s elements wide.

So this Gaussian is really a joint distribution over a whole bunch of variables, right?
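Since the board isn't visible in the transcript, here is the partitioned notation being described, written out. This is only a restatement of the setup above, not additional material from the lecture.

```latex
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \qquad
x \sim \mathcal{N}(\mu, \Sigma),
```

where x1 and mu1 are r-dimensional, x2 and mu2 are s-dimensional, Sigma_11 is r by r, Sigma_12 is r by s, Sigma_21 is s by r, and Sigma_22 is s by s.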
So X is a vector, so P of X is really a joint distribution over X1 through X of r plus s. We can then ask: what are the marginal and conditional distributions of this Gaussian?

So, for example, with my Gaussian I know what P of X is, but can I compute the marginal distribution of X1? And P of X1 is just equal to, of course, the integral over X2 of P of X1 comma X2, dX2. And if you actually perform that computation, you find that P of X1 is Gaussian with mean given by mu1 and covariance given by sigma 1,1. All right, so this is no surprise: the marginal distribution of a Gaussian is itself a Gaussian, and you just take out the relevant sub-block of the covariance matrix and the relevant sub-vector of the mean vector mu.

You can also compute conditionals: what is P of X1 given a specific value for X2, right? And the way you compute that is the usual way, P of X1 comma X2 divided by P of X2, right? And you know what both of these formulas are: the numerator is just the usual Gaussian, your joint distribution over X1, X2 is a Gaussian with mean mu and covariance sigma, and the denominator, by that marginalization operation I just talked about, is that. So you can plug in the formulas for these two Gaussians and simplify, though the simplification step is actually fairly non-trivial; if you haven't seen it before it will actually be somewhat difficult to do.
But if you plug in the Gaussian densities and simplify that expression, you find that, conditioned on the value of X2, the distribution of X1 is itself going to be Gaussian, and it will have mean mu of 1 given 2 and covariance sigma of 1 given 2, where, well, I'm not going to show the derivation, but mu of 1 given 2 is given by this and sigma of 1 given 2 is given by that. Okay? So these are just useful formulas to know for how to find the conditional distributions and the marginal distributions of a Gaussian. I won't actually show the derivation for this.

Student: Could you repeat the [inaudible]?

Instructor (Andrew Ng): Sure. So this one on the left, mu of 1 given 2, equals mu1 plus sigma 1,2 times sigma 2,2 inverse times X2 minus mu2; and sigma of 1 given 2 equals sigma 1,1 minus sigma 1,2 sigma 2,2 inverse sigma 2,1. Okay? These are also in the lecture notes.
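As a quick aside not from the lecture, here is a small numerical sketch of those marginal and conditional formulas for a partitioned Gaussian. The particular mu and Sigma values are made up for illustration.

```python
import numpy as np

def gaussian_marginal_conditional(mu, Sigma, r, x2):
    """Partition N(mu, Sigma) into the first r coordinates (x1) and the rest (x2).

    Returns the marginal of x1 and the conditional distribution of x1 given x2,
    using mu_{1|2} = mu1 + S12 S22^{-1} (x2 - mu2) and
    Sigma_{1|2} = S11 - S12 S22^{-1} S21."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

    marginal = (mu1, S11)                     # x1 ~ N(mu1, Sigma_11)

    S22_inv = np.linalg.inv(S22)
    mu_cond = mu1 + S12 @ S22_inv @ (x2 - mu2)
    Sigma_cond = S11 - S12 @ S22_inv @ S21
    return marginal, (mu_cond, Sigma_cond)

# Made-up example: a 3-dimensional Gaussian partitioned as r = 1, s = 2.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
(marg_mu, marg_S), (cond_mu, cond_S) = gaussian_marginal_conditional(
    mu, Sigma, r=1, x2=np.array([1.5, -0.5]))
```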
Shoot, I'm not where I was hoping to be on time. Well, actually, it's okay.

So it turns out, I was thinking of skipping this in the interest of time, but I kind of need it, so let's go back and use these formulas in the factor analysis model, right? Let's go back and figure out just what joint distribution the factor analysis model assumes on Z and X. Okay? So under the factor analysis model, the random variables Z and X have some joint Gaussian distribution with a mean vector, which I'll write as mu ZX, and some covariance matrix sigma.

So let's figure out what mu ZX is and what sigma is, and I'll do this so that we get a little bit more practice with partitioned vectors and partitioned matrices. Just to remind you: Z is Gaussian with mean zero and covariance the identity, and X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance Psi. I'm just writing out the same equations again.

So let's first figure out what this vector mu ZX is. Well, the expected value of Z is zero, and, as usual, I'll often drop the square brackets here. And the expected value of X is the expected value of mu plus lambda Z plus epsilon; those last two terms have zero expectation, so the expected value of X is just mu. And so the vector mu ZX, the mean parameter of this Gaussian, is the expected value of the partitioned vector Z, X, which is just zero followed by mu. Okay? That's a d-dimensional zero followed by an n-dimensional mu.

Next, let's work out what the covariance matrix sigma is. So the covariance matrix sigma, written out as a partitioned matrix, will comprise four blocks like that, and the upper left block, which I write as sigma 1,1, is just the covariance matrix of Z, which we know is the identity. I was going to show you briefly how to derive some of the other blocks, so sigma 1,2, that's the upper... oh, actually, excuse me.
Sigma 2,1, which is the lower left block, is E of X minus E of X, times Z minus E of Z, transpose. Now, X minus E of X: X is equal to mu plus lambda Z plus epsilon, and minus E of X is minus mu; and Z minus E of Z is just Z, because the expected value of Z is zero.

And so if you expand this out, the plus mu and minus mu cancel, and you have the expected value of lambda Z Z transpose, plus the expected value of epsilon Z transpose, which is just equal to lambda times the identity matrix, that is, lambda. Okay? Does that make sense? Because this second term is equal to zero: epsilon and Z are independent and both have zero expectation, so that term is zero.

Well, so the final block is sigma 2,2, which is equal to the expected value of mu plus lambda Z plus epsilon minus mu, times the same thing transposed, right? And I won't do this, but it simplifies to lambda lambda transpose plus Psi. Okay?

So, putting all this together, this tells us that the joint distribution of this vector Z, X is going to be Gaussian with mean vector given by that, which we worked out previously, so this is the mu ZX that we worked out previously, and covariance matrix given by that. Okay?
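Collecting the blocks just derived (with the upper-right block following by symmetry of the covariance matrix, since it wasn't derived explicitly above), the joint distribution referred to on the board can be written as:

```latex
\begin{bmatrix} z \\ x \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \vec{0} \\ \mu \end{bmatrix},\;
\begin{bmatrix} I & \Lambda^{\top} \\ \Lambda & \Lambda\Lambda^{\top} + \Psi \end{bmatrix}
\right).
```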
So, in principle, let's see: the parameters of our model are mu, lambda, and Psi. And so in order to find the parameters of this model, we're given a training set of m examples, and we'd like to do maximum likelihood estimation of the parameters.

And so, in principle, one thing you could do is actually write down what P of XI is. So P of XI, the distribution of X: again, you can marginalize this Gaussian, and the distribution of X, which is the lower half of this partitioned vector, is going to have mean mu and covariance given by lambda lambda transpose plus Psi, right? So that's the distribution that we're using to model P of X.

And so, in principle, one thing you could do is actually write down the log likelihood of your parameters, which is just the sum over I of log P of XI, where P of XI will be given by this Gaussian density, right? And I'm using theta as shorthand to denote all of my parameters. You know what the density of a Gaussian is, so you can say P of XI is the Gaussian with mean mu and covariance given by lambda lambda transpose plus Psi. So you can write down the log likelihood of your parameters as follows, and you can try to take derivatives of the log likelihood with respect to your parameters and maximize it, all right?

It turns out that if you do that, you end up with a sort of intractable optimization problem, or at least one that you will not be able to find analytic, closed-form solutions to. So if you say my model of X is this, and you go do maximum likelihood parameter estimation, you won't be able to find the maximum likelihood estimates of the parameters in closed form.
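For concreteness, and not from the lecture, here is a small sketch of evaluating that log likelihood numerically for given parameters. The data and parameter values are made up, and this only illustrates the quantity being maximized, not a fitting procedure.

```python
import numpy as np
from scipy.stats import multivariate_normal

def factor_analysis_log_likelihood(X, mu, Lam, Psi):
    """Log likelihood under the marginal factor analysis model:
    x ~ N(mu, Lam Lam^T + Psi), summed over the rows of X."""
    cov = Lam @ Lam.T + Psi
    return multivariate_normal.logpdf(X, mean=mu, cov=cov).sum()

# Made-up data and parameters, just to show the call.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
mu = np.zeros(2)
Lam = np.array([[2.0], [1.0]])
Psi = np.diag([1.0, 0.7])
print(factor_analysis_log_likelihood(X, mu, Lam, Psi))
```

Maximizing this expression jointly over mu, Lam, and Psi has no closed-form solution, which is the point made above and the motivation for using EM instead.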
So, in order to fit the parameters of this model, what we'll actually do is use the EM algorithm. In the E step we'll compute the QI's, setting QI of ZI to the posterior distribution of ZI given XI, parameterized by theta. The formula looks the same as before, except that one difference is that Z is now a continuous random variable, so in the E step we actually have to find the density QI of ZI; the E step requires that we find the posterior density of the random variable ZI. And the M step will then perform the following maximization, choosing theta to maximize the sum over I of the integral over ZI of QI of ZI times log of P of XI comma ZI, parameterized by theta, over QI of ZI; again, because Z is now continuous, we need to integrate over Z. Okay? So in the M step, because ZI is continuous, we now have an integral over Z rather than a sum.

Okay, I was hoping to go a little bit further in deriving these things, but I don't have time today, so we'll wrap that up in the next lecture. But before I close, let's check if there are questions about the whole factor analysis model. Okay.

So we'll come back in the next lecture; I'll wrap up this model, and I want to go a little bit deeper into the E and M steps, since there are some tricky parts for the factor analysis model specifically. Okay, I'll see you in a couple of days.