This presentation is delivered by the Stanford Center for Professional Development.

Okay. So welcome back. What I want to do today is wrap up our discussion of factor analysis, and in particular I want to step through parts of the derivation of EM for factor analysis, because there are a few steps in the EM derivation that are particularly tricky, and there are specific mistakes that people often make when deriving EM algorithms for models like factor analysis. So I wanted to show you how to do those steps right, so you can apply the same ideas to other problems as well. And then in the second half or so of this lecture, I'll talk about principal component analysis, which is a very powerful algorithm for dimensionality reduction. We'll see later what that means.

So just to recap, in a previous lecture I described a few properties of Gaussian distributions. One was that if you have a random vector X that can be partitioned into two portions, X1 and X2, and if X is Gaussian with mean mu and covariance sigma, where mu is itself a partitioned vector and sigma is a partitioned matrix written in terms of four sub-blocks, then you can look at the distribution of X and ask what the marginal distribution of, say, X1 is. And the answer we gave last time was that the marginal distribution of X1 is Gaussian with mean mu1 and covariance sigma one one, where sigma one one is the upper-left block of that covariance matrix sigma. So this one is no surprise.
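For reference, the partitioned Gaussian and the marginalization property being described can be written as:

\[
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma),
\qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
\qquad
x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}).
\]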
And I also wrote down the formula for computing conditional distributions, such as P of X1 given X2. Last time I wrote down that the distribution of X1 given X2 is also Gaussian, with parameters that I wrote as mu of one given two and sigma of one given two, where mu of one given two is, let's see, given by this formula. Okay?

So with these formulas we'll be able to take a pair of jointly Gaussian random variables, X1 and X2, which here are both vectors, and compute the marginal and conditional distributions, P of X1 or P of X1 given X2. I'll come back and use the marginal formula in a second, and then when I come back and derive the E step of the EM algorithm for factor analysis, I'll actually be using these two formulas again.

And just to continue summarizing what we did last time, we said that factor analysis is an unsupervised learning problem, so we're given an unlabeled training set where each XI is a vector in RN as usual. We want to model the density of X, and our model is that we imagine there's a latent random variable Z generating the data, where Z is Gaussian with zero mean and identity covariance. Z will be some low-dimensional thing [inaudible] and we [inaudible]. And we imagine that X is generated as mu plus lambda Z plus epsilon, where epsilon is a Gaussian random variable with mean zero and covariance matrix psi. And so the parameters of this model are mu, which is an N-dimensional vector; lambda, which is an N-by-D matrix; and psi, which is N by N and diagonal.
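For reference, the conditional Gaussian formulas referred to above, and the factor analysis model just described, are:

\[
x_1 \mid x_2 \sim \mathcal{N}(\mu_{1\mid 2}, \Sigma_{1\mid 2}),
\qquad
\mu_{1\mid 2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),
\qquad
\Sigma_{1\mid 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21};
\]
\[
z \sim \mathcal{N}(0, I), \qquad \epsilon \sim \mathcal{N}(0, \Psi), \qquad x = \mu + \Lambda z + \epsilon,
\]
with parameters \( \mu \in \mathbb{R}^n \), \( \Lambda \in \mathbb{R}^{n \times d} \), and diagonal \( \Psi \in \mathbb{R}^{n \times n} \).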
So the cartoon I drew last time for factor analysis was, I said, maybe that's a typical example of the data points ZI, and in this example I had D equals one and N equals two, so Z in this example is one-dimensional. And so you take this data, map it to [inaudible] mu plus lambda Z, and that may give you some set of points there. And lastly, this model envisions that you'd place a little Gaussian bump around each of these, say, and sample, and the Xs are then maybe... that would be a typical sample of the Xs under this model.

So how do we fit the parameters of this model? Well, the joint distribution of Z and X is actually Gaussian, with parameters given by some vector [inaudible] mu XZ and some covariance matrix sigma. And as for what those two things are, this vector [inaudible] is a vector of zeroes appended to the vector mu, and the matrix sigma is this partitioned matrix. We also worked this out last time.

So you can ask what the distribution of X is under this model, and the answer is that under this model X is Gaussian with mean mu and covariance lambda lambda transpose plus psi. So I just take the second block of the mean vector and take the lower-right corner block of the covariance matrix, and so this is really my formula for computing the marginal distribution of a Gaussian, except that I'm computing the marginal distribution of the second half of the vector rather than the first half. So this is the marginal distribution of X under my model. And so if you want to learn... Student: [Inaudible] initial distribution [inaudible]? Instructor (Andrew Ng): Let's see. Oh, yes.
Yes. So in this one I'm specifying the conditional distribution of X given Z. The conditional distribution of X given Z is Gaussian with mean mu plus lambda Z and covariance psi. This is what that [inaudible].

So since this is the marginal distribution of X, given my training set of M unlabeled examples I can actually write down the log likelihood of my training set. Actually, no, let's just write down the likelihood. So the likelihood of my parameters given my training set is the product from I equals one to M of P of XI given the parameters. I can actually write down what that is, because XI is Gaussian with mean mu and covariance lambda lambda transpose plus psi, so you can write this down as one over two pi to the N over two times the determinant of that covariance to the one half, times E to the minus one half XI minus mu transpose, times the inverse of that covariance, times XI minus mu. So that's my formula for the density of a Gaussian that has mean mu and covariance lambda lambda transpose plus psi. So this is my likelihood of the parameters given a training set.

And one thing you could do is actually take this likelihood and try to maximize it in terms of the parameters, try to find the maximum likelihood estimates of the parameters. But if you do that, if you take the log to get the log likelihood, take derivatives, and set the derivatives equal to zero, you find that you'll be unable to solve for the maximum analytically. You won't be able to solve this maximum likelihood estimation problem in closed form. If you take the derivative of this with respect to the parameters, set the derivatives equal to zero, and try to solve for the values of the parameters lambda, mu, and psi, you won't be able to do that [inaudible]. So what we'll use instead to estimate the parameters in a factor analysis model will be the EM algorithm.
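For reference, the marginal distribution and the likelihood being written out here are:

\[
x \sim \mathcal{N}(\mu,\ \Lambda\Lambda^T + \Psi),
\]
\[
L(\mu, \Lambda, \Psi) = \prod_{i=1}^{m}
\frac{1}{(2\pi)^{n/2}\,\lvert\Lambda\Lambda^T + \Psi\rvert^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\,(x^{(i)}-\mu)^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)\Big),
\]
and, as noted, setting the derivatives of the log of this to zero has no closed-form solution.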
Student: Why is the log likelihood P of X, and not P of X given Z or P of X and Z or something?

Instructor (Andrew Ng): Oh, right. So the question is why the likelihood is P of X and not P of X given Z or P of X and Z. The answer is, let's see: by analogy to the mixture of Gaussians model, we're given a training set that just comprises a set of unlabeled training examples, and, for convenience, for whatever reason, it's easier for me to write down a model that defines the joint distribution P of X comma Z. But what I would really like to do is maximize, as a function of my parameters (I'm using theta as shorthand to denote all the parameters of the model), the probability of the data I actually observe. And this would be the product from I equals one to M of P of XI parameterized by theta, and this is really integrating out Z. So I only ever get to observe the Xs; the Zs are latent random variables, or hidden random variables. And so what I'd like to do is do this maximization in which I've implicitly integrated out Z. Does that make sense? Actually, are there other questions about the factor analysis model? Okay.

So here's the EM algorithm. In the E step, we compute the conditional distribution of ZI given XI under our current setting of the parameters. And in the M step, we perform this maximization. If this looks a little bit different from the previous versions of EM that you've seen, the only difference is that now I'm integrating over ZI, because ZI is now a Gaussian random variable, a continuous-valued thing, so rather than summing over ZI, I now integrate over ZI. And if you replace the integral with a sum over ZI, then this is exactly the M step you saw when we worked it out for the mixture of Gaussians model.
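For reference, the two steps being written on the board take the standard EM form for a continuous latent variable:

\[
\text{E step:}\quad Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta),
\]
\[
\text{M step:}\quad \theta := \arg\max_{\theta}\ \sum_{i=1}^{m} \int_{z^{(i)}} Q_i(z^{(i)})\,
\log\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\; dz^{(i)}.
\]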
So it turns out that in order to implement the E step and the M step, there are actually sort of three key things that are different from the models you saw previously, so I want to talk about them.

The first one is that in the E step, Z is now a continuous-valued random variable, so you now need a way to represent these continuous-valued densities, probability density functions, to represent QI of Z. Fortunately, in this problem that's not difficult to do, and in particular the conditional distribution of ZI given XI and our parameters (which I'm going to omit from this equation) is going to be Gaussian, with mean and covariance given by these two things, so I write it like that, where the mean is given by this expression here. And the way I got this formula was that if you match this to the formula I had previously for computing conditional distributions of Gaussians, this corresponds to the term mu one plus sigma one two, sigma two two inverse, times X two minus mu two. So those are the terms corresponding to the formula I had previously for computing the conditional distribution of a Gaussian. And the covariance is going to be given by that, and again those are the terms corresponding to the formulas I had for the very first thing I did, the formulas for computing conditional distributions of Gaussians.

And so this is the E step. You need to compute the Q distribution, you need to compute QI of ZI, and to do that what you actually do is compute this vector mu of ZI given XI and this matrix sigma of ZI given XI, and together these represent the mean and covariance of the distribution Q, where Q is going to be Gaussian. So that's the E step.
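Concretely, matching the conditional Gaussian formulas to the joint distribution of (z, x) gives the E-step quantities:

\[
\mu_{z^{(i)}\mid x^{(i)}} = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)} - \mu),
\qquad
\Sigma_{z^{(i)}\mid x^{(i)}} = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda,
\]
so that \( Q_i(z^{(i)}) = \mathcal{N}\big(\mu_{z^{(i)}\mid x^{(i)}},\ \Sigma_{z^{(i)}\mid x^{(i)}}\big) \).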
Now, here's the M step. I'll just mention that there are sort of two ways to derive M steps, especially for Gaussian models like these. Here's the key trick, I guess, which is that when you compute the M step, you often need to compute integrals that look like these: the integral over ZI of QI of ZI times some function of ZI. Let me just write ZI there, for instance. So there are two ways to compute integrals like these.

One way is to write this out explicitly. This is a commonly made... well, not a mistake, a commonly made unnecessary complication, I guess. One way to try to compute this integral is to write it out as an integral over ZI, and we know what QI is, right? QI is a Gaussian, so you get one over two pi to the D over two times the determinant of the covariance of QI (this sigma of ZI given XI, which you computed in the E step) to the one half, and then times E to the minus one half, ZI minus mu of ZI given XI, transpose, sigma inverse, ZI minus mu of ZI given XI. So that's my Gaussian density. And then times ZI, d ZI. And writing this out is the unnecessarily complicated way to do it, because once you've written out this integral, if you want to actually integrate (that's a times, a multiplication), if you actually want to evaluate this integral, it just looks horribly complicated. I'm not quite sure how to evaluate this.

By far the simpler way to evaluate this integral is to recognize that it is just the expectation, with respect to ZI drawn from the distribution QI, of the random variable ZI. And once you see it this way, you notice that this is just the expected value of ZI under the distribution QI, but QI is a Gaussian distribution with mean given by that mu vector and covariance given by that sigma matrix, and so the expected value of ZI under this distribution is just mu of ZI given XI. Does that make sense?
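In symbols, the observation is simply:

\[
\int_{z^{(i)}} Q_i(z^{(i)})\, z^{(i)}\, dz^{(i)}
= \mathbb{E}_{z^{(i)}\sim Q_i}\big[z^{(i)}\big]
= \mu_{z^{(i)}\mid x^{(i)}}.
\]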
So by making the observation that this is just an expected value, there's a much easier way to compute that integral. [Inaudible].

So we'll apply the same idea to the M step. In the M step we want to maximize this. There's also a summation over I outside, but this is essentially the term inside the arg max of the M step. Taking the integral over ZI, just observe that that's actually, I guess, just forming an expectation with respect to the random variable ZI of the thing inside. And so this simplifies to the following.

And it turns out actually all I did here was use the fact that P of X given Z times P of Z equals P of X comma Z, right? That's just combining this term and that term, which gives you the numerator in the original. And it turns out that for factor analysis, of these two terms, this is the only one that depends on the parameters. The distribution P of ZI has no parameters, because ZI was just drawn from a Gaussian with zero mean and identity covariance. And QI of Z is this fixed Gaussian; it doesn't depend on the parameters theta. And so in the M step we really just need to maximize this one term with respect to all the parameters, mu, lambda, and psi.

So let's see. There's sort of one more key step I want to show, but to get there I have to write down an unfortunately large amount of math, so let's go into it. Okay. So in the M step, we want to maximize this, and all expectations are with respect to ZI drawn from the distributions QI; sometimes I'll be sloppy and just omit this.
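For reference, the quantity being maximized in the M step can be written as:

\[
\sum_{i=1}^{m} \mathbb{E}_{z^{(i)}\sim Q_i}\Big[
\log p\big(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi\big)
+ \log p\big(z^{(i)}\big)
- \log Q_i\big(z^{(i)}\big)
\Big],
\]
and only the first term inside the expectation depends on the parameters mu, lambda, and psi.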
And now this distribution, P of XI given ZI: that is a Gaussian density, because XI given ZI is Gaussian with mean mu plus lambda ZI and covariance psi. And so in this step of the derivation, I will actually go ahead and substitute in the formula for the Gaussian density. So I will take that and plug it in here: one over two pi to the N over two times the determinant of psi to the one half, times E to the dot, dot, dot. So I will go ahead and plug in the Gaussian density.

And when you do that, you find that you get the expectation of... excuse me, I forgot to say: to keep the derivation from getting too complicated, I'm actually just going to maximize this with respect to the parameter lambda. So you want to maximize this with respect to lambda, psi, and mu, but just to keep the amount of math I'm going to do in class sane, I'm just going to show how to maximize this with respect to the matrix lambda, and I'll pretend that psi and mu are fixed.

And so if you substitute in the Gaussian density, you get some expected value of a constant (the constant may depend on psi but not on lambda) minus this thing. And this quadratic term essentially came from the exponent in your Gaussian density: when I take the log of the exponential, I end up with this quadratic term.

And so you take the derivative of the expression above with respect to the matrix lambda and you set that to zero. We want to maximize this expression with respect to the parameter lambda, so you take the derivative of this with respect to lambda (excuse me, that's a minus sign) and set the derivative of this expression to zero, because you set derivatives to zero to maximize things, right? When you do that and simplify, you end up with the following.
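The update that results takes the standard form:

\[
\Lambda := \Big(\sum_{i=1}^{m} \big(x^{(i)} - \mu\big)\, \mathbb{E}\big[z^{(i)}\big]^T\Big)
\Big(\sum_{i=1}^{m} \mathbb{E}\big[z^{(i)} z^{(i)T}\big]\Big)^{-1}.
\]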
And so that's, in the M step, the value you should get and use to update your parameter lambda. And again, the expectations are with respect to ZI drawn from the distributions QI.

So the very last step of this derivation is that we need to work out what these two expectations are. The very first term, E of ZI transpose, is just mu of ZI given XI transpose, because the QI distribution has mean given by mu of ZI given XI. To work out the other term, let me just remind you that if you have a random variable Z that's Gaussian with mean mu and covariance sigma, then the covariance sigma is E of Z Z transpose minus E of Z times E of Z transpose. That's one of the definitions of the covariance, and so this implies that E of Z Z transpose equals sigma plus E of Z times E of Z transpose. And so this second term here becomes sigma of ZI given XI plus mu of ZI given XI times mu of ZI given XI transpose. Okay?

And so that's how you compute E of ZI transpose and E of ZI ZI transpose and substitute them back into this formula, and you would then have your M step update for the parameter matrix lambda.

And the last thing I want to point out in this derivation is that, probably because of the name EM algorithm, expectation maximization, one common mistake with the EM algorithm is that in the E step some people want to take the expectation of the random variable Z, and then in the M step they just plug in that expected value everywhere they see it. So in particular, one common mistake in deriving EM for factor analysis is to look at this and say, oh look, I see Z Z transpose. Let's just plug in the expected value under the Q distribution.
And so they plug in mu of ZI given XI times mu of ZI given XI transpose into that expectation, and that would be an incorrect derivation of EM, because it's missing this other term, sigma of ZI given XI. So one common misconception about EM is that in the E step you just compute the expected value of the hidden random variable, and in the M step you plug in that expected value. It turns out that in some algorithms that is the right thing to do. In the mixture of Gaussians and the mixture of [inaudible] models, that would actually give you the right answer, but in general the EM algorithm is more complicated than just taking the expected values of the random variables and then pretending that they were observed at their expected values. So I wanted to go through this just to illustrate that step as well.

So just to summarize, the three key things that came up in this derivation were: first, that for the E step we had a continuous Gaussian random variable, and so to compute the E step we actually compute the mean and covariance of the distribution QI. The second thing that came up was that in the M step, when you see these integrals, if you interpret them as expectations then the rest of the math becomes much easier. And the final thing was, again in the M step, that the EM algorithm is derived from a certain maximization problem that we solve; it is not necessarily just plugging the expected value of ZI in everywhere.

Let's see. I feel like I just did a ton of math and wrote down way too many equations, and even doing this I was skipping many steps. So you can go to the lecture notes to see the derivations of the steps I skipped, like how you actually take derivatives with respect to the matrix lambda, and how to compute the updates for the other parameters as well, for mu and for psi, because this was only for lambda. And so that's the factor analysis algorithm. Justin?
Student: I was just wondering, in the step on the lower right board, you said that the second term doesn't have any parameters that we're interested in, and the first term has all the parameters, so we'll [inaudible], but it seems to me that QI has a lot of parameters [inaudible].

Instructor (Andrew Ng): I see. Right. Let's see. So the question was, doesn't the term QI have parameters? So in the EM algorithm, it turns out that sometimes P of ZI may have parameters, but QI of ZI never has any parameters. In the specific case of factor analysis, P of ZI doesn't have parameters. In other examples, say the mixture of Gaussians model, ZI was a multinomial random variable, and so in that example P of ZI has parameters, but it turns out that QI of ZI will never have any parameters. In particular, QI of ZI is going to be this Gaussian distribution, Gaussian with mean given by mu of ZI given XI and covariance sigma of ZI given XI. And so it's true that mu and sigma may themselves have depended on the values of the parameters I had in the previous iteration of EM, but the way to think about Q is that I'm going to take the parameters from the previous iteration of the algorithm and use them to compute what QI of ZI is. That's the E step of the EM algorithm. And then once I've computed what QI of ZI is, it is a fixed distribution: I'm going to use these fixed values for mu and sigma, and just keep those two values fixed as I run the M step.

Student: So that's... I guess I was confused because in the second point over there there's a lot of... it looks like they're parameters, but I guess they're old iterations of the parameters.

Instructor (Andrew Ng): Oh, yeah. Yes, you're right. When I wrote down QI of ZI, that was a function of, so yeah, the parameters from the previous iteration, and I want to compute the new set of parameters. Okay. More questions?
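Putting the pieces above together, here is a minimal numpy sketch of one EM iteration for factor analysis. It is an illustrative sketch only, not the lecture's board work verbatim: the names (X, mu, Lam, Psi) are my own, and only the lambda update discussed in lecture is shown (the mu and psi updates are in the lecture notes).

```python
import numpy as np

def factor_analysis_em_step(X, mu, Lam, Psi):
    """One illustrative EM iteration for factor analysis.
    X: m x n data matrix (rows are examples), mu: length-n mean,
    Lam: n x d factor loading matrix, Psi: n x n diagonal noise covariance."""
    m, n = X.shape
    d = Lam.shape[1]

    # E step: q_i(z_i) = N(mu_{z|x}, Sigma_{z|x}) for each example.
    S = Lam @ Lam.T + Psi                      # marginal covariance of x
    S_inv = np.linalg.inv(S)
    Xc = X - mu                                # rows are (x_i - mu)
    Mu_z = Xc @ S_inv @ Lam                    # m x d, row i is mu_{z_i|x_i}
    Sigma_z = np.eye(d) - Lam.T @ S_inv @ Lam  # shared d x d posterior covariance

    # M step (lambda only):
    # Lambda = (sum_i (x_i - mu) E[z_i]^T) (sum_i E[z_i z_i^T])^{-1}
    # where E[z_i z_i^T] = Sigma_{z|x} + mu_{z_i|x_i} mu_{z_i|x_i}^T;
    # the Sigma_{z|x} term is exactly what the "just plug in the expectation"
    # shortcut would incorrectly drop.
    A = Xc.T @ Mu_z                            # n x d
    B = m * Sigma_z + Mu_z.T @ Mu_z            # d x d, equals sum_i E[z_i z_i^T]
    Lam_new = A @ np.linalg.inv(B)
    return Lam_new, Mu_z, Sigma_z
```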
So this is probably the most math I'll ever do in a lecture in this entire course. Let's now talk about a different algorithm. Actually, which board was I on?

What I want to do now is talk about an algorithm called principal components analysis, which is often abbreviated PCA. Here's the idea. PCA has a very similar idea to factor analysis, but it sort of gets at the problem a little more directly than factor analysis does.

So the question is this: we're still doing unsupervised learning, so given a training set of M examples where each XI is an N-dimensional vector as usual, what I'd like to do is reduce it to a lower, K-dimensional data set, where K is strictly less than N and quite often will be much smaller than N.

I'll give a couple of examples of why we want to do this. Imagine that you're given a data set that contains measurements of, I don't know, people or something, and unknown to you, whoever collected this data actually included the height of each person in centimeters as well as the height of the person in inches. Because of rounding off to the nearest centimeter or the nearest inch, the values won't exactly match up, but along these two dimensions the data will lie extremely close to a straight line; it won't lie exactly on a straight line because of the rounding, but it will lie very close to it. And so if we have a data set like this, it seems that what you really care about is that axis; that axis is really the variable of interest. That's maybe the closest thing you have to the true height of a person. And this other axis is just noise.
So if you can reduce the dimension of this data from two dimensions to one, then maybe you can get rid of some of the noise in it. Quite often you won't know that this was centimeters and this was inches; you may have a data set with a hundred attributes, and you just didn't happen to notice that one was centimeters and one was inches.

Another example that I sometimes think about: some of you know that my students and I work with [inaudible] helicopters a lot. So imagine that you take surveys, or quizzes, measurements of radio-controlled helicopter pilots. Maybe on one axis you have measurements of your pilot's skill, how skillful your helicopter pilot is, and on another axis maybe you measure how much they actually enjoy flying. And maybe this is really roughly one-dimensional data, and there's some variable of interest, which I'll call maybe pilot attitude, that somehow determines their skill and how much they enjoy flying. And so again, if you can reduce this data from two dimensions to one, maybe you'd have a slightly better measure of what I'm loosely calling pilot attitude, which may be what you really wanted to [inaudible]. So let's talk about an algorithm to do this, and I'll come back and talk about more applications of PCA later.

So here's the algorithm. Before running PCA, normally you will preprocess the data as follows. I'll just write this out, I guess; I know some of you are writing. So this is maybe an unnecessary amount of writing to say: compute the mean of my training set and subtract out the mean, so now I've zeroed out the mean of my training set. And the other step is that I'll compute the variance of each feature after zeroing out the mean, and then I'll divide each feature by its standard deviation, so that each of my features now has equal variance.
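A quick numpy sketch of those two preprocessing steps (the function name and shapes are my own; X is an m x n matrix with examples stored as rows):

```python
import numpy as np

def pca_preprocess(X):
    """PCA preprocessing sketch: zero out the mean of each feature,
    then scale each feature to unit variance."""
    X = X - X.mean(axis=0)     # step 1: subtract the per-feature mean
    std = X.std(axis=0)        # per-feature standard deviation
    std[std == 0] = 1.0        # guard: leave constant features unscaled
    return X / std             # step 2: each feature now has unit variance
```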
So these are some standard preprocessing steps we'll often do for PCA. I'll just mention that the second step is usually done only when your different features are on different scales. So if you're taking measurements of people, one feature may be the height, another may be the weight, another may be the strength, another may be how they age or whatever; all of these quantities are on very different scales, so you'd normally normalize the variance. Sometimes, if all the XIJs are the same type of thing, for example if you're working with images and the XIJs are the pixels, then they're on the same scale because they're all pixel intensity values ranging from zero to 255, and you may omit this step.

So after preprocessing, let's talk about how we would find the main axes along which the data varies. How would you find the principal axes of variation [inaudible]? To do that, let me just go through one specific example, and then we'll formalize the algorithm. Here's my training set, comprising five examples. It has roughly zero mean, and the variance along the X1 and X2 axes is the same. There's the X1 axis, there's the X2 axis. And so the principal axis of variation of this data is roughly the positive 45-degree axis, so what I'd like is for my algorithm to conclude that this direction that I've drawn with this arrow, U, is the best direction onto which to project the data, the axis along which my data really varies.

So let's think about how we formalize that. One thing we want to look at is this: suppose I take that axis (this is really the axis onto which I want to project, the one I want to use to capture most of the variation of my data) and I take my training set and project it onto this line; then I get that set of points.
And so you notice that those dots, the projections of my training set onto this axis, have very large variance. In contrast, if I were to choose a different direction, let's say this is really [inaudible] the worst direction onto which to project my data, and I project all my data onto this axis, then I find that the projections of my data onto this other line, the purple line, have much smaller variance; my points are clustered together much more tightly.

So one way to formalize this notion of finding the main axis of variation of the data is that I would like to find the vector U, the direction U, such that when I project my data onto that direction, the projected points vary as much as possible. In other words, I'm going to find a direction so that when I project my data onto it, the projections are widely spaced out; there's large variance.

So let's see. I want to find the direction U. Just as a reminder, if I have a vector U of norm one, then the projection of a vector XI onto U has length XI transpose U. To project a vector X onto a unit vector, the length of the projection is just XI transpose U. And so to formalize my PCA problem, I'm going to choose a vector U to maximize, subject to the constraint that the norm of U, the length of U, is one, the sum from I equals one to M of the squared length of the projection of XI onto U. In particular, I want the sum of squared distances of the projections from the origin to be large; I want the projections of X onto U to have large variance. And just to simplify some of the math later, let me put a one over M in front.
And so that quantity on the right is equal to one over M times the sum over I of U transpose XI times XI transpose U, and so I can simplify, and I get U transpose, times one over M sum over I of XI XI transpose, times U. So I want to maximize U transpose times some matrix times U, subject to the constraint that the length of U must be equal to one.

And so some of you will recognize that this means U must be the principal eigenvector of this matrix in the middle. Let me just write that down and say a few more words about it. So this implies that U is the principal eigenvector of this matrix, which I'll just call sigma. It actually turns out to be a covariance matrix: sigma equals one over M, sum from I equals one to M, of XI XI transpose.

Actually, let's check: how many of you are familiar with eigenvectors? Oh, cool. Lots of you, almost all of you. Great. What I'm about to say will be extremely familiar, but it's worth saying anyway. So if you have a matrix A and a vector U, and they satisfy AU equals lambda U, then that is what it means for U to be an eigenvector of the matrix A, and the value lambda here is called an eigenvalue. And the principal eigenvector is just the eigenvector that corresponds to the largest eigenvalue.

One thing that some of you may have seen, but I'm not sure that you have, and I just want to relate this to things you already know, is that this optimization problem is really: maximize U transpose sigma U subject to the norm of U being equal to one, and let me write that constraint as U transpose U equals one. And so to solve this constrained optimization problem, you write down the Lagrangian with Lagrange multiplier lambda, because this is a constrained optimization.
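For reference, the constrained problem and its Lagrangian are:

\[
\max_{u}\ u^T \Sigma u \quad \text{subject to}\quad u^T u = 1,
\qquad
\mathcal{L}(u, \lambda) = u^T \Sigma u - \lambda\,(u^T u - 1).
\]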
And so to actually solve this optimization, you take the derivative of L with respect to u, and that gives you, up to a constant factor, sigma u minus lambda u. You set the derivative equal to zero, and this shows that sigma u equals lambda u, and therefore the value of u that solves the constrained optimization problem we care about must be an eigenvector of sigma, and in particular it turns out to be the principal eigenvector. So just to summarize, what have we done? We've shown that given a training set, if you want to find the principal axis of variation of the data, the 1-D axis on which the data really varies the most, what you do is construct the covariance matrix sigma, the matrix sigma I wrote down just now, and then you find the principal eigenvector of the matrix sigma. And this gives you the best 1-D subspace onto which to project the data. And more generally, if you want a K-dimensional subspace onto which to project your data, you would then choose u_1 through u_K to be the top K eigenvectors of sigma. And by top K eigenvectors, I just mean the eigenvectors corresponding to the K largest eigenvalues. I guess I showed this only for the case of a 1-D subspace, but this holds more generally. And now the eigenvectors u_1 through u_K give you a new basis with which to represent your data. Student: [Inaudible] diagonal? Instructor (Andrew Ng): Let's see. So by convention, PCA will choose orthogonal axes, so I think that's what I'm saying.
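As a concrete illustration of that recipe, here is a minimal numpy sketch; it is not code from the lecture. The function name pca_fit is my own, and it assumes the data has already been mean-normalized, matching the covariance matrix sigma written above.

```python
import numpy as np

def pca_fit(X, k):
    """Return the top-k eigenvectors (as columns) of Sigma = (1/m) * sum_i x_i x_i^T.

    X is an (m, n) array with one training example per row, assumed to be
    mean-normalized already.
    """
    m = X.shape[0]
    Sigma = (X.T @ X) / m                        # (n, n) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # Sigma is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    return eigvecs[:, order]                     # columns are u_1, ..., u_k
```

For k equal to one this returns the principal eigenvector, the best 1-D direction; since sigma is symmetric, the returned columns are orthonormal.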
Here's one more example. Imagine that you have a three-dimensional data set; it's very hard to draw in 3-D on the board, so let me just try this. Imagine that my x1 and x2 axes lie in the plane of the board, and the x3 axis points orthogonally out of the board, and imagine that all the data lies roughly in the plane of the board, but there's just a tiny bit of fuzz, so that some of the data lies just a couple of millimeters off the board. So you run PCA on this data. You find that u_1 and u_2 will be some pair of basis vectors that essentially lie in the plane of the board, and u_3 will be an orthogonal axis that points roughly out of the plane of the board. So if I reduce this data to two dimensions, then the basis vectors u_1 and u_2 now give me a new basis with which to construct my lower dimensional representation of the data. Just to be complete about what that means: previously we had, say, fairly high dimensional input data x^(i) in R^n, and now, if I want to represent my data in this new basis given by u_1 up to u_K, I would instead take each of my original training examples x^(i) and replace it with a different vector, which I'll call y^(i), computed as y^(i) = (u_1 transpose x^(i), u_2 transpose x^(i), and so on up to u_K transpose x^(i)). And so the y^(i)'s are going to be K-dimensional, where your choice of K will be less than n, so this is a lower dimensional representation of your data, and it serves as an approximate representation of your original data, where you're using only K numbers to represent each training example rather than n numbers.
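Here is that representation step as a small sketch, under the same assumptions as the earlier snippet (the names are mine, not the lecture's); U is a matrix whose columns are the chosen eigenvectors, for example the output of the hypothetical pca_fit above.

```python
import numpy as np

def pca_transform(X, U):
    """Map each row x^(i) of X to y^(i) = (u_1^T x^(i), ..., u_k^T x^(i)).

    X has shape (m, n); U has shape (n, k) with the top-k eigenvectors as columns.
    """
    return X @ U    # shape (m, k): row i is the k-dimensional representation y^(i)

# Usage sketch:
# U = pca_fit(X, k=2)        # hypothetical helper from the earlier snippet
# Y = pca_transform(X, U)    # each training example is now 2 numbers instead of n
```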
Let's see. Student: [Inaudible] to have eigenvectors [inaudible] trivial eigenvectors? Instructor (Andrew Ng): Is it possible not to have eigenvectors, but to have trivial eigenvectors? Student: [Inaudible] determined by [inaudible]? Is it a condition wherein [inaudible]. Instructor (Andrew Ng): Let's see, what do you mean by trivial eigenvectors? Student: [Inaudible] linear algebra from [inaudible]. Instructor (Andrew Ng): Oh, okay. Yes, I see. Let me see if I can get this right. So there are some matrices that are, I think the term is degenerate; sorry, deficient, I think, is what it's called. Is that right, Ziko? Yeah. So some matrices are deficient and actually don't have a full set of eigenvectors, like an n by n matrix that does not have n distinct eigenvectors. But that turns out not to be possible in this case, because the covariance matrix sigma is symmetric, and symmetric matrices are never deficient, so the matrix sigma will always have a full set of eigenvectors. There's one other issue, though, which is repeated eigenvalues. So, for example, it turns out that in this example, if my covariance matrix looks like this, then the identity of the first two eigenvectors is ambiguous. And by that I mean you can choose this to be u_1 and this to be u_2, or I can just as well rotate these eigenvectors and choose that to be u_1 and that to be u_2, or even choose that to be u_1 and that to be u_2, and so on. And so when you apply PCA, one thing to keep in mind is that eigenvectors can sometimes rotate freely within their subspaces when you have repeated eigenvalues, or close to repeated eigenvalues. And so the way to think about the vectors u is as a basis with which to represent your data, but the basis vectors can sometimes rotate freely, and so it's not always useful to look at the eigenvectors one at a time and say, this is my first eigenvector, capturing whatever, the height of a person, or this is my second eigenvector, and it captures their skill at [inaudible], or whatever. That's a very dangerous thing to do when you do PCA. What is meaningful is the subspace spanned by the eigenvectors, but looking at the eigenvectors one at a time is sometimes a dangerous thing to do, because they can often rotate freely.
Tiny numerical changes can cause the eigenvectors to change a lot, but the subspace spanned by the top K eigenvectors will usually be about the same. It actually turns out there are multiple possible interpretations of PCA. I'll just give one more, without proof, which is that, given a training set like this, another view of PCA is, and let me just choose a direction; this is not the principal component, I'm just choosing some direction and projecting my data onto it. This is clearly not the direction PCA would choose, but what you can think of PCA as doing is choosing a subspace onto which to project your data so as to minimize the sum of squared differences between the projections and the original points. So, in other words, another way to think of PCA is that it's trying to minimize the sum of squares of these distances between the dots, the points x, and the points onto which I'm projecting the data. It turns out there are actually, I don't know, something like nine or ten different interpretations of PCA; this is another one. There are a bunch of ways to derive PCA. You'll get to play with some PCA ideas more in the next problem set. What I want to do next is talk about a few applications of PCA. Here are some ways that PCA is used. One is visualization. Very often you have high dimensional data sets. Someone gives you a 50-dimensional data set, and it's very hard for you to look at a data set that's 50-dimensional and understand what's going on, because you can't plot something in 50 dimensions. So common practice, if you want to visualize a very high dimensional data set, is to take your data and project it into, say, a 2-D plot, or project it into a 3-D plot so you can render a 3-D display on a computer, so that you can better visualize the data and look for structure.
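Here is a minimal sketch of that kind of visualization, again just my own illustration: numpy and matplotlib are assumed, and a synthetic data set stands in for a real 50-dimensional one.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 50-dimensional data that secretly varies mostly within a 3-D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 50))
X += 0.05 * rng.normal(size=X.shape)
X -= X.mean(axis=0)                              # mean-normalize

Sigma = (X.T @ X) / X.shape[0]                   # covariance matrix
vals, vecs = np.linalg.eigh(Sigma)
U = vecs[:, np.argsort(vals)[::-1][:2]]          # top two eigenvectors

Y = X @ U                                        # 2-D coordinates for plotting
plt.scatter(Y[:, 0], Y[:, 1], s=8)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```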
One particular example of doing this that I learned about recently was in Krishna Shenoy's lab here at Stanford, in which he had readings from tens of different parts of a monkey brain; I actually don't know if the number was exactly 50, but it was tens of neurons. So you'd have these 50-dimensional readings, 50-dimensional vectors corresponding to different amounts of electrical activity in different parts of the monkey brain, and so you have a 50-dimensional time series, and it's very hard to visualize very high dimensional data. But what he did was use PCA to project this 50-dimensional data down to three dimensions, and then you can visualize this data in three dimensions just using a 3-D plot, so you could visualize what the monkey is thinking over time. Another common application of PCA is compression, so if you have high dimensional data and you want to store it with fewer numbers, clearly PCA is a great way to do this. It turns out also that sometimes in machine learning you're just given extremely high dimensional input data, and for computational reasons you don't want to deal with such high dimensional data. And so one fairly common use of PCA is to take very high dimensional data and represent it with a low dimensional subspace; that's the y^(i) representation I wrote down just now, so that you can work with much lower dimensional data.
And it just seems to be a fact of life that when you're given extremely high dimensional data, almost all the time in practice, of all the high dimensional data sets I've ever seen, the points often lie on much lower dimensional subspaces, so very often you can dramatically reduce the dimension of your data without really throwing away too much of the information. So, let's see. If your learning algorithm takes a long time to run on very high dimensional data, you can often use PCA to compress the data to lower dimensions so that your learning algorithm runs much faster, and you can often do this while sacrificing almost no performance in your learning algorithm. There's one other use of PCA for learning, which is this: you remember when we talked about learning theory, we said that the more features you have, the more complex your hypothesis class, if, say, you're doing linear classification. If you have lots of features, then you may be more prone to overfitting. So one other thing you could do with PCA is just use it to reduce the dimension of your data, so that you have fewer features and you may be slightly less prone to overfitting. This particular application of PCA to learning, I should say, sometimes works; it often works, and I'll actually later show a couple of examples where you do this and it works. But this particular application of PCA, looking at industry, also seems to be a little bit overused, and in particular, actually, let me say more about that later. We'll come back to that later. There are a couple of other applications to talk about. One is outlier detection, or anomaly detection. The idea is: suppose I give you a data set. You may then run PCA to find roughly the subspace on which your data lies.
And then if you want to find anomalies in future examples, you just look at your future examples and see whether they lie very far from your subspace. This isn't a fantastic anomaly detection algorithm; it's not very good, but the idea is that if I give you a data set, you may find a low dimensional subspace on which it lies, and then if you ever find a point that's far from your subspace, you can flag it as an anomaly. So this really isn't the best anomaly detection algorithm, but sometimes this is done.
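A rough sketch of that idea, with the usual caveats (my own code and my own hypothetical threshold, not a method endorsed in the lecture): project a point onto the PCA subspace, and flag it if the residual, its component orthogonal to the subspace, is large.

```python
import numpy as np

def pca_subspace(X, k):
    """Top-k eigenvectors (columns) of the covariance of mean-normalized X."""
    Sigma = (X.T @ X) / X.shape[0]
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def distance_from_subspace(x, U):
    """Length of the component of x orthogonal to the span of U's columns."""
    return float(np.linalg.norm(x - U @ (U.T @ x)))

# Usage sketch: fit on mean-normalized training data, then flag far-away points.
# U = pca_subspace(X_train, k=5)
# anomaly = distance_from_subspace(x_new - train_mean, U) > threshold  # threshold is problem-specific
```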
And the last application that I want to go into in a little bit more detail is matching, or finding better distance calculations. So let me say what I mean by this; I'll go into much more detail on this last one. So here's the idea. Let's say you want to do face recognition, and you have 100 by 100 pixel images. So a picture of a face is some array of pixels, and the pixels have different grayscale values, and depending on the different grayscale values, you get different pictures of people's faces. And so if you have 100 by 100 pixel images, you can think of each face as a 10,000-dimensional vector, which is very high dimensional. The cartoon to keep in mind is: here's my plot of my 10,000-dimensional space, and if you look at a lot of different pictures of faces, each face will be some point in this 10,000-dimensional space. And in this cartoon, I want you to think of it as, I mean, most of this data lies on a relatively low dimensional subspace, because in this 10,000-dimensional space, really not all points correspond to valid images of faces. The vast majority of values, the vast majority of 10,000-dimensional images, just correspond to random-noise-looking things and don't correspond to a valid image of a face. Instead, the space of possible face images probably spans a much lower dimensional subspace. [Crosstalk] Instructor (Andrew Ng): And so what we'll do is use a low dimensional subspace on which the data lies, and in practice a PCA dimension of 50 would be fairly typical. So when you think of it, there are some axes that really measure the shape or the appearance of the face, and there are some other axes that we're not interested in, that are maybe just random noise. So what we'll do for face recognition is this: I might give you a picture of a face and ask you which face looks the most similar to it; I'll give you a picture of someone and ask, can you find other pictures of the same person. And so the key step in doing that is to look at two faces and compare how similar those two faces are. So here's how we'll use PCA. Given a face here and a different face there, the way I'll measure the difference between these two faces is that I won't just measure the Euclidean distance similarity. Instead, I'll take this face and project it onto my 50-dimensional subspace, take that face and project it onto my 50-dimensional subspace, and measure the similarity between those two points on the subspace.
And so when I do that, I may end up with a face here and a face there that look very far apart in the original space, but when I project them onto the subspace, they may end up looking much more similar. So what I want to show you in a second on the laptop is this: given that each of these training examples is a 10,000-dimensional vector, I can plot each one as a grayscale image, and I get a picture of some person's face. What I'll also show you on the laptop are plots of my eigenvectors. The eigenvectors also live in a 10,000-dimensional space, and if I plot one of those, I get some image; these images are called eigenfaces. And what PCA is doing is essentially using linear combinations of these eigenface images to approximate the x^(i)'s, and it's the eigenfaces, these bases u_1, u_2 and so on, that hopefully span the subspace along which my faces live. [Crosstalk] Instructor (Andrew Ng): So here's a training set comprising aligned images like these, a much larger set than is shown here, and when you run PCA, these are some of the plots of the u_i's. Remember I said to plot the eigenvectors u_i in the same way that we plot the training examples, and so you end up with the eigenvectors being grayscale images like these, and we approximate images of faces with linear combinations of them.
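To make the matching step concrete, here is a rough, purely illustrative sketch of comparing two faces in the eigenface subspace. The 100 by 100 image size and the choice of 50 components echo the numbers mentioned above, but the function names, and the use of an SVD instead of an explicit 10,000 by 10,000 covariance matrix, are my own choices, not the lecture's code or Pentland's system.

```python
import numpy as np

def eigenfaces(F, k=50):
    """F is (m, 10000): each row is a 100x100 face image flattened into a vector.

    Returns the mean face and the top-k eigenfaces (as columns). The SVD of the
    centered data matrix gives the same directions as eigendecomposing the
    covariance matrix, but is much cheaper when m is much smaller than 10,000.
    """
    mu = F.mean(axis=0)
    A = F - mu
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return mu, Vt[:k].T                          # columns are u_1, ..., u_k

def face_distance(f1, f2, mu, U):
    """Distance between two face vectors measured in the eigenface subspace."""
    y1 = U.T @ (f1 - mu)
    y2 = U.T @ (f2 - mu)
    return float(np.linalg.norm(y1 - y2))
```

Ranking a database of faces by this kind of subspace distance to a query face is, roughly, the nearest-neighbor style search in the demo described next.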
So I said it's dangerous to look at individual eigenvectors and you really shouldn't do that, but let me just do that a little bit anyway. If you look at the first eigenvector, it corresponds roughly to whether the face is illuminated from the left or from the right; depending on how heavy the weight is on this image, that roughly captures variation in illumination. The second image is harder to tell; it seems to be capturing variation in the overall brightness of the face. The third eigenvector captures roughly how much of a shadow, or maybe a beard or something, a person has, and so forth. It's dangerous to look at individual eigenvectors, but it is sometimes slightly informative to look at them. Here's an example of a specific application of eigenfaces. This is from Sandy Pentland's, Alex Pentland's, lab at MIT. In this display, the upper leftmost image is the input to the eigenfaces algorithm. The algorithm is then asked to go through a database and find the most similar faces. This second image here is what it thinks is the most similar face to the input; the next one over is the second most similar; the next one over is the third most similar, and so on. And so using eigenfaces, and by measuring differences between faces in that lower dimensional subspace, it is able to do a reasonable job identifying pictures of the same person, even with faces of the person removed. And the next row shows the next most similar faces, and so on, so this one is the fourth most similar face, and so on. This is a usual application of eigenfaces. The last thing I want to say is that when people tell me about machine learning problems, I do often recommend they try PCA for different things, so PCA is a useful thing to know for compression, visualization, and so on. But in industry I tend to see PCA used slightly too often; there are also times when you see people using PCA when they really shouldn't. So just one final piece of advice for the use of PCA: before you use it, also think about whether you could just work with the original training data x^(i) without compressing it, since I've definitely seen people compress the data when they really didn't need to. But having said that, I also often advise people to use PCA for various problems, and it often works.
Okay. Sorry about running late. So let's wrap up for today.