This presentation is delivered by the Stanford Center for Professional Development.

Okay. So welcome back. What I want to do today is wrap up our discussion of factor analysis, and in particular I want to step through parts of the derivation of EM for factor analysis, because there are a few steps in the EM derivation that are particularly tricky, and there are specific mistakes that people often make when deriving EM algorithms for models like factor analysis. So I wanted to show you how to do those steps right, so you can apply the same ideas to other problems as well. And then in the second half or so of this lecture, I'll talk about principal component analysis, which is a very powerful algorithm for dimensionality reduction. We'll see later what that means.

So just to recap, in a previous lecture I described a few properties of Gaussian distributions. One was that if you have a random vector X that can be partitioned into two portions, X1 and X2, and if X is Gaussian with mean mu and covariance sigma, where mu is itself a partitioned vector and sigma is a partitioned matrix written in terms of four sub-blocks, then you can look at the distribution of X and ask what the marginal distribution of, say, X1 is. And the answer we gave last time was that the marginal distribution of X1 is Gaussian with mean mu1 and covariance sigma one one, where sigma one one is the upper-left block of that covariance matrix sigma. So this one is no surprise.
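For reference, the partitioned Gaussian and the marginalization property being described can be written as:

\[
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma),
\qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
\qquad
x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}).
\]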
And I also wrote down the formula for computing conditional distributions, such as P of X1 given X2. Last time I wrote down that the distribution of X1 given X2 is also Gaussian, with parameters that I wrote as mu of one given two and sigma of one given two, where mu of one given two is, let's see, given by this formula. Okay?

So with these formulas we'll be able to take a pair of jointly Gaussian random variables, X1 and X2, which here are both vectors, and compute the marginal and conditional distributions, P of X1 or P of X1 given X2. I'll come back and use the marginal formula in a second, and then when I come back and derive the E step of the EM algorithm for factor analysis, I'll actually be using these two formulas again.

And just to continue summarizing what we did last time, we said that factor analysis is an unsupervised learning problem, so we're given an unlabeled training set where each XI is a vector in RN as usual. We want to model the density of X, and our model is that we imagine there's a latent random variable Z generating the data, where Z is Gaussian with zero mean and identity covariance. Z will be some low-dimensional thing [inaudible] and we [inaudible]. And we imagine that X is generated as mu plus lambda Z plus epsilon, where epsilon is a Gaussian random variable with mean zero and covariance matrix psi. And so the parameters of this model are mu, which is an N-dimensional vector; lambda, which is an N-by-D matrix; and psi, which is N by N and diagonal.
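For reference, the conditional Gaussian formulas referred to above, and the factor analysis model just described, are:

\[
x_1 \mid x_2 \sim \mathcal{N}(\mu_{1\mid 2}, \Sigma_{1\mid 2}),
\qquad
\mu_{1\mid 2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),
\qquad
\Sigma_{1\mid 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21};
\]
\[
z \sim \mathcal{N}(0, I), \qquad \epsilon \sim \mathcal{N}(0, \Psi), \qquad x = \mu + \Lambda z + \epsilon,
\]
with parameters \( \mu \in \mathbb{R}^n \), \( \Lambda \in \mathbb{R}^{n \times d} \), and diagonal \( \Psi \in \mathbb{R}^{n \times n} \).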
So the cartoon I drew last time for factor analysis was, I said, maybe that's a typical example of the data points ZI, and in this example I had D equals one and N equals two, so Z in this example is one-dimensional. And so you take this data, map it to [inaudible] mu plus lambda Z, and that may give you some set of points there. And lastly, this model envisions that you'd place a little Gaussian bump around each of these, say, and sample, and the Xs are then maybe... that would be a typical sample of the Xs under this model.

So how do we fit the parameters of this model? Well, the joint distribution of Z and X is actually Gaussian, with parameters given by some vector [inaudible] mu XZ and some covariance matrix sigma. And as for what those two things are, this vector [inaudible] is a vector of zeroes appended to the vector mu, and the matrix sigma is this partitioned matrix. We also worked this out last time.

So you can ask what the distribution of X is under this model, and the answer is that under this model X is Gaussian with mean mu and covariance lambda lambda transpose plus psi. So I just take the second block of the mean vector and take the lower-right corner block of the covariance matrix, and so this is really my formula for computing the marginal distribution of a Gaussian, except that I'm computing the marginal distribution of the second half of the vector rather than the first half. So this is the marginal distribution of X under my model. And so if you want to learn... Student: [Inaudible] initial distribution [inaudible]? Instructor (Andrew Ng): Let's see. Oh, yes.
Yes. So in this one I'm specifying the conditional distribution of X given Z. The conditional distribution of X given Z is Gaussian with mean mu plus lambda Z and covariance psi. This is what that [inaudible].

So since this is the marginal distribution of X, given my training set of M unlabeled examples I can actually write down the log likelihood of my training set. Actually, no, let's just write down the likelihood. So the likelihood of my parameters given my training set is the product from I equals one to M of P of XI given the parameters. I can actually write down what that is, because XI is Gaussian with mean mu and covariance lambda lambda transpose plus psi, so you can write this down as one over two pi to the N over two times the determinant of that covariance to the one half, times E to the minus one half XI minus mu transpose, times the inverse of that covariance, times XI minus mu. So that's my formula for the density of a Gaussian that has mean mu and covariance lambda lambda transpose plus psi. So this is my likelihood of the parameters given a training set.

And one thing you could do is actually take this likelihood and try to maximize it in terms of the parameters, try to find the maximum likelihood estimates of the parameters. But if you do that, if you take the log to get the log likelihood, take derivatives, and set the derivatives equal to zero, you find that you'll be unable to solve for the maximum analytically. You won't be able to solve this maximum likelihood estimation problem in closed form. If you take the derivative of this with respect to the parameters, set the derivatives equal to zero, and try to solve for the values of the parameters lambda, mu, and psi, you won't be able to do that [inaudible]. So what we'll use instead to estimate the parameters in a factor analysis model will be the EM algorithm.
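For reference, the marginal distribution and the likelihood being written out here are:

\[
x \sim \mathcal{N}(\mu,\ \Lambda\Lambda^T + \Psi),
\]
\[
L(\mu, \Lambda, \Psi) = \prod_{i=1}^{m}
\frac{1}{(2\pi)^{n/2}\,\lvert\Lambda\Lambda^T + \Psi\rvert^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\,(x^{(i)}-\mu)^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)\Big),
\]
and, as noted, setting the derivatives of the log of this to zero has no closed-form solution.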
Student: Why is the log likelihood P of X, and not P of X given Z or P of X and Z or something?

Instructor (Andrew Ng): Oh, right. So the question is why the likelihood is P of X and not P of X given Z or P of X and Z. The answer is, let's see: by analogy to the mixture of Gaussians model, we're given a training set that just comprises a set of unlabeled training examples, and, for convenience, for whatever reason, it's easier for me to write down a model that defines the joint distribution P of X comma Z. But what I would really like to do is maximize, as a function of my parameters (I'm using theta as shorthand to denote all the parameters of the model), the probability of the data I actually observe. And this would be the product from I equals one to M of P of XI parameterized by theta, and this is really integrating out Z. So I only ever get to observe the Xs; the Zs are latent random variables, or hidden random variables. And so what I'd like to do is do this maximization in which I've implicitly integrated out Z. Does that make sense? Actually, are there other questions about the factor analysis model? Okay.

So here's the EM algorithm. In the E step, we compute the conditional distribution of ZI given XI under our current setting of the parameters. And in the M step, we perform this maximization. If this looks a little bit different from the previous versions of EM that you've seen, the only difference is that now I'm integrating over ZI, because ZI is now a Gaussian random variable, a continuous-valued thing, so rather than summing over ZI, I now integrate over ZI. And if you replace the integral with a sum over ZI, then this is exactly the M step you saw when we worked it out for the mixture of Gaussians model.
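For reference, the two steps being written on the board take the standard EM form for a continuous latent variable:

\[
\text{E step:}\quad Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta),
\]
\[
\text{M step:}\quad \theta := \arg\max_{\theta}\ \sum_{i=1}^{m} \int_{z^{(i)}} Q_i(z^{(i)})\,
\log\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\; dz^{(i)}.
\]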
So it turns out that in order to implement the E step and the M step, there are actually sort of three key things that are different from the models you saw previously, so I want to talk about them.

The first one is that in the E step, Z is now a continuous-valued random variable, so you now need a way to represent these continuous-valued densities, probability density functions, to represent QI of Z. Fortunately, in this problem that's not difficult to do, and in particular the conditional distribution of ZI given XI and our parameters (which I'm going to omit from this equation) is going to be Gaussian, with mean and covariance given by these two things, so I write it like that, where the mean is given by this expression here. And the way I got this formula was that if you match this to the formula I had previously for computing conditional distributions of Gaussians, this corresponds to the term mu one plus sigma one two, sigma two two inverse, times X two minus mu two. So those are the terms corresponding to the formula I had previously for computing the conditional distribution of a Gaussian. And the covariance is going to be given by that, and again those are the terms corresponding to the formulas I had for the very first thing I did, the formulas for computing conditional distributions of Gaussians.

And so this is the E step. You need to compute the Q distribution, you need to compute QI of ZI, and to do that what you actually do is compute this vector mu of ZI given XI and this matrix sigma of ZI given XI, and together these represent the mean and covariance of the distribution Q, where Q is going to be Gaussian. So that's the E step.
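Concretely, matching the conditional Gaussian formulas to the joint distribution of (z, x) gives the E-step quantities:

\[
\mu_{z^{(i)}\mid x^{(i)}} = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)} - \mu),
\qquad
\Sigma_{z^{(i)}\mid x^{(i)}} = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda,
\]
so that \( Q_i(z^{(i)}) = \mathcal{N}\big(\mu_{z^{(i)}\mid x^{(i)}},\ \Sigma_{z^{(i)}\mid x^{(i)}}\big) \).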
Now, here's the M step. I'll just mention that there are sort of two ways to derive M steps, especially for Gaussian models like these. Here's the key trick, I guess, which is that when you compute the M step, you often need to compute integrals that look like these: the integral over ZI of QI of ZI times some function of ZI. Let me just write ZI there, for instance. So there are two ways to compute integrals like these.

One way is to write this out explicitly. This is a commonly made... well, not a mistake, a commonly made unnecessary complication, I guess. One way to try to compute this integral is to write it out as an integral over ZI, and we know what QI is, right? QI is a Gaussian, so you get one over two pi to the D over two times the determinant of the covariance of QI (this sigma of ZI given XI, which you computed in the E step) to the one half, and then times E to the minus one half, ZI minus mu of ZI given XI, transpose, sigma inverse, ZI minus mu of ZI given XI. So that's my Gaussian density. And then times ZI, d ZI. And writing this out is the unnecessarily complicated way to do it, because once you've written out this integral, if you want to actually integrate (that's a times, a multiplication), if you actually want to evaluate this integral, it just looks horribly complicated. I'm not quite sure how to evaluate this.

By far the simpler way to evaluate this integral is to recognize that it is just the expectation, with respect to ZI drawn from the distribution QI, of the random variable ZI. And once you see it this way, you notice that this is just the expected value of ZI under the distribution QI, but QI is a Gaussian distribution with mean given by that mu vector and covariance given by that sigma matrix, and so the expected value of ZI under this distribution is just mu of ZI given XI. Does that make sense?
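In symbols, the observation is simply:

\[
\int_{z^{(i)}} Q_i(z^{(i)})\, z^{(i)}\, dz^{(i)}
= \mathbb{E}_{z^{(i)}\sim Q_i}\big[z^{(i)}\big]
= \mu_{z^{(i)}\mid x^{(i)}}.
\]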
So by making the observation that this is just an expected value, there's a much easier way to compute that integral. [Inaudible].

So we'll apply the same idea to the M step. In the M step we want to maximize this. There's also a summation over I outside, but this is essentially the term inside the arg max of the M step. Taking the integral over ZI, just observe that that's actually, I guess, just forming an expectation with respect to the random variable ZI of the thing inside. And so this simplifies to the following.

And it turns out actually all I did here was use the fact that P of X given Z times P of Z equals P of X comma Z, right? That's just combining this term and that term, which gives you the numerator in the original. And it turns out that for factor analysis, of these two terms, this is the only one that depends on the parameters. The distribution P of ZI has no parameters, because ZI was just drawn from a Gaussian with zero mean and identity covariance. And QI of Z is this fixed Gaussian; it doesn't depend on the parameters theta. And so in the M step we really just need to maximize this one term with respect to all the parameters, mu, lambda, and psi.

So let's see. There's sort of one more key step I want to show, but to get there I have to write down an unfortunately large amount of math, so let's go into it. Okay. So in the M step, we want to maximize this, and all expectations are with respect to ZI drawn from the distributions QI; sometimes I'll be sloppy and just omit this.
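For reference, the quantity being maximized in the M step can be written as:

\[
\sum_{i=1}^{m} \mathbb{E}_{z^{(i)}\sim Q_i}\Big[
\log p\big(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi\big)
+ \log p\big(z^{(i)}\big)
- \log Q_i\big(z^{(i)}\big)
\Big],
\]
and only the first term inside the expectation depends on the parameters mu, lambda, and psi.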
And now this distribution, P of XI given ZI: that is a Gaussian density, because XI given ZI is Gaussian with mean mu plus lambda ZI and covariance psi. And so in this step of the derivation, I will actually go ahead and substitute in the formula for the Gaussian density. So I will take that and plug it in here: one over two pi to the N over two times the determinant of psi to the one half, times E to the dot, dot, dot. So I will go ahead and plug in the Gaussian density.

And when you do that, you find that you get the expectation of... excuse me, I forgot to say: to keep the derivation from getting too complicated, I'm actually just going to maximize this with respect to the parameter lambda. So you want to maximize this with respect to lambda, psi, and mu, but just to keep the amount of math I'm going to do in class sane, I'm just going to show how to maximize this with respect to the matrix lambda, and I'll pretend that psi and mu are fixed.

And so if you substitute in the Gaussian density, you get some expected value of a constant (the constant may depend on psi but not on lambda) minus this thing. And this quadratic term essentially came from the exponent in your Gaussian density: when I take the log of the exponential, I end up with this quadratic term.

And so you take the derivative of the expression above with respect to the matrix lambda and you set that to zero. We want to maximize this expression with respect to the parameter lambda, so you take the derivative of this with respect to lambda (excuse me, that's a minus sign) and set the derivative of this expression to zero, because you set derivatives to zero to maximize things, right? When you do that and simplify, you end up with the following.
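The update that results takes the standard form:

\[
\Lambda := \Big(\sum_{i=1}^{m} \big(x^{(i)} - \mu\big)\, \mathbb{E}\big[z^{(i)}\big]^T\Big)
\Big(\sum_{i=1}^{m} \mathbb{E}\big[z^{(i)} z^{(i)T}\big]\Big)^{-1}.
\]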
And so that's, in the M step, the value you should get and use to update your parameter lambda. And again, the expectations are with respect to ZI drawn from the distributions QI.

So the very last step of this derivation is that we need to work out what these two expectations are. The very first term, E of ZI transpose, is just mu of ZI given XI transpose, because the QI distribution has mean given by mu of ZI given XI. To work out the other term, let me just remind you that if you have a random variable Z that's Gaussian with mean mu and covariance sigma, then the covariance sigma is E of Z Z transpose minus E of Z times E of Z transpose. That's one of the definitions of the covariance, and so this implies that E of Z Z transpose equals sigma plus E of Z times E of Z transpose. And so this second term here becomes sigma of ZI given XI plus mu of ZI given XI times mu of ZI given XI transpose. Okay?

And so that's how you compute E of ZI transpose and E of ZI ZI transpose and substitute them back into this formula, and you would then have your M step update for the parameter matrix lambda.

And the last thing I want to point out in this derivation is that, probably because of the name EM algorithm, expectation maximization, one common mistake with the EM algorithm is that in the E step some people want to take the expectation of the random variable Z, and then in the M step they just plug in that expected value everywhere they see it. So in particular, one common mistake in deriving EM for factor analysis is to look at this and say, oh look, I see Z Z transpose. Let's just plug in the expected value under the Q distribution.
And so they plug in mu of ZI given XI times mu of ZI given XI transpose into that expectation, and that would be an incorrect derivation of EM, because it's missing this other term, sigma of ZI given XI. So one common misconception about EM is that in the E step you just compute the expected value of the hidden random variable, and in the M step you plug in that expected value. It turns out that in some algorithms that is the right thing to do. In the mixture of Gaussians and the mixture of [inaudible] models, that would actually give you the right answer, but in general the EM algorithm is more complicated than just taking the expected values of the random variables and then pretending that they were observed at their expected values. So I wanted to go through this just to illustrate that step as well.

So just to summarize, the three key things that came up in this derivation were: first, that for the E step we had a continuous Gaussian random variable, and so to compute the E step we actually compute the mean and covariance of the distribution QI. The second thing that came up was that in the M step, when you see these integrals, if you interpret them as expectations then the rest of the math becomes much easier. And the final thing was, again in the M step, that the EM algorithm is derived from a certain maximization problem that we solve; it is not necessarily just plugging the expected value of ZI in everywhere.

Let's see. I feel like I just did a ton of math and wrote down way too many equations, and even doing this I was skipping many steps. So you can go to the lecture notes to see the derivations of the steps I skipped, like how you actually take derivatives with respect to the matrix lambda, and how to compute the updates for the other parameters as well, for mu and for psi, because this was only for lambda. And so that's the factor analysis algorithm. Justin?
Student: I was just wondering, in the step on the lower right board, you said that the second term doesn't have any parameters that we're interested in, and the first term has all the parameters, so we'll [inaudible], but it seems to me that QI has a lot of parameters [inaudible].

Instructor (Andrew Ng): I see. Right. Let's see. So the question was, doesn't the term QI have parameters? So in the EM algorithm, it turns out that sometimes P of ZI may have parameters, but QI of ZI never has any parameters. In the specific case of factor analysis, P of ZI doesn't have parameters. In other examples, say the mixture of Gaussians model, ZI was a multinomial random variable, and so in that example P of ZI has parameters, but it turns out that QI of ZI will never have any parameters. In particular, QI of ZI is going to be this Gaussian distribution, Gaussian with mean given by mu of ZI given XI and covariance sigma of ZI given XI. And so it's true that mu and sigma may themselves have depended on the values of the parameters I had in the previous iteration of EM, but the way to think about Q is that I'm going to take the parameters from the previous iteration of the algorithm and use them to compute what QI of ZI is. That's the E step of the EM algorithm. And then once I've computed what QI of ZI is, it is a fixed distribution: I'm going to use these fixed values for mu and sigma, and just keep those two values fixed as I run the M step.

Student: So that's... I guess I was confused because in the second point over there there's a lot of... it looks like they're parameters, but I guess they're old iterations of the parameters.

Instructor (Andrew Ng): Oh, yeah. Yes, you're right. When I wrote down QI of ZI, that was a function of, so yeah, the parameters from the previous iteration, and I want to compute the new set of parameters. Okay. More questions?
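Putting the pieces above together, here is a minimal numpy sketch of one EM iteration for factor analysis. It is an illustrative sketch only, not the lecture's board work verbatim: the names (X, mu, Lam, Psi) are my own, and only the lambda update discussed in lecture is shown (the mu and psi updates are in the lecture notes).

```python
import numpy as np

def factor_analysis_em_step(X, mu, Lam, Psi):
    """One illustrative EM iteration for factor analysis.
    X: m x n data matrix (rows are examples), mu: length-n mean,
    Lam: n x d factor loading matrix, Psi: n x n diagonal noise covariance."""
    m, n = X.shape
    d = Lam.shape[1]

    # E step: q_i(z_i) = N(mu_{z|x}, Sigma_{z|x}) for each example.
    S = Lam @ Lam.T + Psi                      # marginal covariance of x
    S_inv = np.linalg.inv(S)
    Xc = X - mu                                # rows are (x_i - mu)
    Mu_z = Xc @ S_inv @ Lam                    # m x d, row i is mu_{z_i|x_i}
    Sigma_z = np.eye(d) - Lam.T @ S_inv @ Lam  # shared d x d posterior covariance

    # M step (lambda only):
    # Lambda = (sum_i (x_i - mu) E[z_i]^T) (sum_i E[z_i z_i^T])^{-1}
    # where E[z_i z_i^T] = Sigma_{z|x} + mu_{z_i|x_i} mu_{z_i|x_i}^T;
    # the Sigma_{z|x} term is exactly what the "just plug in the expectation"
    # shortcut would incorrectly drop.
    A = Xc.T @ Mu_z                            # n x d
    B = m * Sigma_z + Mu_z.T @ Mu_z            # d x d, equals sum_i E[z_i z_i^T]
    Lam_new = A @ np.linalg.inv(B)
    return Lam_new, Mu_z, Sigma_z
```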
So this is probably the most math I'll ever do in a lecture in this entire course. Let's now talk about a different algorithm. Actually, which board was I on?

What I want to do now is talk about an algorithm called principal components analysis, which is often abbreviated PCA. Here's the idea. PCA has a very similar idea to factor analysis, but it sort of gets at the problem a little more directly than factor analysis does.

So the question is this: we're still doing unsupervised learning, so given a training set of M examples where each XI is an N-dimensional vector as usual, what I'd like to do is reduce it to a lower, K-dimensional data set, where K is strictly less than N and quite often will be much smaller than N.

I'll give a couple of examples of why we want to do this. Imagine that you're given a data set that contains measurements of, I don't know, people or something, and unknown to you, whoever collected this data actually included the height of each person in centimeters as well as the height of the person in inches. Because of rounding off to the nearest centimeter or the nearest inch, the values won't exactly match up, but along these two dimensions the data will lie extremely close to a straight line; it won't lie exactly on a straight line because of the rounding, but it will lie very close to it. And so if we have a data set like this, it seems that what you really care about is that axis; that axis is really the variable of interest. That's maybe the closest thing you have to the true height of a person. And this other axis is just noise.
So if you can reduce the dimension of this data from two dimensions to one, then maybe you can get rid of some of the noise in it. Quite often you won't know that this was centimeters and this was inches; you may have a data set with a hundred attributes, and you just didn't happen to notice that one was centimeters and one was inches.

Another example that I sometimes think about: some of you know that my students and I work with [inaudible] helicopters a lot. So imagine that you take surveys, or quizzes, measurements of radio-controlled helicopter pilots. Maybe on one axis you have measurements of your pilot's skill, how skillful your helicopter pilot is, and on another axis maybe you measure how much they actually enjoy flying. And maybe this is really roughly one-dimensional data, and there's some variable of interest, which I'll call maybe pilot attitude, that somehow determines their skill and how much they enjoy flying. And so again, if you can reduce this data from two dimensions to one, maybe you'd have a slightly better measure of what I'm loosely calling pilot attitude, which may be what you really wanted to [inaudible]. So let's talk about an algorithm to do this, and I'll come back and talk about more applications of PCA later.

So here's the algorithm. Before running PCA, normally you will preprocess the data as follows. I'll just write this out, I guess; I know some of you are writing. So this is maybe an unnecessary amount of writing to say: compute the mean of my training set and subtract out the mean, so now I've zeroed out the mean of my training set. And the other step is that I'll compute the variance of each feature after zeroing out the mean, and then I'll divide each feature by its standard deviation, so that each of my features now has equal variance.
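A quick numpy sketch of those two preprocessing steps (the function name and shapes are my own; X is an m x n matrix with examples stored as rows):

```python
import numpy as np

def pca_preprocess(X):
    """PCA preprocessing sketch: zero out the mean of each feature,
    then scale each feature to unit variance."""
    X = X - X.mean(axis=0)     # step 1: subtract the per-feature mean
    std = X.std(axis=0)        # per-feature standard deviation
    std[std == 0] = 1.0        # guard: leave constant features unscaled
    return X / std             # step 2: each feature now has unit variance
```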
So these are some standard preprocessing steps we'll often do for PCA. I'll just mention that the second step is usually done only when your different features are on different scales. So if you're taking measurements of people, one feature may be the height, another may be the weight, another may be the strength, another may be how they age or whatever; all of these quantities are on very different scales, so you'd normally normalize the variance. Sometimes, if all the XIJs are the same type of thing, for example if you're working with images and the XIJs are the pixels, then they're on the same scale because they're all pixel intensity values ranging from zero to 255, and you may omit this step.

So after preprocessing, let's talk about how we would find the main axes along which the data varies. How would you find the principal axes of variation [inaudible]? To do that, let me just go through one specific example, and then we'll formalize the algorithm. Here's my training set, comprising five examples. It has roughly zero mean, and the variance along the X1 and X2 axes is the same. There's the X1 axis, there's the X2 axis. And so the principal axis of variation of this data is roughly the positive 45-degree axis, so what I'd like is for my algorithm to conclude that this direction that I've drawn with this arrow, U, is the best direction onto which to project the data, the axis along which my data really varies.

So let's think about how we formalize that. One thing we want to look at is this: suppose I take that axis (this is really the axis onto which I want to project, the one I want to use to capture most of the variation of my data) and I take my training set and project it onto this line; then I get that set of points.
And so you notice that those dots, the projections of my training set onto this axis, have very large variance. In contrast, if I were to choose a different direction, let's say this is really [inaudible] the worst direction onto which to project my data, and I project all my data onto this axis, then I find that the projections of my data onto this other line, the purple line, have much smaller variance; my points are clustered together much more tightly.

So one way to formalize this notion of finding the main axis of variation of the data is that I would like to find the vector U, the direction U, such that when I project my data onto that direction, the projected points vary as much as possible. In other words, I'm going to find a direction so that when I project my data onto it, the projections are widely spaced out; there's large variance.

So let's see. I want to find the direction U. Just as a reminder, if I have a vector U of norm one, then the projection of a vector XI onto U has length XI transpose U. To project a vector X onto a unit vector, the length of the projection is just XI transpose U. And so to formalize my PCA problem, I'm going to choose a vector U to maximize, subject to the constraint that the norm of U, the length of U, is one, the sum from I equals one to M of the squared length of the projection of XI onto U. In particular, I want the sum of squared distances of the projections from the origin to be large; I want the projections of X onto U to have large variance. And just to simplify some of the math later, let me put a one over M in front.
And so that quantity on the right is equal to one over M times the sum over I of U transpose XI times XI transpose U, and so I can simplify, and I get U transpose, times one over M sum over I of XI XI transpose, times U. So I want to maximize U transpose times some matrix times U, subject to the constraint that the length of U must be equal to one.

And so some of you will recognize that this means U must be the principal eigenvector of this matrix in the middle. Let me just write that down and say a few more words about it. So this implies that U is the principal eigenvector of this matrix, which I'll just call sigma. It actually turns out to be a covariance matrix: sigma equals one over M, sum from I equals one to M, of XI XI transpose.

Actually, let's check: how many of you are familiar with eigenvectors? Oh, cool. Lots of you, almost all of you. Great. What I'm about to say will be extremely familiar, but it's worth saying anyway. So if you have a matrix A and a vector U, and they satisfy AU equals lambda U, then that is what it means for U to be an eigenvector of the matrix A, and the value lambda here is called an eigenvalue. And the principal eigenvector is just the eigenvector that corresponds to the largest eigenvalue.

One thing that some of you may have seen, but I'm not sure that you have, and I just want to relate this to things you already know, is that this optimization problem is really: maximize U transpose sigma U subject to the norm of U being equal to one, and let me write that constraint as U transpose U equals one. And so to solve this constrained optimization problem, you write down the Lagrangian with Lagrange multiplier lambda, because this is a constrained optimization.
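For reference, the constrained problem and its Lagrangian are:

\[
\max_{u}\ u^T \Sigma u \quad \text{subject to}\quad u^T u = 1,
\qquad
\mathcal{L}(u, \lambda) = u^T \Sigma u - \lambda\,(u^T u - 1).
\]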
And so to actually solve this optimization, you take the derivative of L with respect to u, and that gives you, up to a constant factor, sigma u minus lambda u. You set the derivative equal to zero, and this shows that sigma u equals lambda u, and therefore the value of u that solves the constrained optimization problem we care about must be an eigenvector of sigma, and in particular it turns out to be the principal eigenvector. So just to summarize, what have we done? We've shown that given a training set, if you want to find the principal axis of variation of the data, the 1-D axis on which the data really varies the most, what you do is construct the covariance matrix sigma, the matrix sigma I wrote down just now, and then you find the principal eigenvector of the matrix sigma. And this gives you the best 1-D subspace onto which to project the data. And more generally, if you want a K-dimensional subspace onto which to project your data, you would then choose u_1 through u_K to be the top K eigenvectors of sigma. And by top K eigenvectors, I just mean the eigenvectors corresponding to the K largest eigenvalues. I guess I showed this only for the case of a 1-D subspace, but this holds more generally. And now the eigenvectors u_1 through u_K give you a new basis with which to represent your data. Student: [Inaudible] diagonal? Instructor (Andrew Ng): Let's see. So by convention, PCA will choose orthogonal axes, so I think that's what I'm saying.
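As a concrete illustration of that recipe, here is a minimal numpy sketch; it is not code from the lecture. The function name pca_fit is my own, and it assumes the data has already been mean-normalized, matching the covariance matrix sigma written above.

```python
import numpy as np

def pca_fit(X, k):
    """Return the top-k eigenvectors (as columns) of Sigma = (1/m) * sum_i x_i x_i^T.

    X is an (m, n) array with one training example per row, assumed to be
    mean-normalized already.
    """
    m = X.shape[0]
    Sigma = (X.T @ X) / m                        # (n, n) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # Sigma is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    return eigvecs[:, order]                     # columns are u_1, ..., u_k
```

For k equal to one this returns the principal eigenvector, the best 1-D direction; since sigma is symmetric, the returned columns are orthonormal.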
Here's one more example. Imagine that you have a three-dimensional data set; it's very hard to draw in 3-D on the board, so let me just try this. Imagine that my x1 and x2 axes lie in the plane of the board, and the x3 axis points orthogonally out of the board, and imagine that all the data lies roughly in the plane of the board, but there's just a tiny bit of fuzz, so that some of the data lies just a couple of millimeters off the board. So you run PCA on this data. You find that u_1 and u_2 will be some pair of basis vectors that essentially lie in the plane of the board, and u_3 will be an orthogonal axis that points roughly out of the plane of the board. So if I reduce this data to two dimensions, then the basis vectors u_1 and u_2 now give me a new basis with which to construct my lower dimensional representation of the data. Just to be complete about what that means: previously we had, say, fairly high dimensional input data x^(i) in R^n, and now, if I want to represent my data in this new basis given by u_1 up to u_K, I would instead take each of my original training examples x^(i) and replace it with a different vector, which I'll call y^(i), computed as y^(i) = (u_1 transpose x^(i), u_2 transpose x^(i), and so on up to u_K transpose x^(i)). And so the y^(i)'s are going to be K-dimensional, where your choice of K will be less than n, so this is a lower dimensional representation of your data, and it serves as an approximate representation of your original data, where you're using only K numbers to represent each training example rather than n numbers.
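Here is that representation step as a small sketch, under the same assumptions as the earlier snippet (the names are mine, not the lecture's); U is a matrix whose columns are the chosen eigenvectors, for example the output of the hypothetical pca_fit above.

```python
import numpy as np

def pca_transform(X, U):
    """Map each row x^(i) of X to y^(i) = (u_1^T x^(i), ..., u_k^T x^(i)).

    X has shape (m, n); U has shape (n, k) with the top-k eigenvectors as columns.
    """
    return X @ U    # shape (m, k): row i is the k-dimensional representation y^(i)

# Usage sketch:
# U = pca_fit(X, k=2)        # hypothetical helper from the earlier snippet
# Y = pca_transform(X, U)    # each training example is now 2 numbers instead of n
```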
Let's see. Student: [Inaudible] to have eigenvectors [inaudible] trivial eigenvectors? Instructor (Andrew Ng): Is it possible not to have eigenvectors, but to have trivial eigenvectors? Student: [Inaudible] determined by [inaudible]? Is it a condition wherein [inaudible]. Instructor (Andrew Ng): Let's see, what do you mean by trivial eigenvectors? Student: [Inaudible] linear algebra from [inaudible]. Instructor (Andrew Ng): Oh, okay. Yes, I see. Let me see if I can get this right. So there are some matrices that are, I think the term is degenerate; sorry, deficient, I think, is what it's called. Is that right, Ziko? Yeah. So some matrices are deficient and actually don't have a full set of eigenvectors, like an n by n matrix that does not have n distinct eigenvectors. But that turns out not to be possible in this case, because the covariance matrix sigma is symmetric, and symmetric matrices are never deficient, so the matrix sigma will always have a full set of eigenvectors. There's one other issue, though, which is repeated eigenvalues. So, for example, it turns out that in this example, if my covariance matrix looks like this, then the identity of the first two eigenvectors is ambiguous. And by that I mean you can choose this to be u_1 and this to be u_2, or I can just as well rotate these eigenvectors and choose that to be u_1 and that to be u_2, or even choose that to be u_1 and that to be u_2, and so on. And so when you apply PCA, one thing to keep in mind is that eigenvectors can sometimes rotate freely within their subspaces when you have repeated eigenvalues, or close to repeated eigenvalues. And so the way to think about the vectors u is as a basis with which to represent your data, but the basis vectors can sometimes rotate freely, and so it's not always useful to look at the eigenvectors one at a time and say, this is my first eigenvector, capturing whatever, the height of a person, or this is my second eigenvector, and it captures their skill at [inaudible], or whatever. That's a very dangerous thing to do when you do PCA. What is meaningful is the subspace spanned by the eigenvectors, but looking at the eigenvectors one at a time is sometimes a dangerous thing to do, because they can often rotate freely.
Tiny numerical changes can cause the eigenvectors to change a lot, but the subspace spanned by the top K eigenvectors will usually be about the same. It actually turns out there are multiple possible interpretations of PCA. I'll just give one more, without proof, which is that, given a training set like this, another view of PCA is, and let me just choose a direction; this is not the principal component, I'm just choosing some direction and projecting my data onto it. This is clearly not the direction PCA would choose, but what you can think of PCA as doing is choosing a subspace onto which to project your data so as to minimize the sum of squared differences between the projections and the original points. So, in other words, another way to think of PCA is that it's trying to minimize the sum of squares of these distances between the dots, the points x, and the points onto which I'm projecting the data. It turns out there are actually, I don't know, something like nine or ten different interpretations of PCA; this is another one. There are a bunch of ways to derive PCA. You'll get to play with some PCA ideas more in the next problem set. What I want to do next is talk about a few applications of PCA. Here are some ways that PCA is used. One is visualization. Very often you have high dimensional data sets. Someone gives you a 50-dimensional data set, and it's very hard for you to look at a data set that's 50-dimensional and understand what's going on, because you can't plot something in 50 dimensions. So common practice, if you want to visualize a very high dimensional data set, is to take your data and project it into, say, a 2-D plot, or project it into a 3-D plot so you can render a 3-D display on a computer, so that you can better visualize the data and look for structure.
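Here is a minimal sketch of that kind of visualization, again just my own illustration: numpy and matplotlib are assumed, and a synthetic data set stands in for a real 50-dimensional one.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 50-dimensional data that secretly varies mostly within a 3-D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 50))
X += 0.05 * rng.normal(size=X.shape)
X -= X.mean(axis=0)                              # mean-normalize

Sigma = (X.T @ X) / X.shape[0]                   # covariance matrix
vals, vecs = np.linalg.eigh(Sigma)
U = vecs[:, np.argsort(vals)[::-1][:2]]          # top two eigenvectors

Y = X @ U                                        # 2-D coordinates for plotting
plt.scatter(Y[:, 0], Y[:, 1], s=8)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```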
One particular example of doing this that I learned about recently was in Krishna Shenoy's lab here at Stanford, in which he had readings from tens of different parts of a monkey brain; I actually don't know if the number was exactly 50, but it was tens of neurons. So you'd have these 50-dimensional readings, 50-dimensional vectors corresponding to different amounts of electrical activity in different parts of the monkey brain, and so you have a 50-dimensional time series, and it's very hard to visualize very high dimensional data. But what he did was use PCA to project this 50-dimensional data down to three dimensions, and then you can visualize this data in three dimensions just using a 3-D plot, so you could visualize what the monkey is thinking over time. Another common application of PCA is compression, so if you have high dimensional data and you want to store it with fewer numbers, clearly PCA is a great way to do this. It turns out also that sometimes in machine learning you're just given extremely high dimensional input data, and for computational reasons you don't want to deal with such high dimensional data. And so one fairly common use of PCA is to take very high dimensional data and represent it with a low dimensional subspace; that's the y^(i) representation I wrote down just now, so that you can work with much lower dimensional data.
And it just seems to be a fact of life that when you're given extremely high dimensional data, almost all the time in practice, of all the high dimensional data sets I've ever seen, the points often lie on much lower dimensional subspaces, so very often you can dramatically reduce the dimension of your data without really throwing away too much of the information. So, let's see. If your learning algorithm takes a long time to run on very high dimensional data, you can often use PCA to compress the data to lower dimensions so that your learning algorithm runs much faster, and you can often do this while sacrificing almost no performance in your learning algorithm. There's one other use of PCA for learning, which is this: you remember when we talked about learning theory, we said that the more features you have, the more complex your hypothesis class, if, say, you're doing linear classification. If you have lots of features, then you may be more prone to overfitting. So one other thing you could do with PCA is just use it to reduce the dimension of your data, so that you have fewer features and you may be slightly less prone to overfitting. This particular application of PCA to learning, I should say, sometimes works; it often works, and I'll actually later show a couple of examples where you do this and it works. But this particular application of PCA, looking at industry, also seems to be a little bit overused, and in particular, actually, let me say more about that later. We'll come back to that later. There are a couple of other applications to talk about. One is outlier detection, or anomaly detection. The idea is: suppose I give you a data set. You may then run PCA to find roughly the subspace on which your data lies.
And then if you want to find anomalies in future examples, you just look at your future examples and see whether they lie very far from your subspace. This isn't a fantastic anomaly detection algorithm; it's not very good, but the idea is that if I give you a data set, you may find a low dimensional subspace on which it lies, and then if you ever find a point that's far from your subspace, you can flag it as an anomaly. So this really isn't the best anomaly detection algorithm, but sometimes this is done.
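A rough sketch of that idea, with the usual caveats (my own code and my own hypothetical threshold, not a method endorsed in the lecture): project a point onto the PCA subspace, and flag it if the residual, its component orthogonal to the subspace, is large.

```python
import numpy as np

def pca_subspace(X, k):
    """Top-k eigenvectors (columns) of the covariance of mean-normalized X."""
    Sigma = (X.T @ X) / X.shape[0]
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def distance_from_subspace(x, U):
    """Length of the component of x orthogonal to the span of U's columns."""
    return float(np.linalg.norm(x - U @ (U.T @ x)))

# Usage sketch: fit on mean-normalized training data, then flag far-away points.
# U = pca_subspace(X_train, k=5)
# anomaly = distance_from_subspace(x_new - train_mean, U) > threshold  # threshold is problem-specific
```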
And the last application that I want to go into in a little bit more detail is matching, or finding better distance calculations. So let me say what I mean by this; I'll go into much more detail on this last one. So here's the idea. Let's say you want to do face recognition, and you have 100 by 100 pixel images. So a picture of a face is some array of pixels, and the pixels have different grayscale values, and depending on the different grayscale values, you get different pictures of people's faces. And so if you have 100 by 100 pixel images, you can think of each face as a 10,000-dimensional vector, which is very high dimensional. The cartoon to keep in mind is: here's my plot of my 10,000-dimensional space, and if you look at a lot of different pictures of faces, each face will be some point in this 10,000-dimensional space. And in this cartoon, I want you to think of it as, I mean, most of this data lies on a relatively low dimensional subspace, because in this 10,000-dimensional space, really not all points correspond to valid images of faces. The vast majority of values, the vast majority of 10,000-dimensional images, just correspond to random-noise-looking things and don't correspond to a valid image of a face. Instead, the space of possible face images probably spans a much lower dimensional subspace. [Crosstalk] Instructor (Andrew Ng): And so what we'll do is use a low dimensional subspace on which the data lies, and in practice a PCA dimension of 50 would be fairly typical. So when you think of it, there are some axes that really measure the shape or the appearance of the face, and there are some other axes that we're not interested in, that are maybe just random noise. So what we'll do for face recognition is this: I might give you a picture of a face and ask you which face looks the most similar to it; I'll give you a picture of someone and ask, can you find other pictures of the same person. And so the key step in doing that is to look at two faces and compare how similar those two faces are. So here's how we'll use PCA. Given a face here and a different face there, the way I'll measure the difference between these two faces is that I won't just measure the Euclidean distance similarity. Instead, I'll take this face and project it onto my 50-dimensional subspace, take that face and project it onto my 50-dimensional subspace, and measure the similarity between those two points on the subspace.
And so when I do that, I may end up with a face here and a face there that look very far apart in the original space, but when I project them onto the subspace, they may end up looking much more similar. So what I want to show you in a second on the laptop is this: given that each of these training examples is a 10,000-dimensional vector, I can plot each one as a grayscale image, and I get a picture of some person's face. What I'll also show you on the laptop are plots of my eigenvectors. The eigenvectors also live in a 10,000-dimensional space, and if I plot one of those, I get some image; these images are called eigenfaces. And what PCA is doing is essentially using linear combinations of these eigenface images to approximate the x^(i)'s, and it's the eigenfaces, these bases u_1, u_2 and so on, that hopefully span the subspace along which my faces live. [Crosstalk] Instructor (Andrew Ng): So here's a training set comprising aligned images like these, a much larger set than is shown here, and when you run PCA, these are some of the plots of the u_i's. Remember I said to plot the eigenvectors u_i in the same way that we plot the training examples, and so you end up with the eigenvectors being grayscale images like these, and we approximate images of faces with linear combinations of them.
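To make the matching step concrete, here is a rough, purely illustrative sketch of comparing two faces in the eigenface subspace. The 100 by 100 image size and the choice of 50 components echo the numbers mentioned above, but the function names, and the use of an SVD instead of an explicit 10,000 by 10,000 covariance matrix, are my own choices, not the lecture's code or Pentland's system.

```python
import numpy as np

def eigenfaces(F, k=50):
    """F is (m, 10000): each row is a 100x100 face image flattened into a vector.

    Returns the mean face and the top-k eigenfaces (as columns). The SVD of the
    centered data matrix gives the same directions as eigendecomposing the
    covariance matrix, but is much cheaper when m is much smaller than 10,000.
    """
    mu = F.mean(axis=0)
    A = F - mu
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return mu, Vt[:k].T                          # columns are u_1, ..., u_k

def face_distance(f1, f2, mu, U):
    """Distance between two face vectors measured in the eigenface subspace."""
    y1 = U.T @ (f1 - mu)
    y2 = U.T @ (f2 - mu)
    return float(np.linalg.norm(y1 - y2))
```

Ranking a database of faces by this kind of subspace distance to a query face is, roughly, the nearest-neighbor style search in the demo described next.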
So I said it's dangerous to look at individual eigenvectors and you really shouldn't do that, but let me just do that a little bit anyway. If you look at the first eigenvector, it corresponds roughly to whether the face is illuminated from the left or from the right; depending on how heavy the weight is on this image, that roughly captures variation in illumination. The second image is harder to tell; it seems to be capturing variation in the overall brightness of the face. The third eigenvector captures roughly how much of a shadow, or maybe a beard or something, a person has, and so forth. It's dangerous to look at individual eigenvectors, but it is sometimes slightly informative to look at them. Here's an example of a specific application of eigenfaces. This is from Sandy Pentland's, Alex Pentland's, lab at MIT. In this display, the upper leftmost image is the input to the eigenfaces algorithm. The algorithm is then asked to go through a database and find the most similar faces. This second image here is what it thinks is the most similar face to the input; the next one over is the second most similar; the next one over is the third most similar, and so on. And so using eigenfaces, and by measuring differences between faces in that lower dimensional subspace, it is able to do a reasonable job identifying pictures of the same person, even with faces of the person removed. And the next row shows the next most similar faces, and so on, so this one is the fourth most similar face, and so on. This is a usual application of eigenfaces. The last thing I want to say is that when people tell me about machine learning problems, I do often recommend they try PCA for different things, so PCA is a useful thing to know for compression, visualization, and so on. But in industry I tend to see PCA used slightly too often; there are also times when you see people using PCA when they really shouldn't. So just one final piece of advice for the use of PCA: before you use it, also think about whether you could just work with the original training data x^(i) without compressing it, since I've definitely seen people compress the data when they really didn't need to. But having said that, I also often advise people to use PCA for various problems, and it often works.
Okay. Sorry about running late. So let's wrap up for today.