This presentation is delivered by the Stanford Center for Professional Development.

So welcome back, and what I want to do today is continue our discussion of the EM Algorithm. In particular, I want to talk about the EM formulation that we derived in the previous lecture and apply it to the mixture of Gaussians model, apply it to a different model — a mixture of naive Bayes model — and then the last part of today's lecture will be on the factor analysis algorithm, which will also use EM. And as part of that, we'll actually take a brief digression to talk a little bit about sort of useful properties of Gaussian distributions.

So just to recap where we are: in the previous lecture, I started to talk about unsupervised learning, which is the class of machine learning problems where you're given an unlabeled training set comprising m examples, right? And so the fact that there are no labels — that's what makes this an unsupervised learning problem.

So one problem that I talked about last time was: what if you're given a data set that looks like this, and you want to model the density P of X from which you think the data had been drawn? And so with a data set like this, maybe you think it was generated by a mixture of two Gaussians, and we started to talk about an algorithm for fitting a mixture of Gaussians model, all right? And so we said that we would model the density of X — P of X — as the sum over Z of P of X given Z, times P of Z, where this latent random variable, meaning this hidden random variable, Z indicates which of the two Gaussian distributions each of your data points came from. And so we have, you know, Z was multinomial with parameter phi, and X conditioned on Z coming from the j-th Gaussian was given by a Gaussian with mean mu_j and covariance sigma_j, all right?
So, like I said at the beginning of the previous lecture, I first talked about a very specific algorithm that I sort of pulled out of the air for fitting the parameters of this model — the parameters phi, mu and sigma — but then in the second half of the previous lecture I talked about what's called the EM Algorithm, in which our goal is maximum likelihood estimation of the parameters. So we want to maximize, in terms of theta, you know, the sort of usual log likelihood, parameterized by theta. And because we have a latent random variable Z, this is really maximizing, in terms of theta, the sum over i of the log of the sum over z_i of P of x_i, z_i, parameterized by theta. Okay?

So, using Jensen's inequality, last time we worked out the EM Algorithm, in which in the E step we would choose these probability distributions Q_i to be the posterior on Z given X, parameterized by theta, and in the M step we would set theta to be the value that maximizes this. Okay? So these are the updates we worked out last time.

And the cartoon that I drew was that you have this log likelihood function l of theta that's often hard to maximize, and what the E step does is choose these probability distributions Q_i. In the cartoon, what that corresponded to was finding a lower bound for the log likelihood, where the horizontal axis is theta. And then in the M step you maximize that lower bound, right? So maybe you were here previously, and so you jump to the new point, the new maximum of this lower bound. Okay? And so this little curve here, right — this lower bound function here — that's really the right-hand side of that argmax. Okay? So this whole thing is inside the argmax.
If you view this thing as a function of theta, this function of theta is a lower bound for the log likelihood of theta, and so in the M step we maximize this lower bound, and that corresponds to jumping to the new maximum of the lower bound.

So it turns out that in the EM Algorithm — so why do we even bother with the EM Algorithm? It turns out that very often, and this will be true for all the examples we see today, it turns out that very often in the EM Algorithm maximizing in the M step — so performing the maximization in the M step — will be tractable and can often be done analytically, in closed form. Whereas if you were trying to maximize this objective directly — if you take this formula on the right, this maximum likelihood objective, set its derivatives to zero and try to solve — you'll find that you're unable to obtain a solution in closed form to this maximization. Okay?

And so, to give you an example of that: you remember our discussion on exponential family models, right? It turns out that if X and Z are jointly, I guess, in the exponential family — so if P of X, Z parameterized by theta is an exponential family distribution, which turns out to be true for the mixture of Gaussians distribution — then it turns out that the M step here will be tractable, and the E step will also be tractable, and so you can do each of these steps very easily. Whereas performing — trying to perform — this original maximum likelihood estimation problem, on this one, right, will be computationally very difficult. You're going to set the derivatives to zero and try to solve for that, and analytically you won't be able to find an analytic solution to this. Okay?

So what I want to do in a second is actually take this view of the EM Algorithm and apply it to the mixture of Gaussians model.
I want to take these E steps and M steps and work them out for the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the EM Algorithm. It turns out there's one other way of thinking about the EM Algorithm, which is the following: I can define an optimization objective J of theta, Q — I define it to be this. This is just the thing inside the argmax in the M step. Okay?

And so what we proved using Jensen's inequality is that the log likelihood of theta is greater than or equal to J of theta, Q. So in other words, we proved last time that for any value of theta and Q, the log likelihood upper-bounds J of theta and Q. And so, just to relate this back to, sort of, yet more things that you already know, you can also think of this as coordinate ascent, right? Remember our discussion a while back on the coordinate ascent optimization algorithm. So we can show — and I won't actually show this here, so just take my word for it and work it out at home if you want — that EM is just coordinate ascent on the function J. So in the E step you maximize with respect to Q, and then in the M step you maximize with respect to theta. Okay?

So this is another view of the EM Algorithm that shows why it has to converge, for example: you can view it in terms of J of theta, Q having to monotonically increase on every iteration. Okay?

So what I want to do next is actually take this general EM machinery that we worked up and apply it to a mixture of Gaussians model. Before I do that, let me just check if there are questions about the EM Algorithm as a whole. Okay, cool. So let's go ahead and work on the mixture of Gaussians EM, all right? MOG — that's my abbreviation for mixture of Gaussians. So in the E step, recall, we choose those Q distributions, right?
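As a minimal sketch of that alternating maximization — with hypothetical `e_step`, `m_step`, and `log_likelihood` callables standing in for the model-specific computations worked out next — the overall loop might look like this:

```python
import numpy as np

def run_em(X, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iters=200):
    """Generic EM loop, viewed as coordinate ascent on J(theta, Q).
    The E step maximizes J over Q by setting Q_i to the posterior p(z | x_i; theta);
    the M step maximizes J over theta with Q held fixed."""
    prev_ll = -np.inf
    for _ in range(max_iters):
        Q = e_step(X, theta)            # E step: choose the Q distributions
        theta = m_step(X, Q)            # M step: maximize the lower bound over theta
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:          # monotone increase of the objective gives a stopping rule
            break
        prev_ll = ll
    return theta
```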
In particular, I want to work out — so Q is the probability distribution over the latent random variable Z, and in the E step I'm going to figure out how to compute — what is Q_i of z_i equals j. And you can think of this as my writing P of z_i equals j, right, under the Q distribution; that's what this notation means.

And so the EM Algorithm tells us that, let's see, Q_i of j is the posterior probability of Z taking the value j, given x_i and all your parameters. And so, well, the way you compute this is by Bayes' rule, right? So that is going to be equal to P of x_i given z_i equals j, times P of z_i equals j, divided by — the corresponding sum over all the values z_i can take — right? That's all by Bayes' rule. And so this you know, because x_i given z_i equals j — this was a Gaussian with mean mu_j and covariance sigma_j, and so to compute this first term you plug in the formula for the Gaussian density there, with parameters mu_j and sigma_j; and this you'd know because Z was multinomial, right, with parameters given by phi, and so the probability of z_i being j is just phi_j, and so you can substitute these terms in. Similarly, you do the same thing for the denominator, and that's how you work out what Q is. Okay? And so, as in the previous lecture, this value — the probability that z_i equals j under the Q distribution — is what I denoted as w_ij. So that would be the E step; as a concrete sketch of it, the computation looks roughly like the code below.
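Here the parameters are assumed to be stored as a length-k array `phi` of mixing proportions, a k-by-n array `mu` of means, and a length-k list `Sigma` of covariance matrices — names chosen just for this sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_e_step(X, phi, mu, Sigma):
    """E step for the mixture of Gaussians: w[i, j] = Q_i(z_i = j)
    = p(x_i | z_i = j; mu_j, Sigma_j) * phi_j / sum_l p(x_i | z_i = l) * phi_l  (Bayes' rule)."""
    m = X.shape[0]
    k = len(phi)
    w = np.zeros((m, k))
    for j in range(k):
        # numerator: Gaussian density for cluster j times the multinomial prior phi_j
        w[:, j] = multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) * phi[j]
    w /= w.sum(axis=1, keepdims=True)   # denominator: sum over all values z_i could take
    return w
```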
And then in the M step we maximize with respect to all of our parameters. This — well, I seem to be writing the same formula down a lot today. All right. And just so we're completely concrete about how you do that: if you do that — plugging in the quantities that you know — it becomes this, let's see. Right.

And so, so that we're completely concrete about what the M step is doing: in the M step, that was, I guess, Q_i of z_i being j, and in the summation, the sum over j is the sum over all the possible values of z_i, and then this thing here is my Gaussian density. Sorry, guys — this thing, well, this first term here, right, is my P of x_i given z_i, and that's P of z_i. Okay?

And so to maximize this with respect to — say you want to maximize this with respect to all of your parameters phi, mu and sigma. So to maximize with respect to the parameter mu, say, you would take the derivative with respect to mu and set that to zero, and if you actually do that computation you would get, for instance, that — that becomes your update to mu_j. Okay? Just so — the exact equation is unimportant; all of these equations are written down in the lecture notes. I'm writing these down just to be completely concrete about what the M step means. And so: write down that formula, plug in the densities you know, take the derivative, set it to zero, solve for mu_j, and in the same way you set the derivatives equal to zero and solve for your updates for your other parameters phi and sigma as well. Okay?

Well, let me just point out one little tricky bit of this that you haven't seen before — most of you probably already know it, but I'll just mention it — which is that since phi here is a multinomial distribution, when you take this formula and you maximize it with respect to phi, you actually have an additional constraint, right? That — let's see — the sum over j of phi_j must be equal to one.
All right? So, again, in the M step I want to take this thing and maximize it with respect to all the parameters, and when you maximize this with respect to the parameters phi_j, you need to respect the constraint that the sum over j of phi_j must be equal to one. And so, well, you already know how to do constrained maximization, right? We saw the method of Lagrange multipliers — and the generalized Lagrangian — when we talked about support vector machines. And so, to actually perform the maximization in terms of phi_j, you construct the Lagrangian, which is — all right? — that's the equation from above, which I'll denote with the dot dot dot, plus a multiplier times that, where this is sort of the Lagrange multiplier and this is your optimization objective. And so, to actually solve for the parameters phi_j, you set the derivatives of this Lagrangian to zero and solve. Okay? And if you then work through the math, you get the appropriate value to update the phi_j's to — which I won't do here, but all the full derivations are in the lecture notes.

Okay. And so if you actually perform all these computations, you can also verify that — so, I just wrote down a bunch of formulas for the EM Algorithm. At the beginning of the last lecture, for the mixture of Gaussians model, I said: for EM, here's the formula for computing the w_ij's, and here's a formula for computing the mu's, and so on, and this derivation is where all of those formulas actually come from. Okay? Questions about this? Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): Oh, I see. So it turns out that, yes, there's also a constraint that the phi_j's must be greater than zero. It turns out that if you want, you could actually write down the generalized Lagrangian incorporating all of these constraints as well, and you can solve subject to [inaudible] these constraints. It turns out that in this particular derivation — actually, it turns out that very often, when we find maximum likelihood estimates for multinomial distribution probabilities —
it turns out that if you ignore these constraints and you just maximize the formula, you luckily end up with values that actually are greater than or equal to zero. And so, since even ignoring those constraints you end up with parameters that are greater than or equal to zero, that shows that this must be the correct solution, because adding the constraint won't change anything. So this constraint, it turns out, is inactive — if you ignore it and just do what I wrote down, you actually get the right answer. Okay? Great.

So let me just very quickly talk about one more example of a mixture model. And the perfect example for this is: imagine you want to do text clustering, right? So someone gives you a large set of documents and you want to cluster them together into cohesive topics. I think I mentioned the news website news.google.com; that's one application of text clustering, where you might want to look at all of the news stories about today — all the news stories written by everyone, written by all the online news websites, about whatever happened yesterday — and there will be many, many different stories on the same thing, right? And by running a text-clustering algorithm you can group related documents together. Okay?

So how do you apply the EM Algorithm to text clustering? I want to do this to illustrate an example in which you run the EM Algorithm on discrete-valued inputs — where the training examples x_i are discrete values. So what I want to talk about specifically is the mixture of naive Bayes model, and depending on how much you remember about naive Bayes: I talked about two event models — one was the multivariate Bernoulli event model, one was the multinomial event model. Today I'm going to use the multivariate Bernoulli event model. If you don't remember what those terms mean anymore, don't worry about it; I think the equations will still make sense. But in this setting we're given a training set x_1 through x_m.
So we're given m text documents, where each x_i is in {0, 1} to the n. So each of our training examples is an n-dimensional bit vector, right? And this was the representation where x_i subscript j indicates whether word j appears in document i, right?

And let's say that we're going to model z_i — our latent random variable, meaning our hidden random variable — and z_i will take on the two values zero and one. So this means I'm just going to find two clusters, and you can generalize to however many clusters you want.

So in the mixture of naive Bayes model, we assume that z_i is distributed Bernoulli with some parameter phi, so there's some probability of each document coming from cluster one or from cluster zero. And we assume that for the probability of x_i given z_i, right, we'll make the same naive Bayes assumption as we did before. Okay? And more specifically — well, excuse me — right, okay. And so the probability of x_i subscript j being one, given z_i equals, say, zero, will be given by a parameter phi subscript j given z equals zero. So if you take this chalkboard, and if you take all instances of the letter Z and replace them with Y, then you end up with exactly the same equations as I'd written down for naive Bayes a long time ago. Okay?
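Just to make the generative story concrete, here's a small sketch of sampling documents from this model, with two clusters as above; the array names (`phi`, `phi_word`) are made up for this sketch:

```python
import numpy as np

def sample_mixture_naive_bayes(m, phi, phi_word, rng=None):
    """Sample m documents from the mixture of naive Bayes model:
    z_i ~ Bernoulli(phi), then x_ij ~ Bernoulli(phi_word[z_i, j]) independently for each word j.
    phi_word is a (2, n) array; row z holds P(word j appears | cluster z)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.binomial(1, phi, size=m)        # latent cluster label for each document
    X = rng.binomial(1, phi_word[z])        # row i is drawn using cluster z_i's word probabilities
    return X, z
```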
And I'm not actually going to work out the mechanics of deriving the EM Algorithm, but it turns out that if you take this joint distribution of X and Z, and if you work out what the equations are for the EM Algorithm for maximum likelihood estimation of the parameters, you find that in the E step you compute, you know, let's say these parameters — these weights w_i — which are going to be equal to your posterior distribution of z_i being equal to one, conditioned on x_i, parameterized by your phi's and given your parameters; and then the M step. Okay? And that's the equation you get in the M step.

I mean, again, the equations themselves aren't too important; they just sort of convey — I'll give you a second to finish writing, I guess. And when you're done or finished writing, take a look at these equations and see if they make intuitive sense to you — why these three equations, sort of, sound like they might be the right thing to do. Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): Say that again.

Student: Y —

Instructor (Andrew Ng): Oh, yes, thank you. Right. Sorry — just, everywhere I wrote Y I meant Z. Yeah?

Student: [Inaudible] in the first place?

Instructor (Andrew Ng): No. So — what is it? Normally you initialize the phi's to be something else, say randomly. So, just like in naive Bayes, we saw zero probabilities as a bad thing, and for the same reason you try to avoid zero probabilities here, yeah. Okay?

And so just the intuition behind these equations is: in the E step, the w_i's — you're going to take your best guess for whether the document came from cluster one or cluster zero, all right? This is very similar to the intuitions behind the EM Algorithm that we talked about in a previous lecture. So in the E step we're going to compute these weights that tell us: do I think this document came from cluster one or cluster zero.
And then in the M step, I'm going to say — this numerator is the sum over all the elements of my training set of — so, informally, right, w_i is one if I think the document came from cluster one, and so this will essentially sum up all the times I saw word j in documents that I think are in cluster one. And these are sort of weighted by the actual probability that I think it came from cluster one. And then I'll divide by — again, if all of these were ones and zeros, then I'd be dividing by the actual number of documents I had in cluster one. So if all the w_i's were either ones or zeros, then this would be exactly the fraction of documents that I saw in cluster one in which I also saw word j. Okay? But in the EM Algorithm you don't make a hard assignment decision about whether this is in cluster one or in cluster zero; you instead represent your uncertainty about cluster membership with the parameters w_i. Okay?

It actually turns out that when we actually implement this particular model, it turns out that, by the nature of this computation, all the values of the w_i's will be very close to either one or zero, so they'll be numerically almost indistinguishable from ones and zeros. This is a property of naive Bayes: if you actually compute this probability over all those documents, you find that w_i is either, say, 0.0001 or 0.999 — it'll be amazingly close to either zero or one — and so the M step — and so this is pretty much guessing whether each document is in cluster one or cluster zero, and then using formulas that are very similar to maximum likelihood estimation for naive Bayes. Okay?

Cool. And if some of these equations don't look that familiar to you anymore, sort of, go back and take another look at what you saw in naive Bayes, and hopefully you can see the links there as well. Questions about this before I move on? Right, okay.
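Putting those intuitions into code, one EM iteration for this model might look roughly like the sketch below; `w` holds the E-step weights and `phi_word` the per-cluster word probabilities (names made up for this sketch), and it leaves out the smoothing you might add to avoid the zero probabilities mentioned above.

```python
import numpy as np

def mixture_nb_em_step(X, phi, phi_word):
    """One EM iteration for the mixture of naive Bayes model.
    X: (m, n) binary matrix; phi: scalar P(z = 1); phi_word: (2, n) word probabilities."""
    m, n = X.shape
    # E step: w_i = P(z_i = 1 | x_i; parameters) by Bayes' rule under the naive Bayes likelihood.
    log_lik = np.zeros((m, 2))
    for z in (0, 1):
        p = phi_word[z]
        log_lik[:, z] = X @ np.log(p) + (1 - X) @ np.log(1 - p)
    log_lik[:, 0] += np.log(1 - phi)
    log_lik[:, 1] += np.log(phi)
    log_lik -= log_lik.max(axis=1, keepdims=True)     # subtract the max for numerical stability
    post = np.exp(log_lik)
    w = post[:, 1] / post.sum(axis=1)
    # M step: weighted counts, very similar to maximum likelihood estimation for naive Bayes.
    phi_new = w.mean()
    phi_word_new = np.vstack([
        ((1 - w)[:, None] * X).sum(axis=0) / (1 - w).sum(),   # P(word j appears | cluster 0)
        (w[:, None] * X).sum(axis=0) / w.sum(),               # P(word j appears | cluster 1)
    ])
    return w, phi_new, phi_word_new
```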
Of course, the way I got these equations was by churning through the machinery of the EM Algorithm, right? I didn't just write these out of thin air. The way you do this is by writing down the E step and the M step for this model, and then, for the M step, setting derivatives equal to zero and solving; that's how you get the M step and the E step.

So the last thing I want to do today is talk about the factor analysis model, and the reason I want to do this is, sort of, two reasons. One is that factor analysis is kind of a useful model — it's not as widely used as mixtures of Gaussians and mixtures of naive Bayes, maybe, but it's sort of useful. But the other reason I want to derive this model is that there are a few steps in the math that are more generally useful. In particular — this being factor analysis — this will be an example in which we do EM where the latent random variable, the hidden random variable Z, is going to be continuous-valued. And so some of the math we'll see in deriving factor analysis will be a little bit different from what you saw before. It turns out the full derivation of EM for factor analysis is sort of extremely long and complicated, and so I won't inflict that on you in lecture today, but I will still be writing more equations than you'll see me do in other lectures, because there are, sort of, just a few steps in the factor analysis derivation that I'll physically illustrate.

So let's actually [inaudible] the model, and it's really in contrast to the mixture of Gaussians model, all right? So for the mixture of Gaussians model, which was our first model, we had — well, I actually motivated it by drawing a data set like this, right? Say one of you has a data set that looks like this, right? So this was a problem where n equals two — the data is two-dimensional — and you have, I don't know, maybe 50 or 100 training examples, whatever, right?
And I said: maybe you're given an unlabeled training set like this, and maybe you want to model this as a mixture of two Gaussians. Okay? And so a mixture of Gaussians model tends to be applicable where m is larger, and often much larger, than n — where the number of training examples you have is at least as large as, and is usually much larger than, the dimension of the data.

What I want to do is talk about a different problem, where I want you to imagine what happens if either the dimension of your data is roughly equal to the number of examples you have, or maybe the dimension of your data is even much larger than the number of training examples you have. Okay? So how do you model such very high-dimensional data? You'll see this sometimes, right? If you run a plant or something — you run a factory — maybe you have a thousand measurements all through your plant, but you only have, say, 20 days of data. So you can have 1,000-dimensional data, but only 20 examples of it. So, given data that has this property — where, as in the beginning, we're given a training set of m examples — well, what can you do to try to model the density of X?

So one thing you can do is try to model it just as a single Gaussian, right? So rather than a mixture of Gaussians, you just model it as a single Gaussian and say X is distributed Gaussian with mean mu and covariance sigma, where sigma is going to be an n-by-n matrix. And so if you work out the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate for the mean is just the empirical mean of your training set, right — so that makes sense — and the maximum likelihood estimate of the covariance matrix sigma will be this: one over m, times the sum over i of x_i minus mu, times x_i minus mu transpose, all right?
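A quick sketch of those two maximum likelihood formulas in code:

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood fit of a single Gaussian to the rows of X (m examples, n dimensions):
    mu is the empirical mean and Sigma = (1/m) * sum_i (x_i - mu)(x_i - mu)^T."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / m
    # Note: Sigma has rank at most m - 1 after centering, so it is singular whenever m <= n.
    return mu, Sigma
```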
But it turns out that in this regime, where the data is much higher-dimensional — excuse me, where the data's dimension is much larger than the number of training examples you have — if you compute the maximum likelihood estimate of the covariance matrix sigma, you find that this matrix is singular. Okay? By singular, I mean that it doesn't have full rank, or it has zero eigenvalues — I hope one of those terms makes sense. And another way of saying it is that the matrix sigma will be non-invertible.

And just in pictures, one concrete example is if n equals m equals two — if you have two-dimensional data and you have two examples. So I'd have two training examples in two dimensions — this is x_1 and x_2; this is my unlabeled data. If you fit a Gaussian to this data set, you find that — well, you remember I used to draw contours of Gaussians as ellipses, right? So these are examples of different contours of Gaussians. You find that the maximum likelihood estimate Gaussian for this corresponds to a Gaussian where the contours are sort of infinitely thin and infinitely long in that direction. Okay? So the contours will sort of be infinitely thin, right, and stretch infinitely long in that direction.

And another way of saying it is that if you actually plug in the formula for the density of the Gaussian, which is this, you won't actually get a nice answer, because the matrix sigma is non-invertible, so sigma inverse is not defined, and the determinant term is zero. So you'd have one over zero, times e to the minus of something involving a non-invertible matrix — so, not a good model.

So can we do better, right? So, given this sort of data, how do you model P of X? Well, one thing you could do is constrain sigma to be diagonal. So your covariance matrix is — okay? — in other words, you constrain sigma to be this matrix, all right, with zeroes on the off-diagonals.
I hope this makes sense: these zeroes I've written down here denote that everything off the diagonal of this matrix is zero. So the maximum likelihood estimate of the parameters will be pretty much what you'd expect, right? And in pictures, what this means is that you're modeling the distribution with Gaussians whose contours are axis-aligned. So that's one example of a Gaussian where the covariance is diagonal, and here's another example, and so here's a third example — whereas often you'll have seen me use examples of Gaussians where the covariance matrix has non-zero off-diagonal entries. Okay?

And, I don't know, you could do this to model P of X, but this isn't very nice, because you've now thrown away all the correlations between the different variables. So — the axes being X1 and X2, right — you've thrown away — you're failing to capture any of the correlations or the relationships between any pair of variables in your data. Yeah?

Student: Is it — could you say again, what does that do, for the diagonal?

Instructor (Andrew Ng): Say again?

Student: The covariance matrix being diagonal — what does that do, again? I didn't quite understand what the examples mean.

Instructor (Andrew Ng): Okay. So these are the contours of the Gaussian density that I'm drawing, right? So let's see — suppose the covariance matrix is diagonal; then you can ask, what is P of X, parameterized by mu and sigma, right, if sigma is diagonal? And so this will be some Gaussian bump, right? So — oh, boy, my drawing's really bad, but in 2-D the density of a Gaussian is like this bump-shaped thing, right? So this is the density of the Gaussian — wow, and this is a really bad drawing — where those are your axes X1 and X2, and the height of this is P of X, and so those figures over there are the contours of the density of the Gaussian. So those are the contours of this shape.

Student: No, I don't mean the contour.
What's special about these types? What makes them different from a general covariance matrix?

Instructor (Andrew Ng): Oh, I see. Oh, okay, sorry. They're axis-aligned, so — these, let's see — I'm not drawing a contour like this, right? Because the main axes of that one are not aligned with the X1 and X2 axes; that would correspond to a Gaussian where the off-diagonals are non-zero, right? Cool. Okay.

So you could do this, and this sort of works. It turns out that with just two training examples you can learn a non-singular covariance matrix, but you've thrown away all of the correlation in the data, so this is not a great model.

It turns out you can do something — well, actually, we'll come back and use this property later. But it turns out you can do something even more restrictive, which is: you can constrain sigma to be equal to sigma squared times the identity matrix. So in other words, you can constrain it to be a diagonal matrix where, moreover, all the diagonal entries must be the same, and so the cartoon for that is that you're constraining the contours of your Gaussian density to be circular. Okay? This is a sort of even harsher constraint to place on your model.

So either of these versions — diagonal sigma, or sigma being the, sort of, constant-value diagonal — are already very strong assumptions, all right? So if you have enough data, maybe you want to model just a little bit of the correlation between your different variables. So the factor analysis model is one way to attempt to do that. So here's the idea. So this is how the factor analysis model models your data. We're going to assume that there is a latent random variable, okay — which just means a hidden random variable — Z.
So Z is distributed Gaussian with mean zero and covariance the identity, where Z will be a d-dimensional vector now, and d will be chosen so that it is lower than the dimension of your X's. Okay?

And now I'm going to assume that X is given by — well, let me write this. Each x_i is distributed — actually, sorry. We're going to assume that, conditioned on the value of Z, X is given by another Gaussian, with mean given by mu plus lambda Z and covariance given by the matrix psi. So, just to say the second line in an equivalent form: equivalently, I'm going to model X as mu plus lambda Z, plus a noise term epsilon, where epsilon is Gaussian with mean zero and covariance psi.

And so the parameters of this model are going to be a vector mu, which is n-dimensional; a matrix lambda, which is n by d; and a covariance matrix psi, which is n by n. And I'm going to impose an additional constraint on psi — the constraint that psi is diagonal. Okay?

So that was the formal definition — let me actually, sort of, give a couple of examples to make this more concrete. So let's do a kind of example: suppose Z is one-dimensional and X is two-dimensional, and let's see what this model — let's see a, sort of, specific instance of the factor analysis model, and what this gives us in terms of a model over P of X, all right? So let's see.
For this model, let me assume that lambda is the vector (2, 1) and that Psi, which has to be a diagonal matrix, remember, is some particular diagonal matrix. Okay?

So Z is one-dimensional, so let me just draw a typical sample for Z, all right? So I draw each ZI from a Gaussian, and that's a typical sample for what Z might look like, and I'm gonna call these Z1, Z2, Z3 and so on. If this really were a typical sample the order of the Z's would be jumbled up, but I'm just ordering them like this to make the example easier. So, yes, that's a typical sample of the random variable Z from a Gaussian distribution with mean zero and variance one.

And in this example let me just set mu equal to zero, just so it's easier to talk about.

So lambda times Z, right? We'll take each of these numbers and multiply them by lambda. And you find that all of the values for lambda times Z will lie on a straight line, all right? So, for example, this one here would be one, two, three, four, five, six, seven, I guess. So if this was Z7 then this one here would be lambda times Z7, and that's now a point in R2, because lambda is a two by one matrix. And so what I've drawn here is a typical sample for lambda times Z.

And the final step is what a typical sample for X looks like. Well, X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance given by Psi, right? And so the last step, to draw a typical sample for the random variable X, is to take these points, which are really the same as mu plus lambda Z because mu is zero in this example, and around each point I'm going to place an axis-aligned ellipse.
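As a quick illustration, not part of the lecture, here is a small sketch of sampling from this one-dimensional-Z, two-dimensional-X instance of the model. The particular value of Psi is made up, since the board value isn't recoverable from the transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, m = 1, 2, 100                 # latent dim, observed dim, number of samples
mu = np.zeros(n)                    # mu = 0 in this example
Lam = np.array([[2.0], [1.0]])      # lambda = (2, 1)^T, an n-by-d matrix
Psi = np.diag([1.0, 0.7])           # some diagonal Psi (value assumed for illustration)

Z = rng.standard_normal((m, d))                       # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(n), Psi, m)    # epsilon ~ N(0, Psi)
X = mu + Z @ Lam.T + eps                              # x = mu + lambda z + epsilon

# The points lambda*z lie exactly on a line through the origin;
# X is that line plus axis-aligned Gaussian "fuzz" around each point.
```

Plotting X would reproduce the picture described on the board: points scattered in axis-aligned blobs around a line through the origin.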
Or, in other words, I'm going to create a Gaussian distribution centered on this point, and what I've drawn here corresponds to one of the contours of my density for epsilon, right? So you can imagine placing a little Gaussian bump here. And I'll draw an example from this little Gaussian, and let's say I get that point there; I do the same here, and so on. So I draw a bunch of examples from these Gaussians, and the orange points I drew will comprise a typical sample for what the distribution of X is under this model. Okay? Yeah?

Student: Would you add, like, the mean?

Instructor (Andrew Ng): Oh, say that again.

Student: Do you add the mean into that?

Instructor (Andrew Ng): Oh, yes, you do. And in this example I set mu to zero just to make it easier. If mu were something else you'd take the whole picture and sort of shift it to whatever the value of mu is. Yeah?

Student: [Inaudible] horizontal line right there, which was Z. For the X's, what does the Y-axis correspond to?

Instructor (Andrew Ng): Oh, so Z is one-dimensional, and here I'm plotting a typical sample for Z, so this is like zero. So this is just the Z axis, right? Z is one-dimensional data, so this line here is like a plot of a typical sample of values for Z. Okay? Yeah?

Student: You have two axes, right? And the axes pertain to the samples.

Instructor (Andrew Ng): Oh, yes, right.

Student: So you're sort of projecting them onto that?

Instructor (Andrew Ng): Let's not talk about projections yet, but, yeah, right. So these points, that's like X1 and that's X2 and so on, right? The orange points are what I see. In reality, all you ever get to see are the X's, but, just like in the mixture of Gaussians model, I tell a story about how I imagine the data came about: there was a random variable Z that led to the generation of the X's from two Gaussians.
In the same way I'm telling a story here: all the algorithm actually sees are the orange points, but we're going to tell a story about how the data came about, and that story is what comprises the factor analysis model. Okay?

So one way to see the intuition of this model, just informally, not formally, is that you can think of the factor analysis model as modeling the data as coming, more or less, from a lower-dimensional subspace. The data X lies approximately on a one-dimensional line, plus a little bit of random noise, so the X's aren't exactly on that one-dimensional line. That's one informal way of thinking about factor analysis.

We're not doing great on time. Well, let's do this. Let me just do one more quick example, which is: in this example, let's say Z is in R2 and X is in R3, right? So in this example your data Z now lies in 2-D, and let me draw this on a sheet of paper. Okay? So let's say the axes of my sheet of paper are the Z1 and Z2 axes, and here is a typical sample of points Z, right? We'll then take the samples Z, well, actually, let me draw this here as well. All right, so this is a typical sample for Z drawn on the Z1 and Z2 axes, and I guess the origin would be here, so they're centered around zero. And then we'll map those to mu plus lambda Z, and what that means is: imagine the free space of this classroom is R3. We'll take this sample of Z's and map it to a position in free space; so we'll take this sheet of paper and move it to some position and some orientation in 3-D space.
And the last step is you have X equals mu plus lambda Z plus epsilon, and so you take this set of points, which lie in some plane in our 3-D space, and add a little bit of noise to each of them, where the noise comes from axis-aligned Gaussians. Okay? So you end up with a data set that's sort of like a fat pancake, a little bit of fuzz off your pancake.

So that's the model; let's actually talk about how to fit the parameters of the model. Okay?

In order to describe how to fit the model, I first need to write down some properties of Gaussians in a slightly different way. In particular, let's say I have a vector X, and I'm going to use this notation to denote partitioned vectors: X1, X2, where if X1 is, say, an r-dimensional vector, then X2 is an s-dimensional vector, and X is an r plus s dimensional vector. Okay? So I'm going to use this notation to denote taking a vector and partitioning it into two halves: the first r elements followed by the last s elements.

So let's say you have X coming from a Gaussian distribution with mean mu and covariance sigma, where mu is itself a partitioned vector, so we break mu up into two pieces, mu1 and mu2, and the covariance matrix sigma is now a partitioned matrix. Okay? What this means is that you take the covariance matrix sigma and break it up into four blocks, right? And the dimensions are: there will be r elements here, s elements here, and r elements here. So, for example, sigma 1,2 will be an r by s matrix; it's r elements tall and s elements wide.

So this Gaussian is really a joint distribution over a whole bunch of variables, right?
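Since the board isn't visible in the transcript, here is the partitioned notation being described, written out. This is only a restatement of the setup above, not additional material from the lecture.

```latex
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \qquad
x \sim \mathcal{N}(\mu, \Sigma),
```

where x1 and mu1 are r-dimensional, x2 and mu2 are s-dimensional, Sigma_11 is r by r, Sigma_12 is r by s, Sigma_21 is s by r, and Sigma_22 is s by s.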
So X is a vector, so P of X is really a joint distribution over X1 through X of r plus s. We can then ask: what are the marginal and conditional distributions of this Gaussian?

So, for example, with my Gaussian I know what P of X is, but can I compute the marginal distribution of X1? And P of X1 is just equal to, of course, the integral over X2 of P of X1 comma X2, dX2. And if you actually perform that computation, you find that P of X1 is Gaussian with mean given by mu1 and covariance given by sigma 1,1. All right, so this is no surprise: the marginal distribution of a Gaussian is itself a Gaussian, and you just take out the relevant sub-block of the covariance matrix and the relevant sub-vector of the mean vector mu.

You can also compute conditionals: what is P of X1 given a specific value for X2, right? And the way you compute that is the usual way, P of X1 comma X2 divided by P of X2, right? And you know what both of these formulas are: the numerator is just the usual Gaussian, your joint distribution over X1, X2 is a Gaussian with mean mu and covariance sigma, and the denominator, by that marginalization operation I just talked about, is that. So you can plug in the formulas for these two Gaussians and simplify, though the simplification step is actually fairly non-trivial; if you haven't seen it before it will actually be somewhat difficult to do.
But if you plug in the Gaussian densities and simplify that expression, you find that, conditioned on the value of X2, the distribution of X1 is itself going to be Gaussian, and it will have mean mu of 1 given 2 and covariance sigma of 1 given 2, where, well, I'm not going to show the derivation, but mu of 1 given 2 is given by this and sigma of 1 given 2 is given by that. Okay? So these are just useful formulas to know for how to find the conditional distributions and the marginal distributions of a Gaussian. I won't actually show the derivation for this.

Student: Could you repeat the [inaudible]?

Instructor (Andrew Ng): Sure. So this one on the left, mu of 1 given 2, equals mu1 plus sigma 1,2 times sigma 2,2 inverse times X2 minus mu2; and sigma of 1 given 2 equals sigma 1,1 minus sigma 1,2 sigma 2,2 inverse sigma 2,1. Okay? These are also in the lecture notes.
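As a quick aside not from the lecture, here is a small numerical sketch of those marginal and conditional formulas for a partitioned Gaussian. The particular mu and Sigma values are made up for illustration.

```python
import numpy as np

def gaussian_marginal_conditional(mu, Sigma, r, x2):
    """Partition N(mu, Sigma) into the first r coordinates (x1) and the rest (x2).

    Returns the marginal of x1 and the conditional distribution of x1 given x2,
    using mu_{1|2} = mu1 + S12 S22^{-1} (x2 - mu2) and
    Sigma_{1|2} = S11 - S12 S22^{-1} S21."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

    marginal = (mu1, S11)                     # x1 ~ N(mu1, Sigma_11)

    S22_inv = np.linalg.inv(S22)
    mu_cond = mu1 + S12 @ S22_inv @ (x2 - mu2)
    Sigma_cond = S11 - S12 @ S22_inv @ S21
    return marginal, (mu_cond, Sigma_cond)

# Made-up example: a 3-dimensional Gaussian partitioned as r = 1, s = 2.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
(marg_mu, marg_S), (cond_mu, cond_S) = gaussian_marginal_conditional(
    mu, Sigma, r=1, x2=np.array([1.5, -0.5]))
```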
Shoot, I'm not where I was hoping to be on time. Well, actually, it's okay.

So it turns out, I was thinking of skipping this in the interest of time, but I kind of need it, so let's go back and use these formulas in the factor analysis model, right? Let's go back and figure out just what joint distribution the factor analysis model assumes on Z and X. Okay? So under the factor analysis model, the random variables Z and X have some joint Gaussian distribution with a mean vector, which I'll write as mu ZX, and some covariance matrix sigma.

So let's figure out what mu ZX is and what sigma is, and I'll do this so that we get a little bit more practice with partitioned vectors and partitioned matrices. Just to remind you: Z is Gaussian with mean zero and covariance the identity, and X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance Psi. I'm just writing out the same equations again.

So let's first figure out what this vector mu ZX is. Well, the expected value of Z is zero, and, as usual, I'll often drop the square brackets here. And the expected value of X is the expected value of mu plus lambda Z plus epsilon; those last two terms have zero expectation, so the expected value of X is just mu. And so the vector mu ZX, the mean parameter of this Gaussian, is the expected value of the partitioned vector Z, X, which is just zero followed by mu. Okay? That's a d-dimensional zero followed by an n-dimensional mu.

Next, let's work out what the covariance matrix sigma is. So the covariance matrix sigma, written out as a partitioned matrix, will comprise four blocks like that, and the upper left block, which I write as sigma 1,1, is just the covariance matrix of Z, which we know is the identity. I was going to show you briefly how to derive some of the other blocks, so sigma 1,2, that's the upper... oh, actually, excuse me.
Sigma 2,1, which is the lower left block, is E of X minus E of X, times Z minus E of Z, transpose. Now, X minus E of X: X is equal to mu plus lambda Z plus epsilon, and minus E of X is minus mu; and Z minus E of Z is just Z, because the expected value of Z is zero.

And so if you expand this out, the plus mu and minus mu cancel, and you have the expected value of lambda Z Z transpose, plus the expected value of epsilon Z transpose, which is just equal to lambda times the identity matrix, that is, lambda. Okay? Does that make sense? Because this second term is equal to zero: epsilon and Z are independent and both have zero expectation, so that term is zero.

Well, so the final block is sigma 2,2, which is equal to the expected value of mu plus lambda Z plus epsilon minus mu, times the same thing transposed, right? And I won't do this, but it simplifies to lambda lambda transpose plus Psi. Okay?

So, putting all this together, this tells us that the joint distribution of this vector Z, X is going to be Gaussian with mean vector given by that, which we worked out previously, so this is the mu ZX that we worked out previously, and covariance matrix given by that. Okay?
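Collecting the blocks just derived (with the upper-right block following by symmetry of the covariance matrix, since it wasn't derived explicitly above), the joint distribution referred to on the board can be written as:

```latex
\begin{bmatrix} z \\ x \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \vec{0} \\ \mu \end{bmatrix},\;
\begin{bmatrix} I & \Lambda^{\top} \\ \Lambda & \Lambda\Lambda^{\top} + \Psi \end{bmatrix}
\right).
```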
So, in principle, let's see: the parameters of our model are mu, lambda, and Psi. And so in order to find the parameters of this model, we're given a training set of m examples, and we'd like to do maximum likelihood estimation of the parameters.

And so, in principle, one thing you could do is actually write down what P of XI is. So P of XI, the distribution of X: again, you can marginalize this Gaussian, and the distribution of X, which is the lower half of this partitioned vector, is going to have mean mu and covariance given by lambda lambda transpose plus Psi, right? So that's the distribution that we're using to model P of X.

And so, in principle, one thing you could do is actually write down the log likelihood of your parameters, which is just the sum over I of log P of XI, where P of XI will be given by this Gaussian density, right? And I'm using theta as shorthand to denote all of my parameters. You know what the density of a Gaussian is, so you can say P of XI is the Gaussian with mean mu and covariance given by lambda lambda transpose plus Psi. So you can write down the log likelihood of your parameters as follows, and you can try to take derivatives of the log likelihood with respect to your parameters and maximize it, all right?

It turns out that if you do that, you end up with a sort of intractable optimization problem, or at least one that you will not be able to find analytic, closed-form solutions to. So if you say my model of X is this, and you go do maximum likelihood parameter estimation, you won't be able to find the maximum likelihood estimates of the parameters in closed form.
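For concreteness, and not from the lecture, here is a small sketch of evaluating that log likelihood numerically for given parameters. The data and parameter values are made up, and this only illustrates the quantity being maximized, not a fitting procedure.

```python
import numpy as np
from scipy.stats import multivariate_normal

def factor_analysis_log_likelihood(X, mu, Lam, Psi):
    """Log likelihood under the marginal factor analysis model:
    x ~ N(mu, Lam Lam^T + Psi), summed over the rows of X."""
    cov = Lam @ Lam.T + Psi
    return multivariate_normal.logpdf(X, mean=mu, cov=cov).sum()

# Made-up data and parameters, just to show the call.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
mu = np.zeros(2)
Lam = np.array([[2.0], [1.0]])
Psi = np.diag([1.0, 0.7])
print(factor_analysis_log_likelihood(X, mu, Lam, Psi))
```

Maximizing this expression jointly over mu, Lam, and Psi has no closed-form solution, which is the point made above and the motivation for using EM instead.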
So, in order to fit the parameters of this model, what we'll actually do is use the EM algorithm. In the E step we'll compute the QI's, setting QI of ZI to the posterior distribution of ZI given XI, parameterized by theta. The formula looks the same as before, except that one difference is that Z is now a continuous random variable, so in the E step we actually have to find the density QI of ZI; the E step requires that we find the posterior density of the random variable ZI. And the M step will then perform the following maximization, choosing theta to maximize the sum over I of the integral over ZI of QI of ZI times log of P of XI comma ZI, parameterized by theta, over QI of ZI; again, because Z is now continuous, we need to integrate over Z. Okay? So in the M step, because ZI is continuous, we now have an integral over Z rather than a sum.

Okay, I was hoping to go a little bit further in deriving these things, but I don't have time today, so we'll wrap that up in the next lecture. But before I close, let's check if there are questions about the whole factor analysis model. Okay.

So we'll come back in the next lecture; I'll wrap up this model, and I want to go a little bit deeper into the E and M steps, since there are some tricky parts for the factor analysis model specifically. Okay, I'll see you in a couple of days.