This presentation is delivered by the Stanford Center for Professional Development.

Okay. So good morning and welcome back. What I want to do today is actually begin a new chapter in 229 in which I'm gonna start to talk about unsupervised learning. So what I'm gonna do today is very briefly talk about clustering and the k-means algorithm, then talk about a mixture of Gaussians model and a special case of the EM, Expectation Maximization, algorithm for fitting it, describe something called Jensen's inequality, and then we'll use that to derive a general form of the EM, or Expectation Maximization, algorithm, which is a very useful algorithm. We sort of use it all over the place in different unsupervised machine learning applications.

So the cartoon that I used to draw for supervised learning was that you'd be given a data set like this, right, and you'd fit a decision boundary between the positive and negative classes, and we'd call it supervised learning because you're told what the right class label is for every training example, and that was the supervision.

In unsupervised learning, we'll study a different problem. You're given a data set that maybe just comprises a set of points. You're just given a data set with no labels and no indication of what the right answers are or where the supervision is, and it's the job of the algorithm to discover structure in the data. So in this lecture and the next couple of weeks we'll talk about a variety of unsupervised learning algorithms that can look at data sets like these and discover the different types of structure in them.

In this particular cartoon that I've drawn, one piece of structure that you and I can probably see is that this data lives in two different clusters, and so the first unsupervised learning algorithm that I'm gonna talk about will be a clustering algorithm. It'll be an algorithm that looks at a data set like this and automatically breaks it into smaller clusters.
So let's see. When my laptop comes back up, I'll show you an example. So clustering algorithms like these have a variety of applications. Just to rattle off a few of the better-known ones: in biology applications you often cluster different genes, so you take data on a set of genes and cluster the genes together in order to examine them and understand their biological function better.

Another common application of clustering is market research. So imagine you have a customer database of how your different customers behave. It's a very common practice to apply clustering algorithms to break your database of customers into different market segments, so that you can target your products towards different market segments and target your sales pitches specifically to different market segments.

Something we'll do later today, I don't want to do this now, but you can actually go to a website like news.google.com, and that's an example of a website that uses a clustering algorithm to group related news articles together every day to display to you, so that you can see the thousand news articles on whatever the top story of the day is, and the 500 news articles from all the different websites on a different story of the day. And another application is image segmentation, where you take a picture and group together different subsets of the picture into coherent pieces of pixels, to try to understand what's contained in the picture. So that's yet another application of clustering.

So the idea is: given a data set like this, given a set of points, can you automatically group the data set into coherent clusters?

Let's see. I'm still waiting for the laptop to come back so I can show you an example. You know what, why don't I just start to write out the specific clustering algorithm, and then I'll show you the animation later.
So this is called the k-means clustering algorithm, for finding clusters in a data set. The input to the algorithm will be an unlabeled data set, which I write as x(1), x(2), ..., x(m), and because we're now talking about unsupervised learning, you'll see a lot of this: data sets with just the x's and no class labels y.

So what the k-means algorithm does is the following. This will all make a bit more sense when I show you the animation on my laptop. You initialize a set of points, called the cluster centroids, mu_1 through mu_k, randomly, and so if your training examples are vectors in R^n, then your cluster centroids, these mu's, will also be vectors in R^n, and then you repeat until convergence the following two steps.

So the cluster centroids will be your guesses for where the centers of each of the clusters are, and in one of the steps you look at each point x(i), you look at which cluster centroid mu_j is closest to it, and this step is called assigning the point x(i) to cluster j. So you look at each point and pick the cluster centroid that's closest to it. And the other step is you update each cluster centroid to be the mean of all the points that have been assigned to it. Okay.

Let's see. Could you please bring down the display for the laptop? Excuse me. Okay. Okay. There we go. Okay.

So here's an example of the k-means algorithm, and hopefully the animation will make this clearer. This is a little chopped off. This is basically an example I got from Michael Jordan at Berkeley. So these points in green are my data points, and I'm going to randomly initialize a pair of cluster centroids. So the two blue crosses denote the positions of mu_1 and mu_2, say, if I'm going to guess that there's two clusters in this data.
So the steps of the k-means algorithm are as follows. I'm going to repeatedly go through all of the points in my data set, and I'm going to associate each of the green dots with the closer of the two cluster centroids. Visually, I'm gonna denote that by painting each of the dots either blue or red, depending on which is the closer cluster centroid. Okay. So all the points closer to the blue cross are painted blue, and so on.

The other step is updating the cluster centroids, and so I'm going to repeatedly look at all the points that I've painted blue and compute the average of all of the blue dots, and look at all the red dots and compute the average of all the red dots, and then I move the cluster centroids to those respective average locations. So this is now one iteration of k-means.

And now I'll repeat the same process. I look at all the points and assign all the points closer to the blue cross the color blue, and similarly red, and so now I have new assignments of points to the cluster centroids. And finally, I'll again compute the average of all the blue points and the average of all the red points, and update the cluster centroids again as follows, and now k-means has actually converged. If you keep running these two steps of k-means over and over, the cluster centroids and the assignments of the points to their closest cluster centroids will remain the same. Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, I'll answer that in a second. Yeah. Okay. So take a second to look at this again and make sure you understand how the algorithm I wrote out maps onto the animation that we just saw. Do you have a question?

Student: [Inaudible]

Instructor (Andrew Ng): I see. Okay. Let me answer that in a second as well. Okay. So these are the two steps. Step 2.1 was assigning the points to the closest centroid, and step 2.2 was shifting each cluster centroid to be the mean of all the points assigned to that cluster centroid. Okay. Okay.
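For reference, here is a minimal sketch of those two steps in Python with NumPy. This is not code from the lecture; the function name, the choice of initializing centroids at random training points, and the convergence check are all my own assumptions.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch. X is an (m, n) array of points; k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the cluster centroids mu_1, ..., mu_k randomly (here: k distinct training points).
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2.1: assign each point x(i) to the closest cluster centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (m, k) distances
        c = dists.argmin(axis=1)                                         # c[i] = index of closest centroid
        # Step 2.2: move each centroid to the mean of the points assigned to it.
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):   # assignments and centroids no longer change
            break
        mu = new_mu
    return mu, c
```

For the two-cluster example in the animation, this would be called as something like `mu, c = kmeans(X, 2)`.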
So to the questions that we just had: one is, does the algorithm converge? The answer is yes, k-means is guaranteed to converge in a certain sense. In particular, if you define the distortion function to be J(c, mu) = sum from i = 1 to m of ||x(i) - mu_{c(i)}||^2 — so the distortion function is a function of the cluster assignments c and the cluster centroids mu, and it sums the squared distances between the points and the cluster centroids they're assigned to — then you can show, and I won't really prove this here, but you can show that k-means is coordinate descent on the function J.

In particular, who remembers coordinate ascent? We saw it as an optimization algorithm, I don't know, maybe about two weeks ago. Coordinate descent is the algorithm that repeatedly minimizes a function with respect to one set of variables, then with respect to another, alternately. And what you can prove is that the two steps of k-means are exactly optimizing this function with respect to c and with respect to mu, alternately, and therefore this function, J(c, mu), must be decreasing monotonically on every iteration. So the sense in which k-means converges is that this function J(c, mu) can only go down, and therefore it will eventually converge in the sense that it will stop going down. Okay.

It's actually possible that there may be several clusterings that give the same value of J(c, mu), and so k-means may actually switch back and forth between different clusterings, in the extremely unlikely case that there are multiple clusterings giving exactly the same value for this objective function. So k-means may in principle oscillate, but that'll just about never happen in practice, and even if it does, this function J(c, mu) will still converge.
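Written out on its own, the distortion function being minimized is the following (this is just a restatement of the formula above):

```latex
J(c, \mu) \;=\; \sum_{i=1}^{m} \bigl\| x^{(i)} - \mu_{c^{(i)}} \bigr\|^{2}
```

Step 2.1 minimizes J over the assignments c with the centroids mu held fixed, and step 2.2 minimizes J over mu with c held fixed, which is why J is non-increasing from one iteration to the next.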
Another question was how do you choose the number of clusters? So it turns out that the vast majority of the time when people apply k-means, you still just pick a number of clusters by hand, or you try a few different numbers of clusters and pick the one that seems to work best. The number of clusters is just one parameter of this algorithm, so usually I don't think it's very hard to choose. There are some automatic ways of choosing the number of clusters, but I'm not gonna talk about them. When I do this, I usually just pick the number of clusters myself, and the reason is that I think for many clustering problems the true number of clusters is actually ambiguous. So for example, if you have a data set that looks like this, some of you may see four clusters, right, and some of you may see two clusters, and so the right answer for the actual number of clusters is sort of ambiguous. Yeah.

Student: [Inaudible] clusters [inaudible] far away from the data points [inaudible] points in the same cluster?

Instructor (Andrew Ng): I see. Right. So yes, k-means is susceptible to local optima. So this function, J(c, mu), is not a convex function, and k-means, being coordinate descent on a non-convex function, is not guaranteed to converge to the global minimum. So k-means is susceptible to local optima, and one thing you can do is try multiple random initializations, run the clustering a bunch of times, and pick the solution that ended up with the lowest value of the distortion function.

Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, let's see. Right. So what if one cluster centroid has no points assigned to it? One thing you could do is just eliminate it, exactly as you say. Another thing you can do is just reinitialize it randomly, if you really want to keep k clusters.
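A minimal sketch of that multiple-restarts recipe, reusing the hypothetical kmeans function from the earlier snippet (again my own code, not the lecture's):

```python
import numpy as np

def distortion(X, mu, c):
    # J(c, mu): sum of squared distances from each point to its assigned centroid.
    return float(np.sum((X - mu[c]) ** 2))

def kmeans_with_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):
        mu, c = kmeans(X, k, seed=seed)      # kmeans as sketched after the animation above
        J = distortion(X, mu, c)
        if best is None or J < best[0]:      # keep the run with the lowest distortion
            best = (J, mu, c)
    return best
```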
More questions? Yeah.

Student: [Inaudible] as a norm, or can you [inaudible] or infinity norm or —

Instructor (Andrew Ng): I see. Right. Is it usually the two-norm? Let's see. For the vast majority of applications I've seen for k-means, you do take the two-norm when you have real-valued data. I'm sure there are others who have used the infinity norm or the one-norm as well. I personally haven't seen that very often, but there are other variations on this algorithm that use different norms; the one I described is probably the most commonly used one there is. Okay.

So that was k-means clustering. What I want to do next, and this will take longer to describe, is actually talk about a closely related problem. In particular, what I wanted to do was talk about density estimation.

As a motivating example, this is a problem that I know some people have worked on. Let's say you work for an aircraft company and you're building aircraft engines off an assembly line, and as the aircraft engines roll off the assembly line, you test these engines and measure various properties of them. To use a simple example, I'm gonna write these properties as heat and vibrations. In reality, you'd measure many different things, different vibration frequencies and so on, but we'll just use the amount of heat produced and the amount of vibration produced.

Let's say that maybe the data looks like that, and what you would like to do is estimate the density of these features, the joint distribution of the amount of heat produced and the amount of vibration produced, because you would like to detect anomalies. So as a new aircraft engine rolls off the assembly line, you can measure the same heat and vibration properties, and if you get a point out there, you can then ask: how likely is it that there was an undetected flaw in this aircraft engine, so that it needs to undergo further inspections?
And so if we look at the typical distribution of features we get, and we build a model for P of X, then if P of X is very small for some new aircraft engine, that would raise a red flag. We'll say this is an anomalous aircraft engine, and we should subject it to further inspections before we let someone fly with the engine.

So this problem I just described is an instance of what is called anomaly detection, and a common way of doing anomaly detection is to take your training set and, from this data set, build a model P of X of the density of the typical data you're seeing, and if you ever then see an example with very low probability under P of X, you may flag that as an anomalous example. Okay?

Anomaly detection is also used in security applications. If many very unusual transactions start to appear on my credit card, that's a sign to me that someone has stolen my credit card.
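A minimal sketch of that recipe, assuming for the moment that the density P(x) is a single Gaussian fit to the training set. The made-up training data, the threshold epsilon, and the single-Gaussian choice are all my own stand-ins; the point of what follows in the lecture is precisely that a richer density model is often needed.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up "heat, vibration" measurements for engines known to be fine.
X_train = np.random.randn(500, 2) * np.array([1.0, 0.3]) + np.array([5.0, 2.0])

# Fit a density model P(x); here, a single Gaussian as a placeholder.
mu_hat = X_train.mean(axis=0)
Sigma_hat = np.cov(X_train, rowvar=False)
p = multivariate_normal(mean=mu_hat, cov=Sigma_hat)

def is_anomalous(x_new, epsilon=1e-4):
    # Flag a new engine whose measurements have very low probability under the model.
    return p.pdf(x_new) < epsilon
```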
And what I want to do now is talk about a specific algorithm for density estimation, and in particular one that works with data sets like these, where the distribution doesn't really fall into any of the standard textbook distributions. This is not really a Gaussian, or an exponential, or anything like that. So can we come up with a model to estimate densities that may have these somewhat unusual shapes? Okay.

So to describe the algorithm a bit more, I'm also going to use a one-dimensional example rather than a 2-D example, and in the example that I'm going to describe, let's imagine a data set that looks like this, where the horizontal axis here is the x axis and these dots represent the positions of the data points I have. Okay.

So this data set looks like it's maybe coming from a density that looks like that, as if it were the sum of two Gaussian distributions, and so the specific model I'm gonna describe is what's called a mixture of Gaussians model. And just to be clear, the picture I have in mind is that maybe there were two separate Gaussians that generated this data set, and if only I knew which of the two Gaussians each point came from, then I could fit one Gaussian to the crosses, fit a Gaussian to the O's, and then sum those up to get the overall density. But the problem is I don't actually have access to these labels. I don't actually know which of the two Gaussians each of my data points came from, and so what I'd like to do is come up with an algorithm to fit this mixture of Gaussians model even when I don't know which of the two Gaussians each of my data points came from. Okay.

So here's the idea. In this model, I'm going to imagine there's a latent random variable — latent is just synonymous with hidden or unobserved, okay. So we're gonna imagine there's a latent random variable z, and x(i), z(i) have a joint distribution that is given as follows. We have that P(x(i), z(i)) = P(x(i) | z(i)) P(z(i)) — by the chain rule of probability, this is always true. And moreover, our model is that z(i) is distributed Multinomial with parameters phi, and in the special case where I have just two Gaussians, z(i) will be Bernoulli. So these parameters phi are the parameters of a multinomial distribution. And the distribution of x(i) conditioned on z(i) being equal to j — so P(x(i) | z(i) = j) — that's going to be a Gaussian distribution with mean mu_j and covariance Sigma_j. Okay.
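Collecting the pieces of the model in one place, in the notation used on the board:

```latex
p\bigl(x^{(i)}, z^{(i)}\bigr) \;=\; p\bigl(x^{(i)} \mid z^{(i)}\bigr)\, p\bigl(z^{(i)}\bigr),
\qquad
z^{(i)} \sim \mathrm{Multinomial}(\phi), \quad \phi_j \ge 0,\ \ \textstyle\sum_{j=1}^{k} \phi_j = 1,
\qquad
x^{(i)} \mid z^{(i)} = j \;\sim\; \mathcal{N}(\mu_j, \Sigma_j).
```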
So this should actually look extremely familiar to you. What I've written down are pretty much the same equations that I wrote down for the Gaussian Discriminant Analysis algorithm that we saw way back, right, except for the difference that, instead of supervised learning where we were given the class labels y, I've now replaced y in Gaussian Discriminant Analysis with these latent, or unobserved, random variables z, and we don't actually know what the values of z are. Okay.

So just to make the link to Gaussian Discriminant Analysis a little more explicit: if we knew what the z's were — which we actually don't, but suppose for the sake of argument that we actually knew which of, say, the two Gaussians each of our data points came from — then you could use maximum likelihood estimation. You can write down the likelihood of the parameters, which would be l(phi, mu, Sigma) = sum from i = 1 to m of log p(x(i), z(i); phi, mu, Sigma), and you can then do maximum likelihood estimation, and you get exactly the same formulas as in Gaussian Discriminant Analysis. Okay.

So if you knew the values of the z's, you could write down the log likelihood and do maximum likelihood this way, and you could then estimate all the parameters of your model. Does this make sense? Raise your hand if this makes sense. Cool. Some of you have questions? Some of you didn't raise your hands. Yeah.

Student: So this z(i) is just a label, like, an X or an O?

Instructor (Andrew Ng): Yes. Basically. Any other questions? Okay.

So if you knew the values of z, with z playing a similar role to the class labels in Gaussian Discriminant Analysis, then you could use maximum likelihood to estimate the parameters. But in reality, we don't actually know the values of the z's.
All we're given is this unlabeled data set, and so let me write down a specific bootstrap-like procedure in which the idea is that we're going to use our model to try and guess what the values of z are. We don't know the z's, but we'll just take a guess at their values, and we'll then use the values of z that we guessed to fit the parameters of the rest of the model, and then we'll iterate. Now that we have a better estimate for the parameters of the rest of the model, we'll take another guess for what the values of z are, and then we'll again use something like maximum likelihood estimation to set the parameters of the model.

So the algorithm I'm gonna write down is called the EM algorithm, and it proceeds as follows. Repeat until convergence the following. In the E-step, we're going to guess the values of the unknown z(i)'s, and in particular, I'm going to set w_j(i) := P(z(i) = j | x(i); phi, mu, Sigma). Okay. So I'm going to compute the probability that z(i) is equal to j; I'm going to use the current parameters of my model to compute the probability that point x(i) came from Gaussian number j.

And just to be concrete about what I mean by this, this means I'm going to compute P(x(i) | z(i) = j) P(z(i) = j), divided by P(x(i)) — this step is just Bayes' rule, I guess. And again, just to be completely concrete about what I mean, that first term, P(x(i) | z(i) = j), well, that's the Gaussian density: it's 1 over (2 pi)^(n/2) |Sigma_j|^(1/2), times e to the -1/2 (x(i) - mu_j)^T Sigma_j^(-1) (x(i) - mu_j), and then divided by the sum from l = 1 to k of essentially the same thing, but with j replaced by l. Okay. So it's the term for Gaussian j in the numerator, and the sum of the similar terms in the denominator. Excuse me.
This is the sum from l = 1 through k in the denominator. Okay. Let's see.

Then there's the maximization step, where you update your estimates of the parameters. So I'll just write down the formulas here; when you see these, you should compare them to the formulas we had for maximum likelihood estimation:

phi_j := (1/m) sum from i = 1 to m of w_j(i),
mu_j := sum over i of w_j(i) x(i), divided by sum over i of w_j(i),
Sigma_j := sum over i of w_j(i) (x(i) - mu_j)(x(i) - mu_j)^T, divided by sum over i of w_j(i).

And so the two formulas on top are very similar to what you saw for Gaussian Discriminant Analysis, except that now we have these w_j(i)'s. So w_j(i), you remember, was the probability we computed that point i came from Gaussian — I don't want to call it cluster j, but that's what it is — that point i came from Gaussian j, rather than an indicator for whether point i came from Gaussian j. Okay. And the one other slight difference between this and the formulas we had for Gaussian Discriminant Analysis is that in the mixture of Gaussians, we more commonly use a different covariance matrix for each of the different Gaussians, whereas in Gaussian Discriminant Analysis, sort of by convention, you usually model all of the classes with the same covariance matrix Sigma.
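Here is a minimal sketch of one iteration of these two steps in Python; the function name and the array shapes are my own choices, and the formulas are just the E-step and M-step written out above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    """One EM iteration for a mixture of k Gaussians.
    X: (m, n) data; phi: (k,) mixing weights; mu: (k, n) means; Sigma: (k, n, n) covariances."""
    k = len(phi)

    # E-step: w[i, j] = P(z(i) = j | x(i); phi, mu, Sigma), via Bayes' rule.
    w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                         for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: update the parameters using the soft weights w in place of hard indicators.
    phi_new = w.mean(axis=0)
    mu_new = (w.T @ X) / w.sum(axis=0)[:, None]
    Sigma_new = np.empty_like(Sigma)
    for j in range(k):
        diff = X - mu_new[j]
        Sigma_new[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum()
    return phi_new, mu_new, Sigma_new
```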
I just wrote down a lot of equations. Why don't you just take a second to look at this and make sure it all makes sense? Do you have questions about this? Raise your hand if this makes sense to you. Okay. Only some of you. Let's see. So let me try to explain that a little bit more.

Some of you recall that in Gaussian Discriminant Analysis, right, if we knew the values of the z(i)'s — so let's see, suppose I were to give you a labeled data set, suppose I were to tell you the values of the z(i)'s for each example, then I'd be giving you a data set that looks like this. Okay. So here's my 1-D data set; that's sort of a typical 1-D Gaussian Discriminant Analysis data set.

So for Gaussian Discriminant Analysis we worked out the maximum likelihood estimates for the parameters of GDA, and one of the estimates was phi_j, which is the probability that z(i) equals j. You would estimate that as the sum from i = 1 to m of the indicator that z(i) equals j, divided by m. Okay. That's what we had when we were deriving GDA: if you knew the class labels for every example, then this was your maximum likelihood estimate for the chance that a label came from the positive class versus the negative class. It's just a fraction of examples: your maximum likelihood estimate for the probability of getting examples from class j is just the fraction of examples in your training set that actually came from class j. So this is the maximum likelihood estimation for Gaussian Discriminant Analysis.

Now, in the mixture of Gaussians model and the EM problem, we don't actually have these class labels, right? We just have an unlabeled data set like this, just a set of dots. I'm trying to draw the same data set that I had above, but just without the class labels. So now, it's as if you only get to observe the x(i)'s, but the z(i)'s are unknown. Okay. So the class labels are unknown, and in the EM algorithm we're going to try to take a guess for the values of the z(i)'s. Specifically, in the E-step we computed w_j(i), our current best guess for the probability that z(i) equals j given that data point. Okay.
So this just means: given my current hypothesis for where the Gaussians are, and given everything else, I can compute the posterior probability — what is the posterior probability that the point x(i) actually came from class j? What is the probability that this point was a cross versus an O? What's the probability that this point came from each of the Gaussians?

And now in the M-step, my formula for estimating the parameter phi_j will be given by 1 over m, sum from i = 1 through m, of w_j(i). So w_j(i) is, right, my best guess for the probability that point i belongs to Gaussian, or belongs to class, j, and I'm using this formula instead of that one. Okay. And similarly, this is my formula for the estimate of mu_j, and if you replace the w_j(i)'s with those indicator functions, you get back the formula that you had in Gaussian Discriminant Analysis. So I'm trying to convey an intuitive sense of why these algorithms make sense. Can you raise your hand if this makes sense now? Cool. Okay.

So what I want to do now is actually present a broader view of the EM algorithm. What you just saw was a special case of the EM algorithm, specialized to the mixture of Gaussians model, and in the remaining half hour I have today I'm going to give a general description of the EM algorithm, and everything you just saw will be derived as, sort of, a special case of this more general view that I'll present now. And as a precursor to actually deriving this more general view of the EM algorithm, I'm gonna have to describe something called Jensen's inequality that we use in the derivation. So here's Jensen's inequality.
Let f be a convex function. So a function is convex if its second derivative, which I've written f'', is greater than or equal to zero. Functions don't have to be differentiable to be convex, but if a function has a second derivative, then f'' should be greater than or equal to zero. And let X be a random variable. Then f applied to the expectation of X is less than or equal to the expectation of f of X. Okay. And hopefully you remember I often drop the square brackets, so E[X] is the expectation of X; I'll often drop the square brackets.

So let me draw a picture that explains this. I think many of my friends and I often don't remember whether it's less than or greater than or whatever, and the way many of us remember the sign of the inequality is by drawing the following picture.

For this example, let's say X is equal to 1 with probability one-half and X is equal to 6 with probability one-half. So I'll illustrate the inequality with this example. So X is 1 with probability one-half and X is 6 with probability one-half, and so the expected value of X is 3.5; it would be in the middle here. So that's the expected value of X; the horizontal axis here is the x axis. And so f of the expected value of X you can read off as this point here. So this is f(E[X]).

Whereas, in contrast, let's see: if X is equal to 1, then here's f(1), and if X equals 6, then here's f(6), and the expected value of f(X), it turns out, is now an average on the vertical axis. With 50 percent chance you get f(1) and with 50 percent chance you get f(6), and so the expected value of f(X) is the average of f(1) and f(6), which is going to be this value in the middle here.
And so in this example you see that the expected value of f(X) is greater than or equal to f of the expected value of X. Okay.

And it turns out, further, that if f''(x) is strictly greater than zero — if this happens, we say f is strictly convex — then the inequality holds with equality, or in other words E[f(X)] = f(E[X]), if and only if X is a constant with probability 1. Well, another way of writing this is X = E[X]. Okay. So in other words, if f is a strictly convex function, then the only way for this inequality to hold with equality is if the random variable X always takes on the same value. Okay. Any questions about this? Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): Say that again?

Student: What is the strict [inaudible]?

Instructor (Andrew Ng): I still couldn't hear that. What is —

Student: What is the strictly convex [inaudible]?

Instructor (Andrew Ng): Oh, I see. If f double prime of x is strictly greater than 0, that's my definition for strictly convex. If the second derivative of x is strictly greater than 0, then that's what it means for f to be strictly convex.

Student: [Inaudible]

Instructor (Andrew Ng): I see. Sure. So, for example, this is an example of a convex function that's not strictly convex, because part of this function is a straight line, and so f double prime would be zero in this portion. Let's see. Yeah. It's just a less formal way of saying it: strictly convex just means that you can't have a convex function with a straight line portion in it. Speaking very informally, think of this as meaning that there aren't any straight line portions. Okay.
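A quick numeric check of the picture above, using f(x) = x squared as a stand-in strictly convex function (my choice; any convex f behaves the same way):

```python
# Jensen's inequality check for X = 1 or 6, each with probability 1/2.
f = lambda x: x ** 2          # a strictly convex function (f'' = 2 > 0)
values, probs = [1.0, 6.0], [0.5, 0.5]

E_X = sum(p * x for p, x in zip(probs, values))          # E[X] = 3.5
f_of_E = f(E_X)                                          # f(E[X]) = 12.25
E_of_f = sum(p * f(x) for p, x in zip(probs, values))    # E[f(X)] = 18.5

assert f_of_E <= E_of_f   # f(E[X]) <= E[f(X)] for convex f
```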
So here's the derivation for the general version of EM. The problem we face is as follows. We have some model for the joint distribution of x and z, but we observe only x, and our goal is to maximize the log likelihood of the parameters of the model. Right. So we have some model for the joint distribution of x and z, and our goal is to find the maximum likelihood estimate of the parameters theta, where the log likelihood is defined as l(theta) = sum from i = 1 to m of log p(x(i); theta), the log probability of our data as usual. And here p(x(i); theta) is now given by a sum over all the values of z(i) of p(x(i), z(i); theta). Okay. So just by taking our model of the joint distribution of x and z and marginalizing out z(i), we get p(x(i); theta). And so the EM algorithm will be a way of performing this maximum likelihood estimation problem, which is complicated by the fact that we have these z(i)'s in our model that are unobserved.

Before I actually do the math, here's a useful picture to keep in mind. The horizontal axis in this cartoon is the theta axis, and there's some function, the log likelihood l(theta), that we're trying to maximize, and usually maximizing it directly, by taking derivatives and setting them to zero, would be very hard to do. What the EM algorithm will do is the following.
What the EM algorithm will do is the following. Let's say it initializes some value of theta zero. What the EM algorithm will end up doing is it will construct a lower bound for this log likelihood function, and this lower bound will be tight - hold with equality - at the current guess of the parameters, and then it maximizes this lower bound with respect to theta, so we'll end up with, say, that value. So that will be theta 1. Okay. And then the EM algorithm looks at theta 1 and constructs a new lower bound that's tight at theta 1, and then we'll maximize that. So you jump here. So that's the next value, theta 2, and you do that again and you get theta 3, theta 4, and so on, until you converge to a local optimum of the log likelihood of theta. Okay. So this is a cartoon that displays what the EM algorithm will do. So let's actually make that formal now.

So you want to maximize with respect to theta - there's my theta - so this is the sum over i from 1 to m of the log of the sum over all values of z(i) of the joint probability of x(i) and z(i), parameterized by theta. Okay.

So what I'm going to do is multiply and divide by the same thing, and I'm gonna write this as Q. Okay. So I'm going to construct a probability distribution Q_i, which will be over the latent random variables z(i), and these Q_i will be distributions, so each of the Q_i will be greater than or equal to zero, and the sum over all the values of z(i) of Q_i will be 1. So these Q's will be a probability distribution that I get to construct. Okay. And then later I'll describe the specific choice of this distribution Q_i.
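(Putting the multiply-and-divide step on one line, as a sketch in my notation, where each Q_i is a distribution over the values of z(i):)

\[
\ell(\theta) \;=\; \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
\frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)},
\qquad
Q_i\big(z^{(i)}\big) \ge 0, \quad \sum_{z^{(i)}} Q_i\big(z^{(i)}\big) = 1.
\]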
So this Q_i is a probability distribution over the random variables z(i), so this is an expectation with respect to Q_i. Right. I see some frowns. Do you have questions about this? No. Okay.

So the log function looks like that, and it's a concave function, so that tells us that the log of E[X] is greater than or equal to the expected value of log X, by the concave-function form of Jensen's inequality. And so continuing from the previous expression, this is a sum of a log of an expectation, and that must therefore be greater than or equal to the expected value of the log of that. Okay. Using Jensen's inequality. And lastly, just to expand out this formula again, this is equal to that. Okay. Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): [Inaudible]

Student: [Inaudible].

Instructor (Andrew Ng): Yeah. Okay. So this has the following interpretation. Let's say I have a random variable Z, right, and Z has some distribution; let's denote it P. And let's say I have some function g of Z. Okay. Then by definition, the expected value of g of Z is equal to the sum over all the values of Z of the probability of that value of Z times g of Z. Right. That's sort of the definition of the expectation of a function of a random variable. And so the way I got from this step to this step is by using that.
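(That definition, written out as a sketch in my notation, with P(z) the probability that Z takes the value z:)

\[
\mathrm{E}\big[g(Z)\big] \;=\; \sum_{z} P(z)\, g(z).
\]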
So in particular, now, I've been using the distribution Q_i to denote the distribution of z(i), so this is, like, the sum over z of Q_i of z times that ratio, and so this is just the expected value, with respect to a random variable z drawn from the distribution Q_i, of g of z. Are there questions?

Student: So in general when you're doing maximum likelihood estimation, you maximize the likelihood of the data, but in this case you only say probability of X because you have only observed X, whereas previously we said probability of X given the labels?

Instructor (Andrew Ng): Yes. Exactly. Right. Right. [Inaudible] we want to choose the parameters that maximize the probability of the data, and in this case, our data comprises only the X's because we don't observe the Z's, and therefore the likelihood of the parameters is given by the probability of the data, which is [inaudible].

So this is all we've done, right: we wanted to maximize the log likelihood of theta, and what we've done, through these manipulations, is we've now constructed a lower bound on the log likelihood of theta. Okay. And in particular, this formula that we came up with - we should think of this as a function of theta. Then, if you think about it, theta are the parameters of your model, right, so if you think about this as a function of your parameters theta, what we've just shown is that the log likelihood of your parameters theta is lower bounded by this thing. Okay.

Remember that cartoon of repeatedly constructing a lower bound and optimizing the lower bound. So what we've just done is construct a lower bound on the log likelihood of theta. Now, the last piece we want for this lower bound is actually that we want this inequality to hold with equality for the current value of theta.
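(Viewed as a function of theta, the bound just constructed - the inequality we now want to be tight at the current theta - is, as a sketch in my notation:)

\[
\ell(\theta)
\;=\; \sum_{i=1}^{m} \log\, \mathrm{E}_{z^{(i)} \sim Q_i}\!\left[\frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}\right]
\;\ge\;
\sum_{i=1}^{m} \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
\log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}.
\]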
So just referring back to the previous cartoon: if this was the log likelihood of theta, we'd then construct some lower bound of it, some function of theta, and if this is my current value of theta, then I want my lower bound to be tight. I want my lower bound to be equal to the log likelihood of theta, because that's what I need to guarantee that when I optimize my lower bound, then I'll actually do even better on the true objective function. Yeah.

Student: How do [inaudible]

Instructor (Andrew Ng): Excuse me. Yeah. Great question. How do I know that [this] function is concave? Yeah. I don't think I've shown it. It actually turns out to be true for all the models we work with. Do I know that the lower bound is a concave function of theta? I think you're right. In general, this may not be a concave function of theta. For many of the models we work with, this will turn out to be a concave function, but that's not always true. Okay.

So let me go ahead and choose a value for Q. And I'll refer back to Jensen's inequality. We said that this inequality will become an equality if the random variable inside is a constant. Right. If you're taking an expectation with respect to a constant-valued random variable.
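(Concretely, the condition being used is the following sketch: the ratio inside the expectation must not depend on z(i); c below is just my name for that constant value.)

\[
\frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)} \;=\; c
\ \ \text{for all } z^{(i)}
\quad\Longrightarrow\quad
Q_i\big(z^{(i)}\big) \;\propto\; p\big(x^{(i)}, z^{(i)};\theta\big).
\]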
So the Q_i of z(i)'s must sum to 1, and so to compute them you should just take p of x(i), z(i), parameterized by theta, and just normalize so that the sum is one. There is a step that I'm skipping here to show that this is really the right thing to do; hopefully, you'll just be convinced it's true. The actual steps that I skipped are written out in the lecture notes. So you then have that the denominator, by definition, is that, and so by the definition of conditional probability, Q_i of z(i) is just equal to p of z(i) given x(i), parameterized by theta. Okay.

And so to summarize the algorithm, the EM algorithm has two steps. In the E step, we set - we choose the distributions Q_i - so Q_i of z(i) will be set equal to p of z(i) given x(i), parameterized by theta. That's the formula we just worked out. And so by this step we've now created a lower bound on the log likelihood function that is tight at the current value of theta. And in the M step, we then optimize that lower bound with respect to our parameters theta, and specifically we set theta to the [arg max] of that lower bound over theta. Okay. And so that's the EM algorithm.
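(To restate the two steps in symbols, as a sketch of what's on the board:)

\[
\text{E step:} \quad Q_i\big(z^{(i)}\big) \;:=\; p\big(z^{(i)} \mid x^{(i)};\theta\big),
\qquad
\text{M step:} \quad \theta \;:=\; \arg\max_{\theta} \ \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
\log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}.
\]

And as a toy numerical illustration of that two-step template - this is my own sketch, not something from the board - here is EM for a 1-D mixture of two Gaussians, where the E step normalizes the joint p(x(i), z; theta) over z and the M step happens to have a closed form. The names phi, mu, sigma and the synthetic data are assumptions made only for this example.

import numpy as np

def em_gaussian_mixture(x, n_iters=100):
    m = len(x)
    # Initial guesses for theta = (phi, mu, sigma); any reasonable guess works.
    phi = np.array([0.5, 0.5])             # mixing proportions p(z)
    mu = np.array([x.min(), x.max()])      # component means
    sigma = np.array([x.std(), x.std()])   # component standard deviations

    for _ in range(n_iters):
        # E step: Q_i(z) = p(z | x_i; theta). Evaluate the joint p(x_i, z; theta)
        # for each z and normalize each row to sum to one, as described above.
        joint = np.stack([
            phi[z] * np.exp(-(x - mu[z]) ** 2 / (2.0 * sigma[z] ** 2))
                   / (np.sqrt(2.0 * np.pi) * sigma[z])
            for z in (0, 1)
        ], axis=1)                                   # shape (m, 2)
        Q = joint / joint.sum(axis=1, keepdims=True)

        # M step: maximize the lower bound sum_i sum_z Q_i(z) log p(x_i, z; theta)
        # over theta; for this Gaussian mixture the maximizer is in closed form.
        Nz = Q.sum(axis=0)
        phi = Nz / m
        mu = (Q * x[:, None]).sum(axis=0) / Nz
        sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nz)

    return phi, mu, sigma

# Usage on synthetic data drawn from two well-separated Gaussians.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
print(em_gaussian_mixture(data))

On data like this, the estimated mixing proportions and means should come out close to the values used to generate the sample, and each iteration can only increase (or leave unchanged) the log likelihood, which is the point of the lower-bound construction.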
I won't have time to do it today, but I'll probably show this in the next lecture: the EM algorithm that I wrote down for the mixture of Gaussians is actually a special case of this more general template, where its E step and M step correspond pretty much exactly to this E step and this M step that I wrote down. The E step constructs this lower bound and makes sure that it is tight at the current value of theta - that's in my choice of Q - and then the M step optimizes the lower bound with respect to the parameters theta. Okay. So lots more to say about this in the next lecture. Let's check if there's any questions before we close. No. Okay. Cool. So let's wrap up for today, and we'll continue talking about this in the next session.