This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today is talk about a different type of learning algorithm, and, in particular, start to talk about generative learning algorithms and the specific algorithm called Gaussian Discriminant Analysis. We'll take a slight digression to talk about Gaussians, and I'll briefly discuss generative versus discriminative learning algorithms, and then hopefully wrap up today's lecture with a discussion of Naive Bayes and Laplace smoothing.

So just to motivate our discussion of generative learning algorithms, right, so by way of contrast, the sort of classification algorithms we've been talking about, I think of as algorithms that do this. You're given a training set, and if you run an algorithm like logistic regression on that training set, the way I think of logistic regression is that it's trying to look at the data and find a straight line to divide the crosses and O's, right? So it's, sort of, trying to find a straight line, let me just make the data a bit noisier, trying to find a straight line that separates out the positive and the negative classes as well as possible. And, in fact, I'll show it on the laptop; maybe just use the screens or the small monitors for this. In fact, you can see there's the data set with logistic regression, and so I've initialized the parameters randomly, and so logistic regression is, kind of, outputting, it's the, kind of, hypothesis at iteration zero that's the straight line shown in the bottom right. And so after one iteration of gradient descent, the straight line changes a bit. After two iterations, three, four, until logistic regression converges and has found the straight line that, more or less, separates the positive and negative class, okay? So you can think of this as logistic regression, sort of, searching for a line that separates the positive and the negative classes.
What I want to do today is talk about an algorithm that does something slightly different, and to motivate us, let's use our old example of trying to classify tumors as malignant cancer or benign cancer, right? So a patient comes in and they have a cancer; you want to know if it's a malignant, or harmful, cancer, or if it's a benign, meaning a harmless, cancer. So rather than trying to find the straight line to separate the two classes, here's something else we could do. We can go through our training set and look at all the cases of malignant cancers, you know, look through our training set for all the positive examples of malignant cancers, and we can then build a model for what malignant cancers look like. Then we'll go through our training set again and take out all of the examples of benign cancers, and then we'll build a model for what benign cancers look like, okay? And then when you need to classify a new example, when you have a new patient and you want to decide whether this cancer is malignant or benign, you take your new cancer, and you match it to your model of malignant cancers, and you match it to your model of benign cancers, and you see which model it matches better, and depending on which model it matches better, you then predict whether the new cancer is malignant or benign, okay?

So what I just described, this class of methods where you build a separate model for malignant cancers and a separate model for benign cancers, is called a generative learning algorithm, and let me just, kind of, formalize this.

So the models that we've been talking about previously were actually all discriminative learning algorithms, and stated more formally, a discriminative learning algorithm is one that either learns P of Y given X directly, or even learns a hypothesis that outputs values 0, 1 directly, okay? So logistic regression is an example of a discriminative learning algorithm. In contrast, a generative learning algorithm models P of X given Y.
That is, the probability of the features given the class label, and, as a technical detail, it also models P of Y, but that's a less important thing. The interpretation of this is that a generative model builds a probabilistic model for what the features look like, conditioned on the class label, okay? In other words, conditioned on whether a cancer is malignant or benign, it models a probability distribution over what the features of the cancer look like. Then having built this model, having built a model for P of X given Y and P of Y, then by Bayes' rule, obviously, you can compute P of Y = 1 conditioned on X. This is just P of X given Y = 1 times P of Y = 1, divided by P of X, and, if necessary, you can calculate the denominator as P of X = P of X given Y = 1 times P of Y = 1 plus P of X given Y = 0 times P of Y = 0, right?

And so by modeling P of X given Y and modeling P of Y, you can actually use Bayes' rule to get back to P of Y given X, but a generative learning algorithm starts by modeling P of X given Y, rather than P of Y given X, okay? We'll talk about some of the tradeoffs, and why this may be a better or worse idea than a discriminative model, a bit later.
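As a minimal sketch of that Bayes' rule step (the function and variable names here are mine, not from the lecture), assuming the two class-conditional density values and the class prior are already available:

```python
def posterior_y1(px_given_y1, px_given_y0, p_y1):
    """P(y=1 | x) by Bayes' rule, given p(x|y=1), p(x|y=0), and the prior P(y=1)."""
    # Numerator: p(x | y=1) * P(y=1)
    num = px_given_y1 * p_y1
    # Denominator: p(x) = p(x|y=1) P(y=1) + p(x|y=0) P(y=0)
    den = num + px_given_y0 * (1.0 - p_y1)
    return num / den
```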
Let's go through a specific example of a generative learning algorithm, and for this specific motivating example, I'm going to assume that your input features X are in R^n and are continuous values, okay? And under this assumption, let me describe to you a specific algorithm called Gaussian Discriminant Analysis, and the, I guess, core assumption is that in the Gaussian discriminant analysis model we're going to assume that P of X given Y is Gaussian, okay? So actually, just raise your hand: how many of you have seen a multivariate Gaussian before, not a 1D Gaussian, but the higher-dimensional one? Okay, cool, like maybe half of you, two-thirds of you.

So let me just say a few words about Gaussians, and for those of you that have seen it before, it'll be a refresher. So we say that a random variable Z is distributed Gaussian, multivariate Gaussian, and write Z ~ N(mu, Sigma), with the script N for normal, with parameters mean mu and covariance matrix Sigma, if Z has density p(Z) = 1 / ((2 pi)^(n/2) |Sigma|^(1/2)) * exp(-1/2 (Z - mu)^T Sigma^(-1) (Z - mu)), okay? That's the formula for the density; it's the generalization of the one-dimensional Gaussian, the familiar bell-shaped curve, to a higher-dimensional, vector-valued random variable Z. Don't worry too much about this formula for the density. You rarely end up needing to use it, but the two key quantities are: this vector mu is the mean of the Gaussian, and this matrix Sigma is the covariance matrix. And so Sigma will be equal to, right, the definition of the covariance of a vector-valued random variable, which is E[(Z - mu)(Z - mu) transpose], okay? And, actually, if this doesn't look familiar to you, you might re-watch the discussion section that the TAs held last Friday, or the one that they'll be holding later this week, on, sort of, a recap of probability, okay?

So a multivariate Gaussian is parameterized by a mean and a covariance, and let me just, can I have the laptop displayed, please? I'll just go ahead and actually show you, you know, graphically, the effects of varying the parameters of a Gaussian. So what I have up here is the density of a zero-mean Gaussian with covariance matrix equal to the identity. The covariance matrix is shown in the upper right-hand corner of the slide, and there's the familiar bell-shaped curve in two dimensions.
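As a small illustration, here is one way to evaluate that density numerically, using SciPy's multivariate_normal (my choice of tool, not something from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Zero-mean 2D Gaussian with identity covariance, like the first plot shown.
mu = np.zeros(2)
Sigma = np.eye(2)
rv = multivariate_normal(mean=mu, cov=Sigma)

print(rv.pdf([0.0, 0.0]))   # density at the mean, the peak of the bell

# Density over a grid, e.g. for a surface or contour plot.
xs, ys = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
density = rv.pdf(np.dstack([xs, ys]))   # shape (100, 100)
```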
And so if I shrink the covariance matrix, instead of the covariance being the identity, if I shrink the covariance matrix, then the Gaussian becomes more peaked, and if I widen the covariance, say to twice the identity, then the density becomes more spread out, okay? Then back to the standard normal with identity covariance. If I increase the off-diagonals of the covariance matrix, right, if I make the variables correlated, the Gaussian becomes flattened out along this X = Y direction, and if I increase it even further, then my two variables, Z1 and Z2 on the horizontal axes, become even more correlated. I'll just show the same thing in contours. The standard normal distribution has contours that are actually circles; because of the aspect ratio these look like ellipses, but they should actually be circles. And if you increase the off-diagonals of the Gaussian covariance matrix, then it becomes ellipses aligned along the, sort of, 45 degree angle in this example. This is the same thing. Here's an example of a Gaussian density with negative covariances, so now the correlation goes the other way, and that's an even stronger negative covariance, and the same thing in contours: this is a Gaussian with negative entries on the off-diagonals, and then even larger entries on the off-diagonals, okay? And the other parameter for the Gaussian is the mean parameter, so starting with mu equal to zero and then changing the mean parameter, the location of the Gaussian just moves around, okay?
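Here is a small sketch (my own, using matplotlib and SciPy) of the contour plots just described, varying the covariance while keeping the mean at zero:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([xs, ys])

covariances = [
    np.eye(2),                      # standard normal: circular contours
    np.array([[1.0, 0.8],
              [0.8, 1.0]]),         # positive off-diagonals: ellipses along the 45-degree line
    np.array([[1.0, -0.8],
              [-0.8, 1.0]]),        # negative off-diagonals: correlation goes the other way
]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, Sigma in zip(axes, covariances):
    density = multivariate_normal(mean=np.zeros(2), cov=Sigma).pdf(grid)
    ax.contour(xs, ys, density)
    ax.set_aspect("equal")          # so circles actually look like circles
plt.show()
```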
All right. So that was a quick primer on what Gaussians look like, and here's, as a roadmap or as a picture to keep in mind when we describe the Gaussian discriminant analysis algorithm, this is what we're going to do. Here's the training set, and in the Gaussian discriminant analysis algorithm, what I'm going to do is look at the positive examples, say the crosses, and just looking at only the positive examples, I'm gonna fit a Gaussian distribution to the positive examples, and so maybe I end up with a Gaussian distribution like that, okay? So there's P of X given Y = 1. And then I'll look at the negative examples, the O's in this figure, and I'll fit a Gaussian to that, and maybe I get a Gaussian centered over there; that's the contour of my second Gaussian. And together, we'll say how later, together these two Gaussian densities will define a separator for these two classes, okay? And it'll turn out that the separator will be a little bit different from what logistic regression gives you. If you run logistic regression, you actually get the decision boundary shown by the green line, whereas Gaussian discriminant analysis gives you the blue line, okay? Switch back to chalkboard, please.

All right. Here's the Gaussian discriminant analysis model. We're going to model P of Y as a Bernoulli random variable as usual, parameterized by a parameter phi; you've seen this before. We model P of X given Y = 0 as a Gaussian. Oh, you know what? Yeah, yes, excuse me. I thought this looked strange. This should be a Sigma; there should be a determinant of Sigma to the one-half in the denominator there. It's no big deal; I was missing the determinant of Sigma to the one-half on a previous board, excuse me. Okay, and so I model P of X given Y = 0 as a Gaussian with mean mu0 and covariance Sigma, and, similarly, P of X given Y = 1 as a Gaussian with mean mu1 and the same covariance Sigma, okay?
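Written out cleanly, the model on the board is the standard GDA setup:

```latex
\begin{aligned}
y &\sim \mathrm{Bernoulli}(\phi) \\
x \mid y = 0 &\sim \mathcal{N}(\mu_0, \Sigma) \\
x \mid y = 1 &\sim \mathcal{N}(\mu_1, \Sigma) \\
p(x \mid y) &= \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_y)^{\top} \Sigma^{-1} (x - \mu_y) \right)
\end{aligned}
```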
And so the parameters of this model are phi, mu0, mu1, and Sigma, and so I can now write down the likelihood of the parameters, or, excuse me, actually, the log likelihood of the parameters, as the log of that, right? So, in other words, if I'm given the training set, then I can write down the log likelihood of the parameters as the log of the product over the training examples of the probabilities P of XI, YI, right? And this is just equal to that, where each of these terms, P of XI given YI or P of YI, is then given by one of these three equations on top, okay?

And I just want to contrast this again with discriminative learning algorithms, right? So to give this a name, I guess, this is sometimes actually called the joint data likelihood, the joint likelihood, and let me just contrast this with what we had previously when we were talking about logistic regression, where I said the log likelihood of the parameter theta was the log of the product from I = 1 to M of P of YI given XI, parameterized by theta, right? So back when we were fitting logistic regression models or generalized linear models, we were always modeling P of YI given XI, parameterized by theta, and that was the conditional likelihood, okay, in which we're modeling P of YI given XI, whereas, now, in generative learning algorithms, we're going to look at the joint likelihood, which is P of XI, YI, okay? So let's see.
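For reference, the two likelihoods being contrasted, written out:

```latex
\begin{aligned}
\text{joint (generative):}\quad
\ell(\phi, \mu_0, \mu_1, \Sigma)
  &= \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma\big)
   = \log \prod_{i=1}^{m} p\big(x^{(i)} \mid y^{(i)}\big)\, p\big(y^{(i)}\big) \\
\text{conditional (discriminative):}\quad
\ell(\theta)
  &= \log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)
\end{aligned}
```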
So given the training set, and using the Gaussian discriminant analysis model, to fit the parameters of the model we'll do maximum likelihood estimation as usual, and so you maximize your log likelihood L with respect to the parameters phi, mu0, mu1, Sigma, and if you find the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate of phi is actually no surprise, and I'm writing this down mainly as practice for indicator notation, all right? So the maximum likelihood estimate for phi would be the sum over I of YI, divided by M, or, written alternatively, the sum over all your training examples of the indicator of YI = 1, divided by M, okay? In other words, the maximum likelihood estimate for the Bernoulli parameter phi is just the fraction of training examples with label one, with Y equal to 1.

The maximum likelihood estimate for mu0 is this, okay? You should stare at this for a second and see if it makes sense. Actually, I'll just write the one for mu1 on the next board while you do that. Okay? So what this is: the denominator is the sum over your training set of the indicator of YI = 0. So for every training example for which YI = 0, this will increment the count by one, all right? So the denominator is just the number of examples with label zero, all right? And then the numerator will be, let's see, the sum from I = 1 to M, and every time YI is equal to 0, this will be a one, and otherwise, this thing will be zero, and so this indicator function means that you're including only the terms for which YI is equal to zero, because for all the times where YI is equal to one, this term in the sum will be equal to zero, and then you multiply that by XI, and so the numerator is really the sum of the XI's corresponding to examples where the class labels were zero, okay?
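Here is a minimal NumPy sketch of those maximum likelihood estimates (variable names are mine; the shared-covariance formula is the one from the lecture notes, which the lecture defers to):

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA: X is (m, n), y is (m,) with values 0/1."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m = X.shape[0]
    phi = np.mean(y == 1)                       # fraction of examples with label 1
    mu0 = X[y == 0].mean(axis=0)                # average of the negative examples
    mu1 = X[y == 1].mean(axis=0)                # average of the positive examples
    # Shared covariance: average outer product of (x - mu_{y}) over all examples.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma
```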
Raise your hand if this makes sense. Okay, cool. So just to say this fancifully, this just means: look through your training set, find all the examples for which Y = 0, and take the average of the value of X for all the examples for which Y = 0. So take all your negative training examples and average the values of X, and that's mu0, okay? If this notation is still a little bit cryptic, if you're still not sure why this equation translates into what I just said, do go home and stare at it for a while until it just makes sense. This is, sort of, no surprise. It just says, to estimate the mean for the negative examples, take all your negative examples and average them. So no surprise, but this is useful practice for indicator notation. [Inaudible] derive the maximum likelihood estimate for Sigma. I won't do that; you can read that in the notes yourself.

And so having fit the parameters phi, mu0, mu1, and Sigma to your data, well, you now need to make a prediction. You know, when you're given a new value of X, when you're given a new cancer, you need to predict whether it's malignant or benign. Your prediction is then going to be, let's say, the most likely value of Y given X. I should write a semicolon and the parameters there, but I'll just leave that out. So this is the arg max over Y of P of Y given X, which, by Bayes' rule, is the arg max over Y of P of X given Y times P of Y, divided by P of X, all right? And that is, in turn, just the arg max over Y of P of X given Y times P of Y, because the denominator P of X doesn't depend on Y. And if P of Y is uniform, in other words, if each of your classes is equally likely, so if P of Y takes the same value for all values of Y, then this is just the arg max over Y of P of X given Y, okay? That happens sometimes, maybe not very often, so usually you end up using this formula, where you compute P of X given Y and P of Y using your model, okay?
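A sketch of that prediction rule (my own code; it assumes fitted parameters like those returned by the fit_gda sketch above, which is a hypothetical helper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_gda(x, phi, mu0, mu1, Sigma):
    """Return arg max over y of p(x|y) P(y), the most likely class for a new point x."""
    score0 = multivariate_normal(mean=mu0, cov=Sigma).pdf(x) * (1.0 - phi)
    score1 = multivariate_normal(mean=mu1, cov=Sigma).pdf(x) * phi
    # p(x) in the denominator is the same for both classes, so it can be dropped.
    return int(score1 > score0)
```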
Student: Can you give us arg max? Instructor (Andrew Ng): Oh, let's see. So if you take, actually let me. So arg max means the value of Y that maximizes this. Student: Oh, okay. Instructor (Andrew Ng): So just as an example, the min over X of (X - 5) squared is 0, because by choosing X equals 5, you can get this to be zero, and the arg min over X of (X - 5) squared is equal to 5, because 5 is the value of X that minimizes it, okay? Cool. Thanks for asking that.

Instructor (Andrew Ng): Okay. Actually, any other questions about this? Yeah? Student: Why is it distributed uniformly? Why isn't [inaudible]? Instructor (Andrew Ng): Oh, I see. By uniform I meant, I was being loose here. I meant if P of Y = 0 is equal to P of Y = 1, or if Y has the uniform distribution over the set 0 and 1. Student: Oh. Instructor (Andrew Ng): I just meant, yeah, if P of Y = 0 equals P of Y = 1. That's all I mean, see? Anything else? All right. Okay.

So it turns out Gaussian discriminant analysis has an interesting relationship to logistic regression. Let me illustrate that. So let's say you have a training set; actually, let me just go ahead and draw a 1D training set, and that will kind of work, yes, okay. So let's say we have a training set comprising a few negative and a few positive examples, and let's say I run Gaussian discriminant analysis. So I'll fit a Gaussian density to each of these two classes, to my positive and to my negative training examples, and so maybe my positive examples, the X's, are fit with a Gaussian like this, and my negative examples I will fit with a Gaussian that looks like that, okay?

Now, I hope you can see this [inaudible]. Now, let's vary along the X axis, and what I want to do is I'll overlay on top of this plot, I'm going to plot P of Y = 1 given X, for a variety of values of X, okay? So I actually realize what I should have done.
I'm gonna call the X's the negative examples, and I'm gonna call the O's the positive examples; it just makes this part come out better. So let's take a value of X that's fairly small. Let's say X is this value here on the horizontal axis. Then what's the probability of Y being equal to one conditioned on X? Well, the way you calculate that is you write P of Y = 1 given X, and then you plug in all these formulas as usual, right? It's P of X given Y = 1, which is your Gaussian density, times P of Y = 1, you know, which is essentially, this is just going to be equal to phi, and then divided by, right, P of X, and then this shows you how you can calculate this. By using these two Gaussians and my phi in P of Y, I can actually compute what P of Y = 1 given X is, and in this case, if X is this small, clearly it belongs to the left Gaussian. It's very unlikely to belong to the positive class, and so this will be very small; it'll be very close to zero, say, okay? And then we can increment the value of X a bit and study a different value of X, and plot what P of Y = 1 given X is, and, again, it'll be pretty small. Let's use a point like that, right? At this point, the two Gaussian densities have equal value, and if I ask, if X is this value, right, shown by the arrow, what's the probability of Y being equal to one for that value of X? Well, you really can't tell, so maybe it's about 0.5, okay? And if you fill in a bunch more points, you get a curve like that, and then you can keep going. Let's say for a point like that, you can ask what's the probability of Y being one? Well, if X is that far out, then clearly it belongs to this rightmost Gaussian, and so the probability of Y being one would be very high; it would be almost one, okay? And so you can repeat this exercise for a bunch of points.
All right, compute P of Y equals one given X for a bunch of points, and if you connect up these points, you find that the curve you get, plotted, takes the form of a sigmoid function, okay? So, in other words, when you make the assumptions of the Gaussian discriminant analysis model, that P of X given Y is Gaussian, when you go back and compute what P of Y given X is, you actually get back exactly the same sigmoid functional form that we were using in logistic regression, okay? But it turns out the key difference is that Gaussian discriminant analysis will end up choosing a different position and a different steepness of the sigmoid than would logistic regression. Is there a question? Student: I'm just wondering, the Gaussian of P of Y [inaudible] you do? Instructor (Andrew Ng): No, let's see. The Gaussian, so this Gaussian is P of X given Y = 1, and this Gaussian is P of X given Y = 0; does that make sense? Anything else? Student: Okay. Instructor (Andrew Ng): Yeah? Student: When you were drawing all the dots, how did you decide what P of Y given X was? Instructor (Andrew Ng): What, say that again. Student: I'm sorry. Could you go over how you figured out where to draw each dot? Instructor (Andrew Ng): Let's see, okay. So the computation is as follows, right? The steps are: I have the training set, and so given my training set, I'm going to fit a Gaussian discriminant analysis model to it, and what that means is I'll build a model for P of X given Y = 1, I'll build a model for P of X given Y = 0, and I'll also fit a Bernoulli distribution to P of Y, okay? So, in other words, given my training set, I'll fit P of X given Y and P of Y to my data, and now I've chosen my parameters phi, mu0, mu1, and Sigma, okay? Then this is the process I went through to plot all these dots, right? It's just: I pick a point on the X axis, and then I compute P of Y given X for that value of X, and P of Y = 1 conditioned on X will be some value between zero and one.
It'll be some real number, and whatever that real number is, I then plot it on the vertical axis, okay? And the way I compute P of Y = 1 conditioned on X is I use these quantities. I use P of X given Y and P of Y, and, sort of, plug them into Bayes' rule, and that allows me to compute P of Y given X from these three quantities; does that make sense? Student: Yeah. Instructor (Andrew Ng): Was there something more? Student: And how did you model P of X; is that... Instructor (Andrew Ng): Oh, okay. Yeah, so, well, I've got this right here. So P of X can be written as, right, P of X given Y = 0 times P of Y = 0, plus P of X given Y = 1 times P of Y = 1, right? And so each of these terms, P of X given Y and P of Y, these are terms I can get directly out of my Gaussian discriminant analysis model. Each of these terms is something that my model gives me directly, so that's plugged in as the denominator, and by doing that, that's how I compute P of Y = 1 given X; make sense? Student: Thank you. Instructor (Andrew Ng): Okay. Cool.
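A minimal 1D sketch of that dot-plotting procedure (my own code, with made-up parameter values standing in for the fitted ones); the resulting curve has the sigmoid shape described above:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Fitted 1D GDA parameters (hypothetical values, just for the picture).
phi, mu0, mu1, sigma = 0.5, -1.5, 1.5, 1.0

xs = np.linspace(-5, 5, 400)
p_x_y1 = norm(mu1, sigma).pdf(xs)
p_x_y0 = norm(mu0, sigma).pdf(xs)
# Bayes' rule at every x; the result traces out a sigmoid-shaped curve.
p_y1_given_x = p_x_y1 * phi / (p_x_y1 * phi + p_x_y0 * (1 - phi))

plt.plot(xs, p_y1_given_x)
plt.xlabel("x")
plt.ylabel("P(y = 1 | x)")
plt.show()
```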
So let's talk a little bit about the advantages and disadvantages of using a generative learning algorithm, okay? So in the particular case of Gaussian discriminant analysis, we assume that X conditioned on Y is Gaussian, and the argument I showed on the previous chalkboard, I didn't prove it formally, but you can actually go back and prove it yourself, is that if you assume X given Y is Gaussian, then that implies that when you plot Y given X, you find, well, let me just write: logistic posterior, okay?

And the argument I showed just now, which I didn't prove, you can go home and prove it yourself, is that if you assume X given Y is Gaussian, then that implies that the posterior distribution, the form of P of Y = 1 given X, is going to be a logistic function, and it turns out this implication in the opposite direction does not hold true, okay? In particular, it actually turns out, this is actually, kind of, cool. It turns out that if you assume X given Y = 1 is Poisson with parameter lambda 1, and X given Y = 0 is Poisson with parameter lambda 0, it turns out if you assume this, then that also implies that P of Y given X is logistic, okay? So there are lots of assumptions on X given Y that will lead to P of Y given X being logistic, and, therefore, the assumption that X given Y is Gaussian is a stronger assumption than the assumption that Y given X is logistic, okay? Because this implies this, right? That means that this is a stronger assumption than this, because the logistic posterior holds whenever X given Y is Gaussian, but not vice versa.
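For reference, here is how the "go home and prove it yourself" algebra works out under the shared-covariance GDA model (a standard derivation, not worked on the board):

```latex
p(y=1 \mid x)
  = \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)}
  = \frac{1}{1 + \exp\!\big(-(\theta^{\top} x + \theta_0)\big)},
\qquad
\theta = \Sigma^{-1}(\mu_1 - \mu_0),
\quad
\theta_0 = -\tfrac{1}{2}\mu_1^{\top}\Sigma^{-1}\mu_1
          + \tfrac{1}{2}\mu_0^{\top}\Sigma^{-1}\mu_0
          + \log\frac{\phi}{1-\phi}.
```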
And so this leads to some of the tradeoffs between Gaussian discriminant analysis and logistic regression, right? Gaussian discriminant analysis makes a much stronger assumption, that X given Y is Gaussian, and so when this assumption is true, when this assumption approximately holds, if you plot the data and X given Y is, indeed, approximately Gaussian, then if you make this assumption explicit to the algorithm, the algorithm will do better, because it's as if the algorithm is making use of more information about the data. The algorithm knows that the data is Gaussian, right? And so if the Gaussian assumption, you know, holds or roughly holds, then Gaussian discriminant analysis may do better than logistic regression. If, conversely, you're actually not sure what X given Y is, then logistic regression, the discriminative algorithm, may do better, and, in particular, say you use logistic regression, and maybe you thought beforehand that the data was Gaussian, but it turns out the data was actually Poisson, right? Then logistic regression will still do perfectly fine, because if the data were actually Poisson, then P of Y = 1 given X will be logistic, and it'll do perfectly fine, but if you assumed it was Gaussian, then the algorithm may go off and do something that's not as good, okay?

So it turns out that, right, so it's slightly different. It turns out the real advantage of generative learning algorithms is often that they require less data, and, in particular, data is never really exactly Gaussian, right? Because data is only ever approximately Gaussian; it's never exactly Gaussian. And it turns out generative learning algorithms often do surprisingly well even when these modeling assumptions are not met. But the other tradeoff is that, by making stronger assumptions about the data, Gaussian discriminant analysis often needs less data in order to fit, like, an okay model, even if there's less training data. Whereas, in contrast, logistic regression, by making fewer assumptions, is more robust to your modeling assumptions, because you're making a weaker assumption, you're making fewer assumptions, but sometimes it takes a slightly larger training set to fit than Gaussian discriminant analysis. Question? Student: In order to meet any assumption about the number [inaudible], plus here we assume that P of Y = 1 equals [inaudible] the number of [inaudible]. Is that true when the number of samples is marginal? Instructor (Andrew Ng): Okay. So let's see.
So there's a question of, is this true, what was that? Let me translate that differently. So the modeling assumptions are made independently of the size of your training set, right? So, like, in least squares regression, well, in all of these models, I'm assuming that these are random variables drawn from some distribution, and then, finally, I'm given a single training set and use that to estimate the parameters of the distribution, right? Student: So what's the probability of Y = 1? Instructor (Andrew Ng): Probability of Y = 1? Student: Yeah, you used the... Instructor (Andrew Ng): Sort of, this, like, goes back to the philosophy of maximum likelihood estimation, right? I'm assuming that P of Y is equal to phi to the Y, times 1 minus phi to the 1 minus Y. So I'm assuming that there's some true value of phi generating all my data, and then, well, when I write this, I guess, maybe what I should write isn't, so when I write this, I guess there are really two values of phi. One is the true underlying value of phi that gets used to generate the data, and then there's the maximum likelihood estimate of the value of phi, and so when I was writing those formulas earlier, those formulas for phi, and mu0, and mu1 were really the maximum likelihood estimates for phi, mu0, and mu1, and that's different from the true underlying values of phi, mu0, and mu1, but... Student: [Off mic]. Instructor (Andrew Ng): Yeah, right. So the maximum likelihood estimate comes from the data, and there's some, sort of, true underlying value of phi that I'm trying to estimate, and my maximum likelihood estimate is my attempt to estimate the true value, but, you know, by notational convention, we often just write it like that as well, without bothering to distinguish between the maximum likelihood value and the true underlying value that I'm assuming is out there, and that I'm only hoping to estimate.
Actually, yeah, so for questions like these about maximum likelihood and so on, I hope to use the Friday discussion section as a good time to ask questions about, sort of, probabilistic definitions like these as well. Are there any other questions? No? Great. Okay.

So, great. Oh, it turns out, just to mention one more thing that's, kind of, cool. I said that if X given Y is Poisson, you also get a logistic posterior. It actually turns out there's a more general version of this. If you assume X given Y = 1 is exponential family with natural parameter eta 1, and you assume X given Y = 0 is exponential family with natural parameter eta 0, then this implies that P of Y = 1 given X is also logistic, okay? And that's, kind of, cool. It means that X given Y could be, I don't know, some strange thing. It could be gamma, because we've seen Gaussian, right, next to the, I don't know, gamma, exponential. They could actually be beta; I'm just rattling off my mental list of exponential family distributions. It could be any one of those things, so as long as you use the same exponential family distribution for the two classes, with different natural parameters, the posterior P of Y = 1 given X will be logistic, and so this shows the robustness of logistic regression to the choice of modeling assumptions, because it could be that the data was actually, you know, gamma distributed, and the posterior still turns out to be logistic. So it's the robustness of logistic regression to modeling assumptions.

And, I think, early on I promised two justifications for where I pulled the logistic function out of the hat, right? So one was the exponential family derivation we went through last time, and this is, sort of, the second one: that all of these modeling assumptions also lead to the logistic function. Yeah? Student: [Off mic]. Instructor (Andrew Ng): Oh, if P of Y = 1 given X is logistic, does that imply this? No. That is also not true, right?
Yeah, so this exponential family distribution implies that P of Y = 1 given X is logistic, but the reverse implication is also not true. There are actually all sorts of really bizarre distributions for X that would give rise to a logistic function, okay? Okay. So that was our first generative learning algorithm. Maybe I'll talk about the second generative learning algorithm, and the motivating example, actually this is called the Naive Bayes algorithm, and the motivating example that I'm gonna use will be spam classification. All right. So let's say that you want to build a spam classifier to take your incoming stream of email and decide if it's spam or not.

So let's see. Y will be 0 or 1, with 1 being spam email and 0 being non-spam, and the first decision we need to make is, given a piece of email, how do you represent a piece of email using a feature vector X, right? So email is just a piece of text, right? Email is like a list of words or a list of ASCII characters. So how can I represent email as a feature vector X? We'll use a couple of different representations, but the one I'll use today is: we will construct the vector X as follows. I'm gonna go through my dictionary and, sort of, make a listing of all the words in my dictionary, okay? So the first word is "a". The second word in my dictionary is "aardvark", then "aardwolf", okay? You know, and somewhere along the way you see the word "buy", as in the spam email telling you to buy stuff. And, depending on how you collect your list of words, you know, you won't find CS229, right, the course number, in a dictionary, but if you collect a list of words from other emails you've gotten, you may have this in your list somewhere as well, and then the last word in my dictionary is "zymurgy", which pertains to the technological chemistry that deals with the fermentation process in brewing. So say I get a piece of email, and what I'll do is I'll then scan through this list of words, and wherever a certain word appears in my email, I'll put a 1 there.
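Here is a minimal sketch of that feature construction (the tiny vocabulary and the helper name are made up for illustration):

```python
def email_to_feature_vector(email_text, vocabulary):
    """Binary bag-of-words vector: x_j = 1 if the j-th dictionary word appears in the email."""
    words_in_email = set(email_text.lower().split())
    return [1 if word in words_in_email else 0 for word in vocabulary]

# Example with a tiny made-up vocabulary.
vocab = ["a", "aardvark", "buy", "cs229", "zymurgy"]
print(email_to_feature_vector("Buy now and get a discount", vocab))   # [1, 0, 1, 0, 0]
```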
So if a particular 823 00:46:41,249 --> 00:46:44,319 email has the word aid then that's 1. 824 00:46:44,319 --> 00:46:46,560 You know, my email doesn't have the words ausworth 825 00:46:46,560 --> 00:46:49,209 or aardvark, so it gets zeros. And again, 826 00:46:49,209 --> 00:46:52,939 a piece of email, they want me to buy something, CS229 doesn't occur, and so on, okay? 827 00:46:52,939 --> 00:46:56,910 So 828 00:46:56,910 --> 00:46:59,289 this would be 829 00:46:59,289 --> 00:47:01,939 one way of creating a feature vector 830 00:47:01,939 --> 00:47:03,619 to represent a 831 00:47:03,619 --> 00:47:05,739 piece of email. 832 00:47:05,739 --> 00:47:09,900 Now, let's throw 833 00:47:09,900 --> 00:47:16,900 the generative model out for this. Actually, 834 00:47:18,059 --> 00:47:19,289 let's use 835 00:47:19,289 --> 00:47:20,849 this. 836 00:47:20,849 --> 00:47:24,400 In other words, I want to model P of X given Y. The 837 00:47:24,400 --> 00:47:27,959 given Y = 0 or Y = 1, all right? 838 00:47:27,959 --> 00:47:31,390 And my feature vectors are going to be 0, 1 839 00:47:31,390 --> 00:47:35,709 to the N. It's going to be these split vectors, binary value vectors. They're N 840 00:47:35,709 --> 00:47:36,400 dimensional. 841 00:47:36,400 --> 00:47:38,130 Where N 842 00:47:38,130 --> 00:47:38,830 may 843 00:47:38,830 --> 00:47:41,950 be on the order of, say, 50,000, if you have 50,000 844 00:47:41,950 --> 00:47:44,089 words in your dictionary, 845 00:47:44,089 --> 00:47:46,110 which is not atypical. So 846 00:47:46,110 --> 00:47:47,459 values from - I don't 847 00:47:47,459 --> 00:47:48,449 know, 848 00:47:48,449 --> 00:47:52,219 mid-thousands to tens of thousands is very typical 849 00:47:52,219 --> 00:47:53,939 for problems like 850 00:47:53,939 --> 00:47:56,620 these. And, therefore, 851 00:47:56,620 --> 00:48:00,340 there two to the 50,000 possible values for X, right? So two to 852 00:48:00,340 --> 00:48:01,500 50,000 853 00:48:01,500 --> 00:48:02,729 possible bit vectors 854 00:48:02,729 --> 00:48:03,819 of length 855 00:48:03,819 --> 00:48:05,199 856 00:48:05,199 --> 00:48:07,239 50,000, and so 857 00:48:07,239 --> 00:48:08,829 one way to model this is 858 00:48:08,829 --> 00:48:10,749 the multinomial distribution, 859 00:48:10,749 --> 00:48:13,900 but because there are two to the 50,000 possible values for X, 860 00:48:13,900 --> 00:48:20,339 I would need two to the 50,000, but maybe -1 parameters, 861 00:48:20,339 --> 00:48:21,059 right? Because you have 862 00:48:21,059 --> 00:48:23,699 this sum to 1, right? So 863 00:48:23,699 --> 00:48:27,089 -1. And this is clearly way too many parameters 864 00:48:27,089 --> 00:48:28,479 to model 865 00:48:28,479 --> 00:48:30,269 using the multinomial distribution 866 00:48:30,269 --> 00:48:34,439 over all two to 50,000 possibilities. 867 00:48:34,439 --> 00:48:35,839 So 868 00:48:35,839 --> 00:48:42,289 in a Naive Bayes algorithm, we're 869 00:48:42,289 --> 00:48:46,579 going to make a very strong assumption on P of X given Y, 870 00:48:46,579 --> 00:48:49,369 and, in particular, I'm going to assume - let 871 00:48:49,369 --> 00:48:53,609 me just say what it's called; then I'll write out what it means. I'm going to assume that the 872 00:48:53,609 --> 00:48:54,630 XI's 873 00:48:54,630 --> 00:49:01,630 are conditionally independent 874 00:49:06,489 --> 00:49:07,950 given Y, okay? 875 00:49:07,950 --> 00:49:11,209 Let me say what this means. 
876 00:49:11,209 --> 00:49:18,209 So I have that P of X1, X2, up to X50,000, 877 00:49:18,729 --> 00:49:20,559 right, given the 878 00:49:20,559 --> 00:49:25,910 Y. By the key rule of probability, this is P of X1 given Y 879 00:49:25,910 --> 00:49:27,759 times P of X2 880 00:49:27,759 --> 00:49:30,369 given 881 00:49:30,369 --> 00:49:32,469 882 00:49:32,469 --> 00:49:33,989 Y, 883 00:49:33,989 --> 00:49:39,409 X1 884 00:49:39,409 --> 00:49:42,919 times PF - I'll just put dot, dot, dot. I'll just write 1, 1 × dot, dot, dot up to, you know, well - 885 00:49:42,919 --> 00:49:46,369 whatever. You get the idea, up to P of X50,000, okay? 886 00:49:46,369 --> 00:49:49,529 So this is the chain were of probability. This always holds. I've not 887 00:49:49,529 --> 00:49:52,849 made any assumption yet, and now, we're 888 00:49:52,849 --> 00:49:54,529 gonna 889 00:49:54,529 --> 00:49:58,089 meet what's called the Naive Bayes assumption, or this assumption that X 890 00:49:58,089 --> 00:50:01,869 defies a conditionally independent given Y. Going 891 00:50:01,869 --> 00:50:03,709 to assume that - 892 00:50:03,709 --> 00:50:06,619 well, nothing changes for the first term, 893 00:50:06,619 --> 00:50:11,719 but I'm gonna assume that P of X3 given Y, X1 is equal to P of X2 given the Y. I'm gonna assume that that term's equal to P of X3 894 00:50:11,719 --> 00:50:13,359 given 895 00:50:13,359 --> 00:50:16,979 the 896 00:50:16,979 --> 00:50:17,760 Y, 897 00:50:17,760 --> 00:50:20,099 and so on, up 898 00:50:20,099 --> 00:50:24,489 to P of X50,000 given Y, okay? 899 00:50:24,489 --> 00:50:28,849 Or just written more compactly, 900 00:50:28,849 --> 00:50:32,119 901 00:50:32,119 --> 00:50:35,989 means assume that P of X1, P of X50,000 given Y is 902 00:50:35,989 --> 00:50:41,150 the product from I = 1 to 50,000 or P of XI 903 00:50:41,150 --> 00:50:44,170 given the Y, 904 00:50:44,170 --> 00:50:45,299 okay? 905 00:50:45,299 --> 00:50:49,129 And stating informally what this means is that I'm, sort of, assuming that - 906 00:50:49,129 --> 00:50:53,009 so unless you know the cost label Y, so long as you know whether this is spam or not 907 00:50:53,009 --> 00:50:54,959 spam, 908 00:50:54,959 --> 00:50:58,579 then knowing whether the word A appears in email 909 00:50:58,579 --> 00:50:59,809 does not affect 910 00:50:59,809 --> 00:51:01,359 the probability 911 00:51:01,359 --> 00:51:02,849 of whether the word 912 00:51:02,849 --> 00:51:06,829 Ausworth appears in the email, all right? 913 00:51:06,829 --> 00:51:10,699 And, in other words, there's assuming - once you know whether an email is spam 914 00:51:10,699 --> 00:51:12,619 or not spam, 915 00:51:12,619 --> 00:51:15,519 then knowing whether other words appear in the email won't help 916 00:51:15,519 --> 00:51:19,049 you predict whether any other word appears in the email, okay? 917 00:51:19,049 --> 00:51:20,289 And, 918 00:51:20,289 --> 00:51:23,489 obviously, this assumption is false, right? This 919 00:51:23,489 --> 00:51:25,979 assumption can't possibly be 920 00:51:25,979 --> 00:51:28,190 true. I mean, if you see the word 921 00:51:28,190 --> 00:51:31,410 - I don't know, CS229 in an email, you're much more likely to see my name in the email, or 922 00:51:31,410 --> 00:51:37,050 the TA's names, or whatever. 
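To keep the algebra straight on the page, here is the factorization just derived, written with n = 50,000 features: the chain rule first (which always holds, with no assumptions), and then the Naive Bayes conditional-independence assumption:

```latex
% Chain rule of probability (always true, no assumptions yet):
\begin{align*}
p(x_1, \dots, x_{50000} \mid y)
  &= p(x_1 \mid y)\, p(x_2 \mid y, x_1)\, p(x_3 \mid y, x_1, x_2)
     \cdots p(x_{50000} \mid y, x_1, \dots, x_{49999}) \\
% Naive Bayes assumption: the x_i are conditionally independent given y,
% so each factor drops its dependence on the earlier x's:
  &= p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_{50000} \mid y)
   = \prod_{i=1}^{50000} p(x_i \mid y).
\end{align*}
```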
So this assumption is normally just false 923 00:51:37,050 --> 00:51:37,469 924 00:51:37,469 --> 00:51:39,349 under English, right, 925 00:51:39,349 --> 00:51:41,239 for normal written English, 926 00:51:41,239 --> 00:51:42,519 but it 927 00:51:42,519 --> 00:51:44,269 turns out that despite 928 00:51:44,269 --> 00:51:46,599 this assumption being, sort of, 929 00:51:46,599 --> 00:51:48,180 false in the literal sense, 930 00:51:48,180 --> 00:51:50,829 the Naive Bayes algorithm is, sort of, 931 00:51:50,829 --> 00:51:53,469 an extremely effective 932 00:51:53,469 --> 00:51:57,869 algorithm for classifying text documents into spam or not spam, for 933 00:51:57,869 --> 00:52:01,519 classifying your emails into different emails for your automatic view, for 934 00:52:01,519 --> 00:52:03,589 looking at web pages and classifying 935 00:52:03,589 --> 00:52:06,290 whether this webpage is trying to sell something or whatever. It 936 00:52:06,290 --> 00:52:07,959 turns out, this assumption 937 00:52:07,959 --> 00:52:11,819 works very well for classifying text documents and for other applications too that I'll 938 00:52:11,819 --> 00:52:16,419 talk a bit about later. 939 00:52:16,419 --> 00:52:20,689 As a digression that'll make sense only to some of you. 940 00:52:20,689 --> 00:52:23,019 Let me just say that 941 00:52:23,019 --> 00:52:27,959 if you're familiar with Bayesian X world, say 942 00:52:27,959 --> 00:52:29,839 943 00:52:29,839 --> 00:52:33,999 graphical models, the Bayesian network associated with this model looks like this, and you're assuming 944 00:52:33,999 --> 00:52:34,509 that 945 00:52:34,509 --> 00:52:36,769 this is random variable Y 946 00:52:36,769 --> 00:52:38,779 that then generates X1, X2, through 947 00:52:38,779 --> 00:52:41,089 X50,000, okay? If you've not seen the 948 00:52:41,089 --> 00:52:43,069 Bayes Net before, if 949 00:52:43,069 --> 00:52:47,549 you don't know your graphical model, just ignore this. It's not important to our purposes, but 950 00:52:47,549 --> 00:52:54,549 if you've seen it before, that's what it will look like. Okay. 951 00:52:58,019 --> 00:53:03,349 So 952 00:53:03,349 --> 00:53:10,349 953 00:53:11,660 --> 00:53:16,239 the parameters of the model 954 00:53:16,239 --> 00:53:18,069 are as follows 955 00:53:18,069 --> 00:53:21,659 with phi FI given Y = 1, which 956 00:53:21,659 --> 00:53:23,189 957 00:53:23,189 --> 00:53:28,019 is probably FX = 1 or XI = 1 958 00:53:28,019 --> 00:53:28,839 given Y = 1, 959 00:53:28,839 --> 00:53:35,529 phi I 960 00:53:35,529 --> 00:53:39,959 given Y = 0, and phi Y, okay? 961 00:53:39,959 --> 00:53:43,459 So these are the parameters of the model, 962 00:53:43,459 --> 00:53:45,419 963 00:53:45,419 --> 00:53:47,249 and, therefore, 964 00:53:47,249 --> 00:53:50,139 to fit the parameters of the model, you 965 00:53:50,139 --> 00:53:57,139 can write down the joint likelihood, right, 966 00:54:01,849 --> 00:54:07,329 is 967 00:54:07,329 --> 00:54:14,329 equal to, as usual, okay? 968 00:54:18,670 --> 00:54:20,779 So given the training sets, 969 00:54:20,779 --> 00:54:27,779 you can write down the joint 970 00:54:28,979 --> 00:54:32,579 likelihood of the parameters, and 971 00:54:32,579 --> 00:54:39,579 then when 972 00:54:43,729 --> 00:54:44,640 you 973 00:54:44,640 --> 00:54:46,869 do maximum likelihood estimation, 974 00:54:46,869 --> 00:54:50,860 you find that the maximum likelihood estimate of the parameters are 975 00:54:50,860 --> 00:54:53,239 - they're really, pretty much, what you'd expect. 
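For reference, the parameters, the joint likelihood, and the resulting maximum-likelihood estimates just alluded to can be written as follows; this is a sketch in the course's usual notation, with m training examples (x^(i), y^(i)):

```latex
% Parameters of the Naive Bayes model:
%   phi_{i|y=1} = p(x_i = 1 | y = 1),  phi_{i|y=0} = p(x_i = 1 | y = 0),  phi_y = p(y = 1).
\begin{align*}
\mathcal{L}(\phi_y, \phi_{j|y=1}, \phi_{j|y=0})
  &= \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
   = \prod_{i=1}^{m} p\big(x^{(i)} \mid y^{(i)}\big)\, p\big(y^{(i)}\big), \\
\phi_{j|y=1}
  &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}
          {\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}, \qquad
\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}.
\end{align*}
```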
976 00:54:53,239 --> 00:54:55,630 Maximum likelihood estimate for phi J given Y = 1 is 977 00:54:55,630 --> 00:54:57,880 sum from I = 1 to 978 00:54:57,880 --> 00:55:01,759 M, 979 00:55:01,759 --> 00:55:03,459 indicator 980 00:55:03,459 --> 00:55:07,129 XIJ = 981 00:55:07,129 --> 00:55:14,129 1, YI = 1, okay? 982 00:55:14,699 --> 00:55:18,479 983 00:55:18,479 --> 00:55:19,359 984 00:55:19,359 --> 00:55:20,719 And this is just a, 985 00:55:20,719 --> 00:55:22,829 I guess, stated more simply, 986 00:55:22,829 --> 00:55:24,969 the numerator just says, Run for 987 00:55:24,969 --> 00:55:27,979 your entire training set, some [inaudible] examples, 988 00:55:27,979 --> 00:55:30,919 and count up the number of times you saw word Jay 989 00:55:30,919 --> 00:55:32,549 in a piece of email 990 00:55:32,549 --> 00:55:34,649 for which the label Y was equal to 1. So, in other words, look 991 00:55:34,649 --> 00:55:36,800 through all your spam emails 992 00:55:36,800 --> 00:55:39,320 and count the number of emails in which the word 993 00:55:39,320 --> 00:55:40,559 Jay appeared out of 994 00:55:40,559 --> 00:55:42,420 all your spam emails, 995 00:55:42,420 --> 00:55:44,710 and the denominator is, you know, 996 00:55:44,710 --> 00:55:45,989 sum from I = 1 to M, 997 00:55:45,989 --> 00:55:47,949 the number of spam. The 998 00:55:47,949 --> 00:55:51,439 denominator is just the number of spam emails you got. 999 00:55:51,439 --> 00:55:54,329 And so this ratio is 1000 00:55:54,329 --> 00:55:57,769 in all your spam emails in your training set, 1001 00:55:57,769 --> 00:56:00,399 what fraction of these emails 1002 00:56:00,399 --> 00:56:03,039 did the word Jay 1003 00:56:03,039 --> 00:56:03,659 appear in - 1004 00:56:03,659 --> 00:56:06,640 did the, Jay you wrote in your dictionary appear in? 1005 00:56:06,640 --> 00:56:10,239 And that's the maximum likelihood estimate 1006 00:56:10,239 --> 00:56:13,819 for the probability of seeing the word Jay conditions on the piece of email being spam, okay? And similar to your 1007 00:56:13,819 --> 00:56:17,499 1008 00:56:17,499 --> 00:56:20,499 1009 00:56:20,499 --> 00:56:23,160 maximum likelihood estimate for phi 1010 00:56:23,160 --> 00:56:24,999 Y 1011 00:56:24,999 --> 00:56:28,880 is pretty much what you'd expect, right? 1012 00:56:28,880 --> 00:56:35,880 Okay? 1013 00:56:41,409 --> 00:56:43,389 And so 1014 00:56:43,389 --> 00:56:47,669 having estimated all these parameters, 1015 00:56:47,669 --> 00:56:50,960 when you're given a new piece of email that you want to classify, 1016 00:56:50,960 --> 00:56:53,430 you can then compute P of Y given X 1017 00:56:53,430 --> 00:56:56,079 using Bayes rule, right? 1018 00:56:56,079 --> 00:56:57,580 Same as before because 1019 00:56:57,580 --> 00:57:04,229 together these parameters gives you a model for P of X given Y and for P of Y, 1020 00:57:04,229 --> 00:57:07,849 and by using Bayes rule, given these two terms, you can compute 1021 00:57:07,849 --> 00:57:10,709 P of X given Y, and 1022 00:57:10,709 --> 00:57:15,650 there's your spam classifier, okay? 1023 00:57:15,650 --> 00:57:18,999 Turns out we need one more elaboration to this idea, but let me check if there are 1024 00:57:18,999 --> 00:57:25,109 questions about this so far. 1025 00:57:25,109 --> 00:57:29,269 Student:So does this model depend 1026 00:57:29,269 --> 00:57:29,659 on 1027 00:57:29,659 --> 00:57:34,380 the number of inputs? 
Instructor (Andrew Ng):What do 1028 00:57:34,380 --> 00:57:35,149 you 1029 00:57:35,149 --> 00:57:38,169 mean, number of inputs, the number of features? Student:No, number of samples. Instructor (Andrew Ng):Well, N is the number of training examples, so this 1030 00:57:38,169 --> 00:57:45,169 given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? So other questions, does it make 1031 00:57:48,959 --> 00:57:52,629 sense? Or M is the number of training examples, so when you have M training examples, you plug them 1032 00:57:52,629 --> 00:57:53,999 into this formula, 1033 00:57:53,999 --> 00:58:00,999 and that's how you compute the maximum likelihood estimates. Student:Is training examples you mean M is the 1034 00:58:01,289 --> 00:58:04,089 number of emails? Instructor (Andrew Ng):Yeah, right. So, right. 1035 00:58:04,089 --> 00:58:07,130 So it's, kind of, your training set. I would go through all the email I've gotten 1036 00:58:07,130 --> 00:58:08,869 in the last two months 1037 00:58:08,869 --> 00:58:11,129 and label them as spam or not spam, 1038 00:58:11,129 --> 00:58:14,450 and so you have - I don't 1039 00:58:14,450 --> 00:58:16,809 know, like, a few hundred emails 1040 00:58:16,809 --> 00:58:18,630 labeled as spam or not spam, 1041 00:58:18,630 --> 00:58:25,630 and that will comprise your training sets for X1 and Y1 through XM, 1042 00:58:28,049 --> 00:58:28,620 YM, 1043 00:58:28,620 --> 00:58:32,820 where X is one of those vectors representing which words appeared in the email and Y 1044 00:58:32,820 --> 00:58:39,820 is 0, 1 depending on whether they equal spam or not spam, okay? Student:So you are saying that this model depends on the number 1045 00:58:41,309 --> 00:58:42,019 1046 00:58:42,019 --> 00:58:49,019 of examples, but the last model doesn't depend on the models, but your phi is the 1047 00:58:49,829 --> 00:58:53,630 same for either one. Instructor (Andrew Ng):They're 1048 00:58:53,630 --> 00:58:58,019 different things, right? There's the model which is 1049 00:58:58,019 --> 00:59:00,659 - the modeling assumptions aren't made very well. 1050 00:59:00,659 --> 00:59:04,069 I'm assuming that - I'm making the Naive Bayes assumption. 1051 00:59:04,069 --> 00:59:08,900 So the probabilistic model is an assumption on the joint distribution 1052 00:59:08,900 --> 00:59:10,319 of X and Y. 1053 00:59:10,319 --> 00:59:12,390 That's what the model is, 1054 00:59:12,390 --> 00:59:16,650 and then I'm given a fixed number of training examples. I'm given M training examples, and 1055 00:59:16,650 --> 00:59:20,309 then it's, like, after I'm given the training sets, I'll then go in to write the maximum 1056 00:59:20,309 --> 00:59:22,299 likelihood estimate of the parameters, right? So that's, sort 1057 00:59:22,299 --> 00:59:24,529 of, 1058 00:59:24,529 --> 00:59:31,529 maybe we should take that offline for - yeah, ask a question? Student:Then how would you do this, like, 1059 00:59:38,389 --> 00:59:41,079 if this [inaudible] didn't work? Instructor (Andrew Ng):Say that again. Student:How would you do it, say, like the 50,000 words - Instructor (Andrew Ng):Oh, okay. How to do this with the 50,000 words, yeah. So 1060 00:59:41,079 --> 00:59:42,360 it turns out 1061 00:59:42,360 --> 00:59:45,859 this is, sort of, a very practical question, really. How do I count this list of 1062 00:59:45,859 --> 00:59:49,629 words? 
One common way to do this is to actually 1063 00:59:49,629 --> 00:59:52,899 find some way to count a list of words, like go through all your emails, go through 1064 00:59:52,899 --> 00:59:53,909 all the - 1065 00:59:53,909 --> 00:59:56,930 in practice, one common way to count a list of words 1066 00:59:56,930 --> 01:00:00,339 is to just take all the words that appear in your training set. That's one fairly common way 1067 01:00:00,339 --> 01:00:02,190 to do it, 1068 01:00:02,190 --> 01:00:06,130 or if that turns out to be too many words, you can take all words that appear 1069 01:00:06,130 --> 01:00:07,030 at least 1070 01:00:07,030 --> 01:00:08,080 three times 1071 01:00:08,080 --> 01:00:09,349 in your training set. So 1072 01:00:09,349 --> 01:00:10,659 words that 1073 01:00:10,659 --> 01:00:14,749 you didn't even see three times in the emails you got in the last 1074 01:00:14,749 --> 01:00:17,210 two months, you discard. So those are - I 1075 01:00:17,210 --> 01:00:20,259 was talking about going through a dictionary, which is a nice way of thinking about it, but in 1076 01:00:20,259 --> 01:00:22,049 practice, you might go through 1077 01:00:22,049 --> 01:00:26,409 your training set and then just take the union of all the words that appear in 1078 01:00:26,409 --> 01:00:29,759 it. In some of the tests I've even, by the way, said select these features, but this is one 1079 01:00:29,759 --> 01:00:32,309 way to think about 1080 01:00:32,309 --> 01:00:34,139 creating your feature vector, 1081 01:00:34,139 --> 01:00:41,139 right, as zero and one values, okay? Moving on, yeah. Okay. Ask a question? Student:I'm getting, kind of, confused on how you compute all those parameters. Instructor (Andrew Ng):On 1082 01:00:49,209 --> 01:00:51,609 how I came up with the parameters? 1083 01:00:51,609 --> 01:00:53,929 Student:Correct. Instructor (Andrew Ng):Let's see. 1084 01:00:53,929 --> 01:00:58,249 So in Naive Bayes, what I need to do - the question was how did I come up with the parameters, right? 1085 01:00:58,249 --> 01:00:59,279 In Naive Bayes, 1086 01:00:59,279 --> 01:01:01,499 I need to build a model 1087 01:01:01,499 --> 01:01:03,549 for P of X given Y and for 1088 01:01:03,549 --> 01:01:05,359 P of Y, 1089 01:01:05,359 --> 01:01:09,749 right? So this is, I mean, in generous of learning algorithms, I need to come up with 1090 01:01:09,749 --> 01:01:11,079 models for these. 1091 01:01:11,079 --> 01:01:15,359 So how'd I model P of Y? Well, I just those to model it using a Bernoulli 1092 01:01:15,359 --> 01:01:16,389 distribution, 1093 01:01:16,389 --> 01:01:17,150 and so 1094 01:01:17,150 --> 01:01:19,709 P of Y will be 1095 01:01:19,709 --> 01:01:22,109 parameterized by that, all right? Student:Okay. 1096 01:01:22,109 --> 01:01:26,519 Instructor (Andrew Ng):And then how'd I model P of X given Y? Well, 1097 01:01:26,519 --> 01:01:28,529 let's keep changing bullets. 1098 01:01:28,529 --> 01:01:32,729 My model for P of X given Y under the Naive Bayes assumption, I assume 1099 01:01:32,729 --> 01:01:34,539 that P of X given Y 1100 01:01:34,539 --> 01:01:37,549 is the product of these probabilities, 1101 01:01:37,549 --> 01:01:40,900 and so I'm going to need parameters to tell me 1102 01:01:40,900 --> 01:01:43,509 what's the probability of each word occurring, 1103 01:01:43,509 --> 01:01:46,309 you know, of each word occurring or not occurring, 1104 01:01:46,309 --> 01:01:53,309 conditions on the email being spam or not spam email, okay? Student:How is that 1105 01:01:54,639 --> 01:01:58,829 Bernoulli? 
Instructor (Andrew Ng):Oh, because X is either zero or one, right? By the way I defined the feature 1106 01:01:58,829 --> 01:02:00,799 vectors, XI 1107 01:02:00,799 --> 01:02:05,499 is either one or zero, depending on whether words I appear as in the email, 1108 01:02:05,499 --> 01:02:08,859 right? So by the way I define the 1109 01:02:08,859 --> 01:02:11,419 feature vectors, XI - 1110 01:02:11,419 --> 01:02:16,639 the XI is always zero or one. So that by definition, if XI, you know, is either zero or 1111 01:02:16,639 --> 01:02:19,969 one, then it has to be a Bernoulli distribution, right? 1112 01:02:19,969 --> 01:02:22,419 If XI would continue as then 1113 01:02:22,419 --> 01:02:24,819 you might model this as Gaussian and say you end up 1114 01:02:24,819 --> 01:02:27,299 like we did in Gaussian discriminant analysis. It's 1115 01:02:27,299 --> 01:02:30,719 just that the way I constructed my features for email, XI is always binary 1116 01:02:30,719 --> 01:02:32,129 value, and so you end up 1117 01:02:32,129 --> 01:02:32,990 with a 1118 01:02:32,990 --> 01:02:36,059 Bernoulli here, okay? All right. I 1119 01:02:36,059 --> 01:02:40,389 should move on. So 1120 01:02:40,389 --> 01:02:42,389 it turns out that 1121 01:02:42,389 --> 01:02:44,339 this idea 1122 01:02:44,339 --> 01:02:46,369 almost works. 1123 01:02:46,369 --> 01:02:47,719 Now, here's the problem. 1124 01:02:47,719 --> 01:02:49,959 So let's say you 1125 01:02:49,959 --> 01:02:54,439 complete this class and you start to do, maybe do the class project, and you 1126 01:02:54,439 --> 01:02:56,760 keep working on your class project for a bit, and it 1127 01:02:56,760 --> 01:03:01,439 becomes really good, and you want to submit your class project to a conference, right? So, 1128 01:03:01,439 --> 01:03:03,479 you know, around - I don't know, 1129 01:03:03,479 --> 01:03:08,039 June every year is the conference deadline for the next conference. 1130 01:03:08,039 --> 01:03:10,889 It's just the name of the conference; it's an acronym. 1131 01:03:10,889 --> 01:03:12,359 And so maybe 1132 01:03:12,359 --> 01:03:16,080 you send your project partners or senior friends even, and say, Hey, let's 1133 01:03:16,080 --> 01:03:20,069 work on a project and submit it to the NIPS conference. And so you're getting these emails 1134 01:03:20,069 --> 01:03:21,489 with the word NIPS in them, 1135 01:03:21,489 --> 01:03:24,329 which you've probably never seen before, 1136 01:03:24,329 --> 01:03:28,179 and so a 1137 01:03:28,179 --> 01:03:32,519 piece of email comes from your project partner, and so you 1138 01:03:32,519 --> 01:03:35,639 go, Let's send a paper to the NIPS conference. 1139 01:03:35,639 --> 01:03:37,310 And then your stamp classifier 1140 01:03:37,310 --> 01:03:38,759 will say 1141 01:03:38,759 --> 01:03:39,930 P of X - 1142 01:03:39,930 --> 01:03:44,630 let's say NIPS is the 30,000th word in your dictionary, okay? 1143 01:03:44,630 --> 01:03:46,669 So X30,000 given 1144 01:03:46,669 --> 01:03:50,189 the 1, given 1145 01:03:50,189 --> 01:03:51,619 Y = 1146 01:03:51,619 --> 01:03:52,799 1 1147 01:03:52,799 --> 01:03:54,029 will be equal to 0. 1148 01:03:54,029 --> 01:03:57,479 That's the maximum likelihood of this, right? 
Because you've never seen the word NIPS before in 1149 01:03:57,479 --> 01:04:01,199 your training set, so maximum likelihood of the parameter is that probably have seen the word 1150 01:04:01,199 --> 01:04:02,459 NIPS is zero, 1151 01:04:02,459 --> 01:04:06,789 and, similarly, 1152 01:04:06,789 --> 01:04:08,379 you 1153 01:04:08,379 --> 01:04:12,459 know, in, I guess, non-spam mail, the chance of seeing the word NIPS is also 1154 01:04:12,459 --> 01:04:14,729 estimated 1155 01:04:14,729 --> 01:04:21,209 as zero. 1156 01:04:21,209 --> 01:04:28,209 So 1157 01:04:34,789 --> 01:04:40,709 when your spam classifier goes to compute P of Y = 1 given X, it will 1158 01:04:40,709 --> 01:04:44,529 compute this 1159 01:04:44,529 --> 01:04:47,209 right here P of Y 1160 01:04:47,209 --> 01:04:54,209 over - well, 1161 01:04:56,179 --> 01:05:03,179 all 1162 01:05:05,089 --> 01:05:09,509 right. 1163 01:05:09,509 --> 01:05:11,329 And so 1164 01:05:11,329 --> 01:05:15,759 you look at that terms, say, this will be product from I = 1165 01:05:15,759 --> 01:05:18,379 1 to 50,000, 1166 01:05:18,379 --> 01:05:22,359 P of XI given Y, 1167 01:05:22,359 --> 01:05:26,139 and one of those probabilities will be equal to 1168 01:05:26,139 --> 01:05:28,169 1169 01:05:28,169 --> 01:05:32,339 zero because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a 1170 01:05:32,339 --> 01:05:36,699 zero in this product, and so the numerator is zero, 1171 01:05:36,699 --> 01:05:40,249 and in the same way, it turns out the denominator will also be zero, and so you end 1172 01:05:40,249 --> 01:05:42,719 up with - 1173 01:05:42,719 --> 01:05:45,839 actually all of these terms end up being zero. So you end up with P of Y = 1 1174 01:05:45,839 --> 01:05:48,779 given X is 0 over 0 + 0, okay, which is 1175 01:05:48,779 --> 01:05:51,410 undefined. And the 1176 01:05:51,410 --> 01:05:53,959 problem with this is that it's 1177 01:05:53,959 --> 01:05:57,240 just statistically a bad idea 1178 01:05:57,240 --> 01:06:00,209 to say that P of X30,000 1179 01:06:00,209 --> 01:06:02,950 given Y is 1180 01:06:02,950 --> 01:06:03,490 0, 1181 01:06:03,490 --> 01:06:06,270 right? Just because you haven't seen the word NIPS in your last 1182 01:06:06,270 --> 01:06:10,420 two months worth of email, it's also statistically not sound to say that, 1183 01:06:10,420 --> 01:06:16,319 therefore, the chance of ever seeing this word is zero, right? 1184 01:06:16,319 --> 01:06:19,170 And so 1185 01:06:19,170 --> 01:06:20,479 1186 01:06:20,479 --> 01:06:22,719 1187 01:06:22,719 --> 01:06:25,819 is this idea that just because you haven't seen something 1188 01:06:25,819 --> 01:06:30,400 before, that may mean that that event is unlikely, but it doesn't mean that 1189 01:06:30,400 --> 01:06:33,920 it's impossible, and just saying that if you've never seen the word NIPS before, 1190 01:06:33,920 --> 01:06:40,759 then it is impossible to ever see the word NIPS in future emails; the chance of that is just zero. 
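As a toy numerical illustration of this failure mode, with entirely made-up estimates for a four-word vocabulary, the posterior literally evaluates to 0/0 once one word has a zero estimate in both classes:

```python
# An unsmoothed estimate assigns probability 0 to a word never seen in
# training, and the Bayes-rule posterior collapses to 0/0.
# Hypothetical per-word estimates phi_{j|y}; the last word ("nips") never
# appeared in either class, so both of its estimates are 0.
phi_given_spam     = [0.30, 0.10, 0.05, 0.0]
phi_given_not_spam = [0.05, 0.20, 0.10, 0.0]
phi_y = 0.4  # made-up estimate of p(y = 1)

def class_likelihood(x, phi):
    """p(x | y) under Naive Bayes: a product of Bernoulli terms."""
    p = 1.0
    for xi, phij in zip(x, phi):
        p *= phij if xi == 1 else (1.0 - phij)
    return p

x = [1, 0, 0, 1]  # the new email contains word 0 and the unseen word 3
num = class_likelihood(x, phi_given_spam) * phi_y
den = num + class_likelihood(x, phi_given_not_spam) * (1 - phi_y)
print(num, den)                            # both 0.0
print(num / den if den else float("nan"))  # 0/0 -> undefined (nan)
```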
1191 01:06:40,759 --> 01:06:47,759 So we're gonna fix this, 1192 01:06:48,339 --> 01:06:50,079 and 1193 01:06:50,079 --> 01:06:54,119 to motivate the fix I'll talk about 1194 01:06:54,119 --> 01:06:58,939 - the example we're gonna use is let's say that you've been following the Stanford basketball 1195 01:06:58,939 --> 01:07:04,099 team for all of their away games, and been, sort of, tracking their wins and losses 1196 01:07:04,099 --> 01:07:08,169 to gather statistics, and, maybe - I don't know, form a betting pool about 1197 01:07:08,169 --> 01:07:11,059 whether they're likely to win or lose the next game, okay? 1198 01:07:11,059 --> 01:07:12,630 So 1199 01:07:12,630 --> 01:07:15,409 these are some of the statistics. 1200 01:07:15,409 --> 01:07:17,589 1201 01:07:17,589 --> 01:07:19,429 1202 01:07:19,429 --> 01:07:24,819 So on, I guess, the 8th of February 1203 01:07:24,819 --> 01:07:29,299 last season they played Washington State, and they 1204 01:07:29,299 --> 01:07:31,779 did not win. 1205 01:07:31,779 --> 01:07:36,239 On 1206 01:07:36,239 --> 01:07:38,209 the 11th of February, 1207 01:07:38,209 --> 01:07:42,349 they play Washington, 22nd 1208 01:07:42,349 --> 01:07:47,259 they played USC, 1209 01:07:47,259 --> 01:07:49,769 1210 01:07:49,769 --> 01:07:54,929 played UCLA, 1211 01:07:54,929 --> 01:07:57,630 played USC again, 1212 01:07:57,630 --> 01:08:03,739 1213 01:08:03,739 --> 01:08:05,819 and now you want to estimate 1214 01:08:05,819 --> 01:08:09,880 what's the chance that they'll win or lose against Louisville, right? 1215 01:08:09,880 --> 01:08:11,169 So 1216 01:08:11,169 --> 01:08:14,909 find the four guys last year or five times and they weren't good in their away games, but it 1217 01:08:14,909 --> 01:08:17,009 seems awfully harsh to say that - so it 1218 01:08:17,009 --> 01:08:24,009 seems awfully harsh to say there's zero chance that they'll 1219 01:08:37,079 --> 01:08:40,269 win in the last - in the 5th game. So here's the idea behind Laplace smoothing 1220 01:08:40,269 --> 01:08:42,629 which is 1221 01:08:42,629 --> 01:08:44,819 that we're estimate 1222 01:08:44,819 --> 01:08:48,219 the probably of Y being equal to one, right? 1223 01:08:48,219 --> 01:08:53,189 Normally, the maximum likelihood [inaudible] is the 1224 01:08:53,189 --> 01:08:56,059 number of ones 1225 01:08:56,059 --> 01:08:57,659 divided by 1226 01:08:57,659 --> 01:09:00,759 the number of zeros 1227 01:09:00,759 --> 01:09:05,350 plus the number of ones, okay? I 1228 01:09:05,350 --> 01:09:07,869 hope this informal notation makes sense, right? Knowing 1229 01:09:07,869 --> 01:09:11,349 the maximum likelihood estimate for, sort of, a win or loss for Bernoulli random 1230 01:09:11,349 --> 01:09:12,219 variable 1231 01:09:12,219 --> 01:09:14,629 is 1232 01:09:14,629 --> 01:09:16,969 just the number of ones you saw 1233 01:09:16,969 --> 01:09:20,859 divided by the total number of examples. So it's the number of zeros you saw plus the number of ones you saw. 
So in 1234 01:09:20,859 --> 01:09:22,859 the Laplace 1235 01:09:22,859 --> 01:09:24,170 Smoothing 1236 01:09:24,170 --> 01:09:25,799 we're going to 1237 01:09:25,799 --> 01:09:29,549 just take each of these terms, the number of ones and, sort of, add one to that, the number 1238 01:09:29,549 --> 01:09:31,830 of zeros and add one to that, the 1239 01:09:31,830 --> 01:09:34,129 number of ones and add one to that, 1240 01:09:34,129 --> 01:09:37,109 and so in our example, 1241 01:09:37,109 --> 01:09:38,299 instead of estimating 1242 01:09:38,299 --> 01:09:42,890 the probability of winning the next game to be 0 ÷ 1243 01:09:42,890 --> 01:09:45,210 5 + 0, 1244 01:09:45,210 --> 01:09:50,319 we'll add one to all of these counts, and so we say that the chance of 1245 01:09:50,319 --> 01:09:51,850 their 1246 01:09:51,850 --> 01:09:53,799 winning the next game is 1/7th, 1247 01:09:53,799 --> 01:09:55,179 okay? Which is 1248 01:09:55,179 --> 01:09:59,310 that having seen them lose, you know, five away games in a row, we aren't terribly - 1249 01:09:59,310 --> 01:10:02,510 we don't think it's terribly likely they'll win the next game, but at 1250 01:10:02,510 --> 01:10:05,979 least we're not saying it's impossible. 1251 01:10:05,979 --> 01:10:09,949 As a historical side note, the Laplace actually came up with the method. 1252 01:10:09,949 --> 01:10:14,320 It's called the Laplace smoothing after him. 1253 01:10:14,320 --> 01:10:18,399 1254 01:10:18,399 --> 01:10:21,919 When he was trying to estimate the probability that the sun will rise tomorrow, and his rationale 1255 01:10:21,919 --> 01:10:25,239 was in a lot of days now, we've seen the sun rise, 1256 01:10:25,239 --> 01:10:28,489 but that doesn't mean we can be absolutely certain the sun will rise tomorrow. 1257 01:10:28,489 --> 01:10:31,420 He was using this to estimate the probability that the sun will rise tomorrow. This is, kind of, 1258 01:10:31,420 --> 01:10:36,449 cool. So, 1259 01:10:36,449 --> 01:10:38,549 and more generally, 1260 01:10:38,549 --> 01:10:41,570 1261 01:10:41,570 --> 01:10:43,969 1262 01:10:43,969 --> 01:10:45,429 1263 01:10:45,429 --> 01:10:46,449 1264 01:10:46,449 --> 01:10:49,409 if Y 1265 01:10:49,409 --> 01:10:52,840 takes on 1266 01:10:52,840 --> 01:10:55,150 K possible of values, 1267 01:10:55,150 --> 01:10:59,479 if you're trying to estimate the parameter of the multinomial, then you estimate P of Y = 1. 1268 01:10:59,479 --> 01:11:01,729 Let's 1269 01:11:01,729 --> 01:11:05,900 see. 1270 01:11:05,900 --> 01:11:11,380 So the maximum likelihood estimate will be Sum from J = 1 to M, 1271 01:11:11,380 --> 01:11:16,499 indicator YI = J ÷ M, 1272 01:11:16,499 --> 01:11:19,379 right? 1273 01:11:19,379 --> 01:11:21,629 That's the maximum likelihood estimate 1274 01:11:21,629 --> 01:11:24,369 of a multinomial probability of Y 1275 01:11:24,369 --> 01:11:26,929 being equal to - oh, 1276 01:11:26,929 --> 01:11:29,229 excuse me, Y = J. All right. 1277 01:11:29,229 --> 01:11:33,110 That's the maximum likelihood estimate for the probability of Y = J, 1278 01:11:33,110 --> 01:11:36,890 and so when you apply Laplace smoothing to that, 1279 01:11:36,890 --> 01:11:39,639 you add one to the numerator, and 1280 01:11:39,639 --> 01:11:41,889 add K to the denominator, 1281 01:11:41,889 --> 01:11:48,889 if Y can take up K possible values, okay? 1282 01:12:06,469 --> 01:12:08,579 So for Naive Bayes, 1283 01:12:08,579 --> 01:12:14,349 what that gives us is - 1284 01:12:14,349 --> 01:12:21,349 shoot. 1285 01:12:38,639 --> 01:12:44,320 Right? 
So that was the maximum likelihood estimate, and what you 1286 01:12:44,320 --> 01:12:47,949 end up doing is adding one to the numerator and adding two to the denominator, 1287 01:12:47,949 --> 01:12:52,030 and this solves the problem of the zero probabilities, and when your friend sends 1288 01:12:52,030 --> 01:12:57,489 you email about the NIPS conference, 1289 01:12:57,489 --> 01:13:00,799 your spam filter will still be able to 1290 01:13:00,799 --> 01:13:07,799 make a meaningful prediction, all right? Okay. 1291 01:13:12,030 --> 01:13:14,829 Shoot. Any questions about this? Yeah? Student:So that's what doesn't makes sense because, for instance, if you take the 1292 01:13:14,829 --> 01:13:16,239 games on the right, it's liberal assumptions that the probability 1293 01:13:16,239 --> 01:13:23,239 of 1294 01:13:24,650 --> 01:13:27,960 winning is very close to zero, so, I mean, the prediction should 1295 01:13:27,960 --> 01:13:29,789 be equal to PF, 0. Instructor (Andrew Ng):Right. 1296 01:13:29,789 --> 01:13:35,959 I would say that 1297 01:13:35,959 --> 01:13:38,139 in this case the prediction 1298 01:13:38,139 --> 01:13:41,170 is 1/7th, right? We don't have a lot of - if you see somebody lose five games 1299 01:13:41,170 --> 01:13:44,289 in a row, you may not have a lot of faith in them, 1300 01:13:44,289 --> 01:13:48,520 but as an extreme example, suppose you saw them lose one game, 1301 01:13:48,520 --> 01:13:51,500 right? It's just not reasonable to say that the chances of winning the next game 1302 01:13:51,500 --> 01:13:52,860 is zero, but 1303 01:13:52,860 --> 01:13:58,900 that's what maximum likelihood 1304 01:13:58,900 --> 01:14:01,150 estimate 1305 01:14:01,150 --> 01:14:05,099 will say. Student:Yes. Instructor (Andrew Ng):And - 1306 01:14:05,099 --> 01:14:08,139 Student:In such a case anywhere the learning algorithm [inaudible] or - Instructor (Andrew Ng):So some questions of, you 1307 01:14:08,139 --> 01:14:11,309 know, given just five training examples, what's a reasonable estimate for the chance of 1308 01:14:11,309 --> 01:14:13,170 winning the next game, 1309 01:14:13,170 --> 01:14:14,119 and 1310 01:14:14,119 --> 01:14:18,460 1/7th is, I think, is actually pretty reasonable. It's less than 1/5th for instance. 1311 01:14:18,460 --> 01:14:21,119 We're saying the chances of winning the next game is less 1312 01:14:21,119 --> 01:14:25,069 than 1/5th. It turns out, under a certain set of assumptions I won't go into - under a certain set of 1313 01:14:25,069 --> 01:14:27,859 Bayesian assumptions about the prior and posterior, 1314 01:14:27,859 --> 01:14:31,280 this Laplace smoothing actually gives the optimal estimate, 1315 01:14:31,280 --> 01:14:33,459 in a certain sense I won't go into 1316 01:14:33,459 --> 01:14:37,009 of what's the chance of winning the next game, and so under a certain assumption 1317 01:14:37,009 --> 01:14:38,079 about the 1318 01:14:38,079 --> 01:14:40,679 Bayesian prior on the parameter. 1319 01:14:40,679 --> 01:14:44,260 So I don't know. It actually seems like a pretty reasonable assumption to me. 1320 01:14:44,260 --> 01:14:46,940 Although, I should say, it actually 1321 01:14:46,940 --> 01:14:50,059 turned out - 1322 01:14:50,059 --> 01:14:53,530 No, I'm just being mean. We actually are a pretty good basketball team, but I chose a 1323 01:14:53,530 --> 01:14:54,660 losing streak 1324 01:14:54,660 --> 01:14:58,769 because it's funnier that way. 1325 01:14:58,769 --> 01:15:05,769 Let's see. Shoot. 
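Putting the pieces together, here is a compact sketch, in Python, of the whole Bernoulli Naive Bayes classifier with the Laplace-smoothed estimates just described (add one to each numerator, two to each denominator); the function names and the tiny data set are illustrative assumptions:

```python
import math

def train_naive_bayes(X, y):
    """X: list of 0/1 feature vectors; y: list of 0/1 labels (both classes present).
    Returns (phi_y, phi_given_1, phi_given_0), with Laplace smoothing applied
    to the per-word estimates: add 1 to each count, 2 to each class total."""
    m, n = len(X), len(X[0])
    count_y1 = sum(y)
    count_y0 = m - count_y1
    phi_y = count_y1 / m  # plain maximum-likelihood estimate for p(y = 1)
    phi_given_1 = [(sum(X[i][j] for i in range(m) if y[i] == 1) + 1) / (count_y1 + 2)
                   for j in range(n)]
    phi_given_0 = [(sum(X[i][j] for i in range(m) if y[i] == 0) + 1) / (count_y0 + 2)
                   for j in range(n)]
    return phi_y, phi_given_1, phi_given_0

def predict_spam_probability(x, phi_y, phi_given_1, phi_given_0):
    """p(y = 1 | x) by Bayes' rule, computed in log space for stability."""
    log_p1, log_p0 = math.log(phi_y), math.log(1 - phi_y)
    for xi, p1, p0 in zip(x, phi_given_1, phi_given_0):
        log_p1 += math.log(p1 if xi else 1 - p1)
        log_p0 += math.log(p0 if xi else 1 - p0)
    mx = max(log_p1, log_p0)
    e1, e0 = math.exp(log_p1 - mx), math.exp(log_p0 - mx)
    return e1 / (e1 + e0)

# Tiny made-up training set: 3 words in the vocabulary, 2 spam and 2 non-spam emails.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
params = train_naive_bayes(X, y)
print(predict_spam_probability([1, 0, 0], *params))  # 0.75
```

Because of the smoothing, no per-word estimate is ever exactly zero or one, so the logarithms in the prediction step are always defined, even for words that never appeared in one of the two classes in the training set.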
Does someone want to - are there any other questions about 1326 01:15:07,489 --> 01:15:09,320 1327 01:15:09,320 --> 01:15:13,380 this? No? Okay. So there's more that I want to say about Naive Bayes, but 1328 01:15:13,380 --> 01:15:15,059 1329 01:15:15,059 --> 01:15:15,860 we'll do that in the next lecture. So let's wrap up here.