This subtitle is provided from lecsub.jimdo.com under a Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

What I want to do today is continue our discussion of Naive Bayes, which is the learning algorithm that I started to discuss in the previous lecture, and talk about a couple of different event models in Naive Bayes. Then I'll take a brief digression to talk about neural networks, which is something that I actually won't spend a lot of time on, and then I want to start to talk about support vector machines. The support vector machine is the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people that hold it, and we'll start discussing that today; this will actually take us a few lectures to complete. So let's talk about Naive Bayes.
To recap from the previous lecture, I started off describing spam classification as the most [inaudible] example for Naive Bayes, in which we would create feature vectors like these that correspond to words in a dictionary. And so, based on what words appear in a piece of email, it was represented as a feature vector with ones and zeros in the corresponding places. And Naive Bayes was a generative learning algorithm; by that I mean it's an algorithm in which we model P(x|y), and for Naive Bayes specifically, we modeled it as the product from i = 1 to n of P(x_i|y), and we also model P(y), and then we use Bayes' rule to combine these two together. And so for our predictions, when you're given a new piece of email and you want to tell if it's spam or not spam, you predict arg max over y of P(y|x), which by Bayes' rule is arg max over y of P(x|y) times P(y), okay?
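To make the recap concrete, here is a minimal sketch of that arg max prediction rule for the binary-feature (multivariate Bernoulli) model; it assumes the parameters have already been estimated, and all the names are illustrative:

```python
import numpy as np

def predict_naive_bayes(x, phi_x_given_y1, phi_x_given_y0, phi_y):
    """Predict spam (1) or not-spam (0) for a binary feature vector x.

    phi_x_given_y1[i] = P(x_i = 1 | y = 1), phi_x_given_y0[i] = P(x_i = 1 | y = 0),
    and phi_y = P(y = 1).  We compare P(x|y=1)P(y=1) against P(x|y=0)P(y=0),
    working in log space to avoid underflow.
    """
    x = np.asarray(x)
    log_p1 = np.sum(np.log(np.where(x == 1, phi_x_given_y1, 1 - phi_x_given_y1))) + np.log(phi_y)
    log_p0 = np.sum(np.log(np.where(x == 1, phi_x_given_y0, 1 - phi_x_given_y0))) + np.log(1 - phi_y)
    return 1 if log_p1 > log_p0 else 0
```

Working in log space matters because with a 50,000-word dictionary, the product of that many small probabilities would underflow to zero in floating point.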
So this is Naive Bayes, and just to draw attention to two things: one is that in this model, each of our features was zero or one, indicating whether different words appear, and the length n of the feature vector was the number of words in the dictionary. So it might be, in this version, on the order of 50,000 words, say. What I want to do now is describe two variations on this algorithm. The first one is the simpler one; it's just a generalization to the case where x_i takes on more values. So one thing that's commonly done is to apply Naive Bayes to problems where some of these features x_i take on k values rather than just two values, and in that case you actually build a very similar model where P(x|y) is really the same thing, except that now these are going to be multinomial probabilities rather than Bernoullis, because the x_i's can take on up to k values.
It turns out one situation where this arises very commonly is if you have a feature that's actually continuous-valued, and you choose to discretize it: you take the continuous-valued feature and discretize it into a finite set of k values. And a perfect example, if you remember our very first supervised learning problem, is predicting the price of houses. Suppose you have a classification problem on these houses: based on features of a house, you want to predict whether or not the house will be sold in the next six months, say. That's a classification problem, and once you use Naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do would be to take the continuous-valued living area and just discretize it into a few discrete buckets. And so, depending on whether the living area of the house is less than 500 square feet, or between 1,000 and 1,500 square feet, and so on, or whether it's greater than 2,000 square feet, you choose the value of the corresponding feature x_i to be one, two, three, or four, okay?
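A minimal sketch of this discretization step, assuming NumPy and some hand-chosen bucket edges (the exact edges below are illustrative assumptions, not the lecture's):

```python
import numpy as np

# Illustrative bucket edges in square feet; the lecture hand-picks a few
# buckets, but the precise boundaries here are assumptions for the example.
edges = np.array([500.0, 1000.0, 1500.0])

def discretize_living_area(sqft):
    """Map a continuous living area to a discrete feature value in {1, 2, 3, 4}."""
    # np.digitize returns how many edges are <= sqft, i.e. the bucket index 0..3
    return int(np.digitize(sqft, edges)) + 1
```

The resulting integer can then be used directly as a multinomial feature value in the model above.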
So that was the first variation or generalization of Naive Bayes I wanted to talk about. I should just check: are there questions about this? Okay, cool. And so it turns out that in practice, it's fairly common to use about ten buckets to discretize a continuous-valued feature; I drew four here only to save on writing. The second and, sort of, final variation that I want to talk about for Naive Bayes is a variation that's specific to classifying text documents, or, more generally, to classifying sequences. A text document, like a piece of email, you can think of as a sequence of words, and you can apply the sort of model I'm about to describe to classifying other sequences as well, but let me just focus on text, and here's the idea. In the Naive Bayes model, given a piece of email, we were representing it using this binary vector-valued representation, and one of the things this loses, for instance, is the number of times that different words appear. So, for example, suppose some word appears a lot of times, and you see the word, you know, "buy" a lot of times. Or you see the word "Viagra"; it seems to be a common email example.
You see the word Viagra a ton of times in the email; it is then more likely to be spam than if it appears, I guess, only once, because even once, I guess, is enough. So let me describe a different, what's called an event model, for Naive Bayes that will take into account the number of times a word appears in the email. And to give this previous model a name as well: this particular model for text classification is called the multivariate Bernoulli event model. It's not a great name; don't worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but really, don't worry about what the name means. In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the multinomial event model. Again, there is a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called the multinomial event model; it's just called that.
And here's what we're going to do: given a piece of email, I'm going to represent my email as a feature vector, and so my i-th training example x^(i) will be a feature vector (x_1^(i), x_2^(i), ..., x_{n_i}^(i)), where n_i is equal to the number of words in this email, right? So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector (let me just write this as x subscript j) will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to what word appeared in the j-th position of my email, okay? So, in other words, I'm going to take all the words in my email, and I have a feature vector that just says which word in my dictionary each word in the email was, okay?
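As a sketch of this representation, with a hypothetical miniature dictionary standing in for the 50,000-word one (and, as a simplification not in the lecture, silently skipping out-of-dictionary words):

```python
# A hypothetical miniature dictionary; a real one would have ~50,000 entries.
dictionary = {"buy": 0, "now": 1, "meeting": 2, "tomorrow": 3}

def email_to_features(words):
    """Represent an email as a list of dictionary indices, one per word position.

    The length of the result varies with the email, unlike the fixed-length
    binary vector of the multivariate Bernoulli model.
    """
    return [dictionary[w] for w in words if w in dictionary]
```

Note that repeated words produce repeated indices, which is exactly the information the binary representation threw away.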
So a different definition for n_i now: n_i now varies and is different for every training example, and this x_j is now an index into the dictionary. You know, the components of the feature vector are no longer binary random variables; they're these indices into the dictionary that take on a much larger set of values. And so our generative model for this will be that the joint distribution over x and y is P(x, y) = (product from j = 1 to n of P(x_j|y)) times P(y), where again n is now the length of the email, all right? So the way to think about this formula is you imagine that there was some probability distribution over emails; there's some random process that generates the emails, and that process proceeds as follows: first, y is chosen, first the class label, whether someone is going to send you spam email or not spam emails, is chosen for us.
So first y, the random variable y, the class label of spam or not spam, is generated, and then, having decided whether they send you spam or not spam, someone iterates over all 300 positions of the email, or however many words are going to compose the email, and generates words from some distribution that depends on whether they chose to send you spam or not spam. So if they sent you spam, they'll tend to generate words like, you know, buy, and Viagra, and whatever, at discounts, sale, whatever. And if somebody chose to send you not spam, then they'll send you, sort of, the more normal words you get in an email, okay? So just be careful, right? x_i here has a very different definition from the previous event model, and n has a very different definition from the previous event model. And so the parameters of this model are, let's see:
Phi subscript k given y = 1, which is the probability that, you know, conditioned on someone deciding to send you spam, the next word they choose to email you in the spam email is going to be word k; and similarly, phi subscript k given y = 0, sort of, the same thing for non-spam (well, I'll just write it out, I guess); and phi_y, just the same as before, okay? So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of the parameters will be equal to (and now I'm going to write one of these, you know, big indicator-function things again) a sum over your training set of the indicator of whether that email was spam, times the sum over all the words in that email, where n_i is the number of words in email i in your training set, of the indicator that x_j^(i) equals k:

phi_{k|y=1} = ( sum_{i=1}^{m} 1{y^(i) = 1} sum_{j=1}^{n_i} 1{x_j^(i) = k} ) / ( sum_{i=1}^{m} 1{y^(i) = 1} n_i ), okay?
So the numerator says: sum over all your emails, but take into account only the emails that had class label one, only the emails that were spam, because if y equals zero, then this indicator is zero and the term goes away; and then sum over all the words in your spam email, and count up the number of times you observed the word k in your spam emails. So, in other words, the numerator is: look at all the spam emails in your training set and count up the total number of times the word k appeared in these emails. The denominator then is: sum over i in your training set, and whenever one of your examples is spam, you know, sum up the length of that spam email; so the denominator is the total length of all of the spam emails in your training set. And so the ratio is just: out of all your spam emails, what is the fraction of words in all your spam emails that were word k? And that's your estimate for the probability of the next piece of spam mail generating the word k in any given position, okay?
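The estimate just described (count of word k across all spam emails, divided by the total length of all spam emails) can be sketched as follows; the function name and data layout are illustrative:

```python
import numpy as np

def mle_phi_k_given_spam(emails, labels, vocab_size):
    """Maximum likelihood estimate of phi_{k|y=1} in the multinomial event model.

    emails: list of lists of word indices (one inner list per email)
    labels: list of 0/1 spam labels, one per email
    Returns an array whose k-th entry is
    (occurrences of word k in spam emails) / (total words in spam emails).
    """
    counts = np.zeros(vocab_size)
    total = 0
    for words, y in zip(emails, labels):
        if y == 1:                 # only spam emails contribute to the sums
            for k in words:
                counts[k] += 1
            total += len(words)
    return counts / total
```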
At the end of the previous lecture, I talked about Laplace smoothing, and so when you do that as well, you add one to the numerator and k to the denominator, and this is the Laplace-smoothed estimate of this parameter, okay? And similarly, you can work out the estimates for the other parameters yourself, okay? So it's very similar. Yeah?

Student: I'm sorry. On the right, on the top, I was just wondering what the x_i is, and what the n...

Instructor (Andrew Ng): Right. So in this second event model, the definition for x_i and the definition for n are different, right? So here, well, this is for one example (x, y). So here, n is the number of words in a given email, right? And if it's the i-th email, we subscript it, so this is n subscript i, and so n will be different for different training examples. And here x_i will take, you know, these values from 1 to 50,000, and x_i is essentially the identity of the i-th word in a given piece of email, okay?
So that's why this is a product over all the different words of your email of the probability of the i-th word in your email, conditioned on y. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, no, actually, you know what, I apologize. I just realized that I overloaded the notation, right; I shouldn't have used k here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, I'm sorry, you're absolutely right. Thank you. All right. So in Laplace smoothing, that shouldn't be k. This should be, you know, 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didn't translate it properly. So in Laplace smoothing here, what you add to the denominator is the number of possible values that the random variable x_i can take on. Cool. Raise your hand if this makes sense? Okay. Are there more questions about this?
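The corrected rule just settled on (add one to each count, and add the number of possible values of x_i, e.g. 50,000 for the dictionary, to the denominator) might be sketched like this for a generic multinomial; the function name is illustrative:

```python
import numpy as np

def laplace_smoothed_estimate(observations, num_values):
    """Laplace-smoothed estimate of a multinomial distribution:
    P(x = k) = (count of k + 1) / (total observations + num_values),
    so every outcome gets nonzero probability even if never observed."""
    counts = np.bincount(observations, minlength=num_values)
    return (counts + 1) / (len(observations) + num_values)
```

This is what keeps the spam classifier from assigning probability zero to a dictionary word that happens not to appear in any training email.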
Yeah.

Student: On Laplace smoothing, in the denominator, the plus term is the number of values that y could take?

Instructor (Andrew Ng): Yeah, let's see. So Laplace smoothing is a method to give you, sort of, hopefully, better estimates of a probability distribution over a multinomial. And so, was I using x or y in the previous lecture? In trying to estimate the probability over a multinomial... I think x and y are different. Was it x or y? I think it was x, actually. Well, oh, I see, right, right. I think I was using a different definition for the random variable, because suppose you have a multinomial random variable x which takes on (let's use a different alphabet) L different values. Then the maximum likelihood estimate for the probability of x, P(x = k), will be equal to, right, the number of observations: the maximum likelihood estimate for the probability of x being equal to k will be the number of observations of x = k divided by the total number of observations of x, okay? So that's the maximum likelihood estimate.
And to add Laplace smoothing to this, you, sort of, add one to the numerator, and you add L to the denominator, where L is the number of possible values that x can take on. So, in this case, this is the probability that x equals k, and x can take on 50,000 values if 50,000 is the length of your dictionary (it may be something else), but that's why I add 50,000 to the denominator. Are there other questions? Yeah.

Student: Is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple times, and all the examples make sense, but I don't know what the, like, general formula for it is.

Instructor (Andrew Ng): I see. Yeah, right. So the question is, what's the definition of the maximum likelihood estimate? So actually, in today's lecture and the previous lecture, when I talked about Gaussian Discriminant Analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work these out is to actually write down the likelihood.
350 00:20:29,070 --> 00:20:32,850 So the way to figure out all of these maximum likelihood estimates is to write 351 00:20:32,850 --> 00:20:33,730 down 352 00:20:33,730 --> 00:20:35,190 the likelihood of 353 00:20:35,190 --> 00:20:41,170 the parameters, phi K given Y being 354 00:20:41,170 --> 00:20:43,810 zero, phi 355 00:20:43,810 --> 00:20:44,450 Y, 356 00:20:44,450 --> 00:20:47,090 right? And so given a training set, 357 00:20:47,090 --> 00:20:50,350 the likelihood, I guess, I should be writing log likelihood 358 00:20:50,350 --> 00:20:54,740 will be the log of the product of I equals one to 359 00:20:54,740 --> 00:20:55,910 N, PFXI, YI, 360 00:20:55,910 --> 00:20:57,950 you 361 00:20:57,950 --> 00:21:04,950 know, parameterized by these things, okay? 362 00:21:06,940 --> 00:21:09,450 Where PFXI, YI, right, is given by NI, PFX, YJ given YI. They are 363 00:21:09,450 --> 00:21:12,010 parameterized 364 00:21:12,010 --> 00:21:13,500 by 365 00:21:13,500 --> 00:21:19,299 366 00:21:19,299 --> 00:21:21,520 367 00:21:21,520 --> 00:21:26,930 - well, 368 00:21:26,930 --> 00:21:28,140 369 00:21:28,140 --> 00:21:31,690 I'll just drop the parameters to write this 370 00:21:31,690 --> 00:21:38,690 371 00:21:42,370 --> 00:21:49,200 more simply - oh, I just put it in - times PFYI, okay? 372 00:21:49,200 --> 00:21:50,970 So this is my log likelihood, 373 00:21:50,970 --> 00:21:55,090 and so the way you get the maximum likelihood estimate of the parameters 374 00:21:55,090 --> 00:21:55,809 is you - 375 00:21:55,809 --> 00:21:59,750 so if given a fixed training set, given a set of fixed IYI's, 376 00:21:59,750 --> 00:22:02,180 you maximize this in terms of 377 00:22:02,180 --> 00:22:05,740 these parameters, and then you get the maximum likelihood estimates that I've been writing 378 00:22:05,740 --> 00:22:07,530 out. 
So in a previous section of today's lecture, I wrote out some maximum likelihood estimates for the Gaussian Discriminant Analysis model and for Naive Bayes, and I didn't prove them, but you get to, sort of, play with that yourself in the homework problems as well, for one of these models, and you'll be able to verify that when you maximize the likelihood, maximize the log likelihood, hopefully you do get the same formulas as what I was drawing up on the board; but the way these are derived is by maximizing this, okay? Cool. All right. So that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that for text classification, the multinomial event model, the last Naive Bayes model, almost always does better than the first Naive Bayes model I talked about when you're applying it to the specific case of text classification, and one of the reasons hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document,
whereas the former model doesn't. I should say that, in truth, it actually turns out not to be completely understood why the latter model does better than the former one for text classification, and, sort of, researchers are still debating about it a little bit, but if you ever have a text classification problem, you know, the Naive Bayes classifier is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.

Student: So the second model is still position invariant, right? It doesn't actually care where the words are.

Instructor (Andrew Ng): Yes, all right.

Student: And, I mean, [inaudible] if my model, like, you had exclamation in, does that usually do better if you have enough data?

Instructor (Andrew Ng): Yeah, so the question is about the second model, right? The second model, the multinomial event model, actually doesn't care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing.
426 00:24:26,990 --> 00:24:27,770 So 427 00:24:27,770 --> 00:24:30,929 in natural language processing, there's actually another name; it's called a unigram 428 00:24:30,929 --> 00:24:32,850 model in natural language processing, 429 00:24:32,850 --> 00:24:35,529 and there's some other models like, sort of, 430 00:24:35,529 --> 00:24:39,190 say, higher-order Markov models that take into account some of the ordering 431 00:24:39,190 --> 00:24:44,080 of the words. It turns out that for text classification, the models 432 00:24:44,080 --> 00:24:45,260 like the 433 00:24:45,260 --> 00:24:49,290 bigram models or trigram models, 434 00:24:49,290 --> 00:24:53,360 I believe they do only very slightly better, if at all, 435 00:24:53,360 --> 00:25:00,360 but that's when you're applying them to text classification, okay? All right. 436 00:25:03,550 --> 00:25:07,190 So the next thing I want to talk about is to start again the discussion of 437 00:25:07,190 --> 00:25:11,560 non-linear classifiers. So 438 00:25:11,560 --> 00:25:18,560 it turns out 439 00:25:21,670 --> 00:25:27,590 - well, and so the very first 440 00:25:27,590 --> 00:25:31,220 classification algorithm we talked about was logistic regression, which 441 00:25:31,220 --> 00:25:33,419 had the following form for 442 00:25:33,419 --> 00:25:36,340 its hypothesis, 443 00:25:36,340 --> 00:25:37,520 and 444 00:25:37,520 --> 00:25:39,220 you can think of this as 445 00:25:39,220 --> 00:25:41,190 predicting one when 446 00:25:41,190 --> 00:25:44,420 this estimated probability is greater or equal to 0.5 and 447 00:25:44,420 --> 00:25:46,180 predicting zero, 448 00:25:46,180 --> 00:25:46,839 right, 449 00:25:46,839 --> 00:25:50,050 when this is less than 0.5, 450 00:25:50,050 --> 00:25:52,760 and given a training set, right? 
451 00:25:52,760 --> 00:25:59,760 Logistic regression 452 00:26:02,549 --> 00:26:05,919 will maybe do gradient descent or something or 453 00:26:05,919 --> 00:26:07,440 use Newton's method 454 00:26:07,440 --> 00:26:09,370 to find a straight line that 455 00:26:09,370 --> 00:26:12,130 reasonably separates the positive and negative classes. 456 00:26:12,130 --> 00:26:15,830 But sometimes a data set just can't be separated by a straight line, so is there 457 00:26:15,830 --> 00:26:16,950 an algorithm 458 00:26:16,950 --> 00:26:18,230 that will let you 459 00:26:18,230 --> 00:26:25,230 start to learn these sorts of non-linear division boundaries? 460 00:26:25,540 --> 00:26:29,230 And so how do you go about getting a non-linear classifier? And, by the 461 00:26:29,230 --> 00:26:34,360 way, one cool result is that remember when I said - when we talked 462 00:26:34,360 --> 00:26:37,790 about generative learning algorithms, I said that 463 00:26:37,790 --> 00:26:41,040 if you assume X given Y equals one is 464 00:26:41,040 --> 00:26:43,710 exponential family, 465 00:26:43,710 --> 00:26:45,330 466 00:26:45,330 --> 00:26:46,669 right, with natural parameter 467 00:26:46,669 --> 00:26:50,730 eta one, and if you build a generative learning algorithm using this, right, and 468 00:26:50,730 --> 00:26:55,850 X given Y equals 469 00:26:55,850 --> 00:26:59,710 zero is exponential family 470 00:26:59,710 --> 00:27:01,750 with natural parameter eta zero, 471 00:27:01,750 --> 00:27:05,610 right - I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true, 472 00:27:05,610 --> 00:27:06,210 then 473 00:27:06,210 --> 00:27:08,610 you end up with a logistic posterior. It 474 00:27:08,610 --> 00:27:11,410 actually turns out that a Naive Bayes model 475 00:27:11,410 --> 00:27:12,629 actually falls into this 476 00:27:12,629 --> 00:27:17,440 as well. 
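The decision rule the lecture just described can be sketched in a few lines. This is a minimal sketch only; the particular feature vector and parameters used below are illustrative assumptions, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    # the logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # hypothesis: estimated probability that y = 1 given x
    return sigmoid(theta @ x)

def predict(theta, x):
    # predict 1 iff h(x) >= 0.5, which happens iff theta^T x >= 0
    return 1 if theta @ x >= 0 else 0
```

Note that the 0.5 threshold on the probability and the zero threshold on theta transpose X are the same rule, since the sigmoid crosses 0.5 exactly at z = 0.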
So the Naive Bayes model actually falls into this exponential family as well, and, 477 00:27:17,440 --> 00:27:19,370 therefore, 478 00:27:19,370 --> 00:27:23,000 under the Naive Bayes model, you're actually 479 00:27:23,000 --> 00:27:27,080 using a linear classifier as well, okay? 480 00:27:27,080 --> 00:27:29,020 So the question is 481 00:27:29,020 --> 00:27:32,480 how can you start to get non-linear classifiers? 482 00:27:32,480 --> 00:27:35,649 And I'm going to talk about one method today 483 00:27:35,649 --> 00:27:41,840 which is - and 484 00:27:41,840 --> 00:27:45,300 we started to talk about it very briefly which is 485 00:27:45,300 --> 00:27:46,650 taking a 486 00:27:46,650 --> 00:27:48,990 simpler algorithm like 487 00:27:48,990 --> 00:27:50,720 logistic regression 488 00:27:50,720 --> 00:27:57,720 and using it to build up to more complex non-linear classifiers, okay? So 489 00:27:58,390 --> 00:28:00,440 to motivate this discussion, 490 00:28:00,440 --> 00:28:05,340 I'm going to use the little picture - let's see. So suppose you have features X1, 491 00:28:05,340 --> 00:28:07,340 X2, and X3, 492 00:28:07,340 --> 00:28:10,320 and so by convention, I'm gonna follow 493 00:28:10,320 --> 00:28:13,460 our earlier convention that X0 is set to one, 494 00:28:13,460 --> 00:28:17,010 and so I'm gonna use a little diagram like this 495 00:28:17,010 --> 00:28:18,090 to denote 496 00:28:18,090 --> 00:28:19,690 our 497 00:28:19,690 --> 00:28:24,320 logistic regression unit, okay? 
So 498 00:28:24,320 --> 00:28:28,609 think of a little picture like that, you know, this little circle as denoting a 499 00:28:28,609 --> 00:28:29,419 computational node 500 00:28:29,419 --> 00:28:32,630 that takes as input, you know, several features and then outputs another 501 00:28:32,630 --> 00:28:36,090 number that's H subscript theta of X, given by 502 00:28:36,090 --> 00:28:37,800 a sigmoid function, 503 00:28:37,800 --> 00:28:39,930 and so this little computational unit - 504 00:28:39,930 --> 00:28:43,460 well, will have parameters theta. 505 00:28:43,460 --> 00:28:47,700 Now, in order to get non-linear division boundaries, all we need to do - 506 00:28:47,700 --> 00:28:50,850 well, at least one thing to do is just 507 00:28:50,850 --> 00:28:52,780 come up with 508 00:28:52,780 --> 00:28:54,390 509 00:28:54,390 --> 00:28:57,020 a way to represent hypotheses 510 00:28:57,020 --> 00:29:00,640 that can output non-linear division boundaries, right, 511 00:29:00,640 --> 00:29:02,700 and so 512 00:29:02,700 --> 00:29:07,490 this is - 513 00:29:07,490 --> 00:29:09,750 when you put a bunch of those little pictures 514 00:29:09,750 --> 00:29:11,390 that I drew 515 00:29:11,390 --> 00:29:12,790 on the previous board, 516 00:29:12,790 --> 00:29:14,169 you can then get 517 00:29:14,169 --> 00:29:18,549 what's called 518 00:29:18,549 --> 00:29:20,780 a Neural Network in which you 519 00:29:20,780 --> 00:29:23,710 think of having my features here 520 00:29:23,710 --> 00:29:30,710 and then I would feed them to 521 00:29:31,019 --> 00:29:34,530 say a few of these little 522 00:29:34,530 --> 00:29:38,940 sigmoidal units, 523 00:29:38,940 --> 00:29:43,559 and these together will feed into yet another sigmoidal unit, say, 524 00:29:43,559 --> 00:29:46,270 which will output 525 00:29:46,270 --> 00:29:51,510 my final output H subscript theta of X, okay? 
And 526 00:29:51,510 --> 00:29:54,500 just to give these things names, let me call 527 00:29:54,500 --> 00:29:58,380 the values output by these three intermediate sigmoidal units; let me call them A1, 528 00:29:58,380 --> 00:30:01,360 A2, A3. 529 00:30:01,360 --> 00:30:02,580 And let me just be 530 00:30:02,580 --> 00:30:05,679 completely concrete about what this formula represents, right? So 531 00:30:05,679 --> 00:30:06,570 532 00:30:06,570 --> 00:30:08,309 each of these 533 00:30:08,309 --> 00:30:12,080 units in the middle will have their own associated set of parameters, 534 00:30:12,080 --> 00:30:15,820 and so the value A1 will be computed as 535 00:30:15,820 --> 00:30:17,940 G 536 00:30:17,940 --> 00:30:21,440 of X transpose, 537 00:30:21,440 --> 00:30:27,440 and then some set of parameters, which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta 538 00:30:27,440 --> 00:30:29,090 two, 539 00:30:29,090 --> 00:30:30,980 and A3 will be 540 00:30:30,980 --> 00:30:34,490 G of X transpose, 541 00:30:34,490 --> 00:30:35,700 theta three, 542 00:30:35,700 --> 00:30:42,700 where G is the sigmoid function, all right? So G of Z, 543 00:30:43,640 --> 00:30:49,850 and then, finally, our hypothesis will output 544 00:30:49,850 --> 00:30:52,360 G of 545 00:30:52,360 --> 00:30:53,950 A 546 00:30:53,950 --> 00:30:56,620 transpose 547 00:30:56,620 --> 00:30:59,200 theta four, right? Where, you 548 00:30:59,200 --> 00:31:01,320 know, this A vector 549 00:31:01,320 --> 00:31:04,390 is a vector of A1, 550 00:31:04,390 --> 00:31:05,780 A2, A3. 551 00:31:05,780 --> 00:31:08,220 We can append another one to it up front 552 00:31:08,220 --> 00:31:11,470 if you want, okay? Let 553 00:31:11,470 --> 00:31:14,270 me 554 00:31:14,270 --> 00:31:16,210 just draw up here this - I'm 555 00:31:16,210 --> 00:31:18,470 sorry about the cluttered board. 
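The forward computation just spelled out on the board can be written down directly. This is a sketch in the lecture's notation, assuming three hidden units; the array shapes are my assumptions.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta_hidden, theta4):
    # x: input features, with the convention x0 = 1 already included
    # Theta_hidden: rows are theta_1, theta_2, theta_3, one per hidden unit
    a = sigmoid(Theta_hidden @ x)   # a_j = g(theta_j^T x) for each hidden unit
    return sigmoid(theta4 @ a)      # h_theta(x) = g(theta_4^T a)
```

Each hidden unit is just a logistic regression unit with its own parameter vector, and the output unit is another one stacked on top of the hidden units' outputs.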
556 00:31:18,470 --> 00:31:20,950 And so H subscript theta of X, 557 00:31:20,950 --> 00:31:23,850 this is a function 558 00:31:23,850 --> 00:31:29,300 of all the parameters theta one 559 00:31:29,300 --> 00:31:32,420 through theta four, 560 00:31:32,420 --> 00:31:36,680 and so one way to learn parameters for this model 561 00:31:36,680 --> 00:31:37,650 is to 562 00:31:37,650 --> 00:31:39,910 write down the cost function, 563 00:31:39,910 --> 00:31:41,180 say, J of theta 564 00:31:41,180 --> 00:31:46,040 equals one-half sum from I equals one to 565 00:31:46,040 --> 00:31:49,730 M, YI minus H subscript theta of 566 00:31:49,730 --> 00:31:51,770 XI 567 00:31:51,770 --> 00:31:54,450 squared, say. Okay, 568 00:31:54,450 --> 00:31:56,070 so that's our familiar 569 00:31:56,070 --> 00:31:58,280 quadratic cost function, 570 00:31:58,280 --> 00:31:59,980 and so 571 00:31:59,980 --> 00:32:03,990 one way to learn the parameters of an algorithm like this is to just use gradient 572 00:32:03,990 --> 00:32:04,840 descent 573 00:32:04,840 --> 00:32:08,809 to minimize J of theta as a function of theta, okay? So you use 574 00:32:08,809 --> 00:32:10,940 gradient descent 575 00:32:10,940 --> 00:32:14,610 to minimize this squared error, which 576 00:32:14,610 --> 00:32:18,050 stated differently means you use gradient descent to make the predictions of your neural 577 00:32:18,050 --> 00:32:20,330 network as close as possible 578 00:32:20,330 --> 00:32:27,330 to what you observed as the labels in your training 579 00:32:30,550 --> 00:32:34,520 set, okay? 
So it turns out gradient descent on this neural network has a specific name; the 580 00:32:34,520 --> 00:32:35,059 algorithm 581 00:32:35,059 --> 00:32:38,789 that implements gradient descent is called back propagation, and so if you ever hear that, all that 582 00:32:38,789 --> 00:32:41,240 means is - it just means gradient descent on 583 00:32:41,240 --> 00:32:45,170 a cost function like this or a variation of this on the neural network 584 00:32:45,170 --> 00:32:47,909 that looks like that, 585 00:32:47,909 --> 00:32:50,220 and - 586 00:32:50,220 --> 00:32:54,360 well, this algorithm actually has some advantages and disadvantages, but let me 587 00:32:54,360 --> 00:32:57,830 actually show you. So, let's see. 588 00:32:57,830 --> 00:33:00,729 One of the interesting things about the neural network is that 589 00:33:00,729 --> 00:33:05,430 you can look at what these intermediate nodes are computing, right? So 590 00:33:05,430 --> 00:33:08,270 this neural network has what's called 591 00:33:08,270 --> 00:33:11,610 a hidden layer before you then have the output layer, 592 00:33:11,610 --> 00:33:14,410 and, more generally, you can actually have inputs feed 593 00:33:14,410 --> 00:33:16,279 into these 594 00:33:16,279 --> 00:33:19,690 computation units, feed into more layers of computation units, to even more layers, to 595 00:33:19,690 --> 00:33:23,370 more layers, and then finally you have an output layer at the end. 596 00:33:23,370 --> 00:33:26,970 And one cool thing you can do is look at all of these intermediate units, look at 597 00:33:26,970 --> 00:33:28,280 these 598 00:33:28,280 --> 00:33:30,980 units in what's called a hidden layer of the neural network. Don't 599 00:33:30,980 --> 00:33:32,580 worry about why it's called that. Look 600 00:33:32,580 --> 00:33:34,489 at computations of the hidden unit 601 00:33:34,489 --> 00:33:39,570 and ask what is the hidden unit in the neural network computing? 
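As a concrete, if inefficient, stand-in for back propagation, the same quadratic cost can be minimized with numerically estimated gradients; real back propagation computes exactly these gradients analytically via the chain rule. The network size, learning rate, and data below are illustrative assumptions, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    # hidden activations a = g(W1 x), output h_theta(x) = g(w2^T a)
    return sigmoid(w2 @ sigmoid(W1 @ x))

def cost(W1, w2, X, y):
    # J(theta) = 1/2 * sum_i (y_i - h_theta(x_i))^2
    return 0.5 * sum((yi - forward(xi, W1, w2)) ** 2 for xi, yi in zip(X, y))

def descend(W1, w2, X, y, lr=0.3, eps=1e-5, steps=100):
    # gradient descent on J(theta) using central-difference gradients;
    # back propagation would produce the same update more efficiently
    for _ in range(steps):
        for P in (W1, w2):          # update each parameter array in place
            G = np.zeros_like(P)
            for idx in np.ndindex(P.shape):
                old = P[idx]
                P[idx] = old + eps
                hi = cost(W1, w2, X, y)
                P[idx] = old - eps
                lo = cost(W1, w2, X, y)
                P[idx] = old
                G[idx] = (hi - lo) / (2 * eps)
            P -= lr * G
    return W1, w2
```

Because the cost is non-convex in the parameters, this procedure finds a local optimum, which foreshadows the point made below about neural network training.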
So 602 00:33:39,570 --> 00:33:42,900 to, maybe, get a better sense of what neural networks might be doing, let me show you a 603 00:33:42,900 --> 00:33:44,580 video - I'm gonna switch to the laptop - this is made 604 00:33:44,580 --> 00:33:46,250 by a 605 00:33:46,250 --> 00:33:49,360 friend, Yann LeCun, 606 00:33:49,360 --> 00:33:52,420 who's currently a professor at New York University. 607 00:33:52,420 --> 00:33:56,289 Can I show a video on the laptop? 608 00:33:56,289 --> 00:33:58,600 So let me show you a video 609 00:33:58,600 --> 00:34:00,440 from Yann 610 00:34:00,440 --> 00:34:06,330 LeCun on a neural network that he developed for handwritten digit recognition. 611 00:34:06,330 --> 00:34:10,110 There was one other thing he did in this neural network that I'm not gonna talk about called 612 00:34:10,110 --> 00:34:12,389 a Convolutional Neural Network 613 00:34:12,389 --> 00:34:16,009 that - well, 614 00:34:16,009 --> 00:34:18,809 his system is called LeNet, 615 00:34:18,809 --> 00:34:22,569 and let's see. Would you put 616 00:34:22,569 --> 00:34:23,360 on the 617 00:34:23,360 --> 00:34:30,360 laptop display? Hum, 618 00:34:37,779 --> 00:34:38,789 actually maybe if - 619 00:34:38,789 --> 00:34:42,339 or you can just put on the screen on the side; that would work too 620 00:34:42,339 --> 00:34:49,339 if the big screen isn't working. 621 00:35:12,249 --> 00:35:13,729 Let's see. I'm 622 00:35:13,729 --> 00:35:19,299 just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come 623 00:35:19,299 --> 00:35:24,209 on? Well, let me say a few more things about neural networks. 
624 00:35:24,209 --> 00:35:25,529 So it turns out that 625 00:35:25,529 --> 00:35:29,039 when you write a quadratic cost function like I 626 00:35:29,039 --> 00:35:31,689 wrote down on the chalkboard just now, 627 00:35:31,689 --> 00:35:34,639 it turns out that unlike logistic regression, 628 00:35:34,639 --> 00:35:39,029 that will almost always correspond to a non-convex optimization problem, 629 00:35:39,029 --> 00:35:42,709 and so whereas for logistic regression if you run gradient descent 630 00:35:42,709 --> 00:35:44,280 or Newton's method or whatever, 631 00:35:44,280 --> 00:35:46,000 you converge to the global optimum. 632 00:35:46,000 --> 00:35:50,500 This is not true for neural networks. In general, there are lots of local optima 633 00:35:50,500 --> 00:35:55,209 and, sort of, a much harder optimization problem. 634 00:35:55,209 --> 00:35:56,159 So 635 00:35:56,159 --> 00:35:59,699 neural networks, if you're, sort of, familiar with them, and you're good at making design choices 636 00:35:59,699 --> 00:36:01,599 like what learning rate to use, and 637 00:36:01,599 --> 00:36:04,319 how many hidden units to use, and so on, you can, 638 00:36:04,319 --> 00:36:07,700 sort of, get them to be fairly effective, 639 00:36:07,700 --> 00:36:09,109 and 640 00:36:09,109 --> 00:36:12,910 there's, sort of, often ongoing debates about, you know, is this learning algorithm 641 00:36:12,910 --> 00:36:13,819 better, or is that learning 642 00:36:13,819 --> 00:36:17,869 algorithm better? The vast majority of machine learning researchers today seem to perceive 643 00:36:17,869 --> 00:36:20,180 support vector machines, which is what I'll talk about later, 644 00:36:20,180 --> 00:36:24,349 to be a much more effective off-the-shelf learning algorithm than neural networks. 
645 00:36:24,349 --> 00:36:26,509 This point of view is contested 646 00:36:26,509 --> 00:36:32,039 a bit, but so neural networks are not something that I personally use a lot right 647 00:36:32,039 --> 00:36:35,670 now because there's a hard optimization problem and it doesn't always converge, 648 00:36:35,670 --> 00:36:39,969 and yet it actually, sort of, works. It, sort of, works reasonably 649 00:36:39,969 --> 00:36:41,739 well. It's just 650 00:36:41,739 --> 00:36:43,700 because this is fairly complicated, 651 00:36:43,700 --> 00:36:45,269 it's not an algorithm that 652 00:36:45,269 --> 00:36:52,009 I use commonly or that my friends use all 653 00:36:52,009 --> 00:36:56,019 the time. Oh, cool. But let me just go and show you an example of a neural network, which was for many years, 654 00:36:56,019 --> 00:36:56,819 655 00:36:56,819 --> 00:37:00,400 you know, the most effective learning algorithm before support vector 656 00:37:00,400 --> 00:37:02,029 machines were invented. 657 00:37:02,029 --> 00:37:06,349 So here's Yann LeCun's video, and - 658 00:37:06,349 --> 00:37:09,350 well, there's actually audio on this too, the soundboard. So I'll just tell you 659 00:37:09,350 --> 00:37:11,209 what's happening. 660 00:37:11,209 --> 00:37:11,880 661 00:37:11,880 --> 00:37:14,629 What you're seeing is a trained neural network, 662 00:37:14,629 --> 00:37:18,549 and this display where my mouse pointer is pointing, right, this big three 663 00:37:18,549 --> 00:37:19,429 there 664 00:37:19,429 --> 00:37:21,189 is 665 00:37:21,189 --> 00:37:24,799 the input to the neural network. So you're showing the neural network this image, and it's 666 00:37:24,799 --> 00:37:26,859 trying to recognize what is this. 
667 00:37:26,859 --> 00:37:30,769 The final answer output by the neural network is this number up here, right 668 00:37:30,769 --> 00:37:33,189 below where it says LeNet-5, 669 00:37:33,189 --> 00:37:34,229 and the 670 00:37:34,229 --> 00:37:37,759 neural network correctly recognizes this image as a three, 671 00:37:37,759 --> 00:37:39,890 and if you look to the left of this image, 672 00:37:39,890 --> 00:37:42,309 what's interesting about this is 673 00:37:42,309 --> 00:37:46,640 the display on the left portion of this is actually showing the 674 00:37:46,640 --> 00:37:49,999 intermediate computations of the neural network. In other words, it's showing you 675 00:37:49,999 --> 00:37:51,900 what are the hidden layers of the 676 00:37:51,900 --> 00:37:53,829 neural network computing. 677 00:37:53,829 --> 00:37:57,529 And so, for example, if you look at this one, the third image down from the top, 678 00:37:57,529 --> 00:38:01,569 this seems to be computing, you know, certain edges in the digits, 679 00:38:01,569 --> 00:38:05,630 right? It's just detecting edges on the 680 00:38:05,630 --> 00:38:07,489 right-hand side or the bottom or something 681 00:38:07,489 --> 00:38:09,889 of the input display 682 00:38:09,889 --> 00:38:11,479 of the input image, okay? 683 00:38:11,479 --> 00:38:15,129 So let me just play this video, and you can see 684 00:38:15,129 --> 00:38:17,019 some 685 00:38:17,019 --> 00:38:20,619 of the inputs and outputs of the neural network, and 686 00:38:20,619 --> 00:38:21,529 those 687 00:38:21,529 --> 00:38:26,309 are very different fonts. There's this robustness to 688 00:38:26,309 --> 00:38:30,670 noise. 689 00:38:30,670 --> 00:38:36,929 All 690 00:38:36,929 --> 00:38:37,880 691 00:38:37,880 --> 00:38:41,949 692 00:38:41,949 --> 00:38:43,000 right. Multiple 693 00:38:43,000 --> 00:38:49,089 digits, 694 00:38:49,089 --> 00:38:50,739 695 00:38:50,739 --> 00:38:57,739 that's, kind of, cool. All 696 00:38:59,259 --> 00:39:06,259 right. 
697 00:39:12,140 --> 00:39:19,140 698 00:39:30,779 --> 00:39:35,789 699 00:39:35,789 --> 00:39:38,689 700 00:39:38,689 --> 00:39:40,459 701 00:39:40,459 --> 00:39:41,660 702 00:39:41,660 --> 00:39:48,660 703 00:39:52,359 --> 00:39:54,809 So, 704 00:39:54,809 --> 00:39:58,219 just for fun, let me show you one more video, 705 00:39:58,219 --> 00:40:00,579 which was - let's 706 00:40:00,579 --> 00:40:06,089 see. This is another video from the video series, The Machine That Changed the World, which 707 00:40:06,089 --> 00:40:07,490 was produced by WGBH 708 00:40:07,490 --> 00:40:09,059 Television 709 00:40:09,059 --> 00:40:10,800 in 710 00:40:10,800 --> 00:40:12,599 cooperation with the British Broadcasting Corporation, 711 00:40:12,599 --> 00:40:16,159 and it was aired on PBS a few years ago, I think. 712 00:40:16,159 --> 00:40:17,630 I want to show you a 713 00:40:17,630 --> 00:40:19,589 video describing the NETtalk 714 00:40:19,589 --> 00:40:23,629 Neural Network, which was developed by Terry Sejnowski; he's a 715 00:40:23,629 --> 00:40:25,739 researcher. 716 00:40:25,739 --> 00:40:29,589 And so NETtalk was actually one of the major milestones in the 717 00:40:29,589 --> 00:40:31,749 history of neural networks, 718 00:40:31,749 --> 00:40:33,490 and this specific application 719 00:40:33,490 --> 00:40:36,399 is getting the neural network to read text. 720 00:40:36,399 --> 00:40:38,779 So, in other words, can you show a 721 00:40:38,779 --> 00:40:41,659 piece of English to a computer 722 00:40:41,659 --> 00:40:43,419 and have the computer read, 723 00:40:43,419 --> 00:40:46,410 sort of, verbally produce sounds that correspond 724 00:40:46,410 --> 00:40:48,319 to the 725 00:40:48,319 --> 00:40:50,629 reading of the text. 
726 00:40:50,629 --> 00:40:55,279 And it turns out that in the history of AI and the history of machine learning, 727 00:40:55,279 --> 00:40:56,509 this video 728 00:40:56,509 --> 00:41:00,849 created a lot of excitement about neural networks and about machine learning. Part of the 729 00:41:00,849 --> 00:41:05,429 reason was that Terry Sejnowski had the foresight to choose 730 00:41:05,429 --> 00:41:06,699 to use, 731 00:41:06,699 --> 00:41:08,040 in his video, 732 00:41:08,040 --> 00:41:12,940 a child-like voice talking about visiting your grandmother's house and so on. You'll 733 00:41:12,940 --> 00:41:14,289 see it in a second, 734 00:41:14,289 --> 00:41:16,029 and so 735 00:41:16,029 --> 00:41:19,399 this really created the perception of - created 736 00:41:19,399 --> 00:41:23,129 the impression of the neural network being like a young child learning how to speak, 737 00:41:23,129 --> 00:41:25,260 and talking about going to your grandmother's, and so 738 00:41:25,260 --> 00:41:27,379 on. So this actually 739 00:41:27,379 --> 00:41:29,289 helped generate a lot of excitement 740 00:41:29,289 --> 00:41:32,449 within academia and outside academia on neural networks, sort of, early in the history 741 00:41:32,449 --> 00:41:36,109 of neural networks. I'm just gonna show you the video. 742 00:41:36,109 --> 00:41:37,910 [Begin Video] You're going to hear first 743 00:41:37,910 --> 00:41:41,559 what the network sounds like at the very beginning of the training, and it won't 744 00:41:41,559 --> 00:41:43,589 sound like words, but it'll sound 745 00:41:43,589 --> 00:41:48,009 like attempts that will get better and better with 746 00:41:48,009 --> 00:41:54,239 time. [Computer's voice] The network 747 00:41:54,239 --> 00:41:58,249 takes the letters, say the phrase, grandmother's house, and 748 00:41:58,249 --> 00:42:02,679 makes a random attempt at pronouncing it. [Computer's 749 00:42:02,679 --> 00:42:07,219 voice] 750 00:42:07,219 --> 00:42:09,269 Grandmother's house. 
The 751 00:42:09,269 --> 00:42:12,530 phonetic difference between the guess and the right pronunciation 752 00:42:12,530 --> 00:42:16,539 is sent back through the network. [Computer's 753 00:42:16,539 --> 00:42:18,989 voice] 754 00:42:18,989 --> 00:42:20,329 Grandmother's house. 755 00:42:20,329 --> 00:42:23,920 By adjusting the connection strengths after each attempt, 756 00:42:23,920 --> 00:42:26,689 the net slowly improves. And, 757 00:42:26,689 --> 00:42:30,029 finally, after letting it train overnight, 758 00:42:30,029 --> 00:42:32,700 the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's 759 00:42:32,700 --> 00:42:37,759 house. 760 00:42:37,759 --> 00:42:41,149 Well, because she gives us candy. 761 00:42:41,149 --> 00:42:46,789 Well, and we - NETtalk understands nothing about the language. It is simply associating letters 762 00:42:46,789 --> 00:42:49,809 with sounds. 763 00:42:49,809 --> 00:42:52,359 [End Video] All right. So 764 00:42:52,359 --> 00:42:54,380 at the time this was done, I mean, this is 765 00:42:54,380 --> 00:42:57,379 an amazing piece of work. I should say 766 00:42:57,379 --> 00:42:59,140 today there are other 767 00:42:59,140 --> 00:43:03,389 text to speech systems that work better than what you just saw, 768 00:43:03,389 --> 00:43:08,359 and you'll also appreciate getting candy from your grandmother's house is a little bit 769 00:43:08,359 --> 00:43:11,449 less impressive than talking about the Dow Jones falling 15 points, 770 00:43:11,449 --> 00:43:16,149 and profit taking, whatever. So 771 00:43:16,149 --> 00:43:18,789 but I wanted to show that just because that was another cool, 772 00:43:18,789 --> 00:43:25,789 major landmark in the history of neural networks. Okay. 
773 00:43:27,259 --> 00:43:32,949 So let's switch back to the chalkboard, 774 00:43:32,949 --> 00:43:35,859 and what I want to do next 775 00:43:35,859 --> 00:43:39,479 is 776 00:43:39,479 --> 00:43:42,369 tell you about Support Vector Machines, 777 00:43:42,369 --> 00:43:49,369 okay? That, sort of, wraps up our discussion on neural 778 00:44:08,839 --> 00:44:10,850 networks. So I started off talking about neural networks 779 00:44:10,850 --> 00:44:13,269 by motivating it as a way to get 780 00:44:13,269 --> 00:44:17,920 us to output non-linear classifiers, right? I didn't really prove it, but it turns out 781 00:44:17,920 --> 00:44:18,419 that 782 00:44:18,419 --> 00:44:21,799 you'd be able to come up with non-linear division boundaries using a neural 783 00:44:21,799 --> 00:44:22,479 network 784 00:44:22,479 --> 00:44:27,309 like what I drew on the chalkboard earlier. 785 00:44:27,309 --> 00:44:30,469 Support Vector Machines will be another learning algorithm that will give us a way 786 00:44:30,469 --> 00:44:34,529 to come up with non-linear classifiers. It's a very effective, off-the-shelf 787 00:44:34,529 --> 00:44:36,019 learning algorithm, 788 00:44:36,019 --> 00:44:39,399 but it turns out that in the discussion I'm gonna - in 789 00:44:39,399 --> 00:44:43,379 the progression and development I'm gonna pursue, I'm actually going to start 790 00:44:43,379 --> 00:44:47,799 off by describing yet another class of linear classifiers with linear division 791 00:44:47,799 --> 00:44:49,269 boundaries, 792 00:44:49,269 --> 00:44:54,179 and it'll only be later, sort of, in probably the next lecture or the one after that, 793 00:44:54,179 --> 00:44:55,249 that we'll then 794 00:44:55,249 --> 00:44:58,699 take the support vector machine idea and, sort of, do some clever things to it 795 00:44:58,699 --> 00:45:02,959 to make it work very well to generate non-linear division boundaries as well, okay? 
796 00:45:02,959 --> 00:45:07,859 But we'll actually start by talking about linear classifiers a little bit more. 797 00:45:07,859 --> 00:45:12,619 And to do that, I want to 798 00:45:12,619 --> 00:45:14,559 convey two 799 00:45:14,559 --> 00:45:16,689 intuitions about classification. 800 00:45:16,689 --> 00:45:20,700 One is 801 00:45:20,700 --> 00:45:24,450 you think about logistic regression; we have this logistic function that was outputting 802 00:45:24,450 --> 00:45:28,669 the probability that Y equals one, 803 00:45:28,669 --> 00:45:31,049 and it crosses 804 00:45:31,049 --> 00:45:32,959 this line at zero. 805 00:45:32,959 --> 00:45:37,819 So when you run logistic regression, I want you to think of it as 806 00:45:37,819 --> 00:45:41,179 an algorithm that computes 807 00:45:41,179 --> 00:45:43,239 theta transpose X, 808 00:45:43,239 --> 00:45:46,789 and then it predicts 809 00:45:46,789 --> 00:45:51,170 one, right, 810 00:45:51,170 --> 00:45:56,359 if and only if, theta transpose X is greater than zero, right? IFF stands for if 811 00:45:56,359 --> 00:45:59,539 and only if. It means the same thing as a double implication, 812 00:45:59,539 --> 00:46:04,629 and it predicts zero, 813 00:46:04,629 --> 00:46:11,629 if and only if, theta transpose X is less than zero, okay? So if 814 00:46:12,499 --> 00:46:15,019 it's the case that 815 00:46:15,019 --> 00:46:18,069 theta transpose X is much greater than zero, 816 00:46:18,069 --> 00:46:22,309 the double greater than sign means these are much greater than, all right. 817 00:46:22,309 --> 00:46:25,589 So if theta transpose X is much greater than zero, 818 00:46:25,589 --> 00:46:26,739 then, 819 00:46:26,739 --> 00:46:33,069 you know, think of that as a very confident 820 00:46:33,069 --> 00:46:38,779 prediction 821 00:46:38,779 --> 00:46:40,779 that Y is equal to one, 822 00:46:40,779 --> 00:46:43,459 right? 
If theta transpose X is much greater than zero, then we're 823 00:46:43,459 --> 00:46:47,049 gonna predict one, and moreover we're very confident it's one, and the picture for 824 00:46:47,049 --> 00:46:48,479 that is if theta 825 00:46:48,479 --> 00:46:51,179 transpose X is way out here, then 826 00:46:51,179 --> 00:46:54,329 we're estimating that the probability of Y being equal to one 827 00:46:54,329 --> 00:46:56,089 on the sigmoid function, it will 828 00:46:56,089 --> 00:46:58,699 be very close to one. 829 00:46:58,699 --> 00:47:02,779 And, in the same way, if theta transpose X 830 00:47:02,779 --> 00:47:05,190 is much less than zero, 831 00:47:05,190 --> 00:47:12,190 then we're very confident that 832 00:47:12,809 --> 00:47:15,849 Y is equal to zero. 833 00:47:15,849 --> 00:47:19,599 So 834 00:47:19,599 --> 00:47:23,859 wouldn't it be nice - so when we fit logistic regression or some other 835 00:47:23,859 --> 00:47:25,309 classifier to our training set, 836 00:47:25,309 --> 00:47:29,569 then wouldn't it be nice if, 837 00:47:29,569 --> 00:47:30,430 right, 838 00:47:30,430 --> 00:47:34,349 for all I 839 00:47:34,349 --> 00:47:38,739 such that Y is equal to one. 840 00:47:38,739 --> 00:47:39,699 We have 841 00:47:39,699 --> 00:47:41,279 theta 842 00:47:41,279 --> 00:47:45,009 transpose XI is much greater than zero, 843 00:47:45,009 --> 00:47:47,929 and for all I such that Y is equal 844 00:47:47,929 --> 00:47:49,869 to 845 00:47:49,869 --> 00:47:56,869 zero, 846 00:47:57,589 --> 00:47:59,910 we have theta transpose XI is 847 00:47:59,910 --> 00:48:00,970 much less than zero, 848 00:48:00,970 --> 00:48:05,659 okay? So wouldn't it be nice if this is true? 
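Numerically, this first intuition looks like the following: the farther theta transpose X is from zero, the closer the sigmoid pushes the estimated probability to 0 or 1. The sample values of theta transpose X below are arbitrary.

```python
import math

def sigmoid(z):
    # logistic function: maps theta^T x to an estimated P(y = 1 | x)
    return 1.0 / (1.0 + math.exp(-z))

# theta^T x >> 0  -> very confident prediction that y = 1
# theta^T x << 0  -> very confident prediction that y = 0
for z in (-6.0, -0.5, 0.5, 6.0):
    print(f"theta'x = {z:5.1f}  ->  estimated P(y=1|x) = {sigmoid(z):.4f}")
```

For instance, at theta transpose X = 6 the estimated probability is already above 0.99, while at 0.5 it is barely past the decision threshold.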
That, 849 00:48:05,659 --> 00:48:06,380 850 00:48:06,380 --> 00:48:09,929 essentially, if our training set, we can find parameters theta 851 00:48:09,929 --> 00:48:12,749 so that our learning algorithm not only 852 00:48:12,749 --> 00:48:16,839 makes correct classifications on all the examples in a training set, but further it's, sort 853 00:48:16,839 --> 00:48:19,880 of, is very confident about all of those correct 854 00:48:19,880 --> 00:48:22,299 classifications. This is 855 00:48:22,299 --> 00:48:26,859 the first intuition that I want you to have, and we'll come back to this first intuition 856 00:48:26,859 --> 00:48:28,889 in a second 857 00:48:28,889 --> 00:48:33,209 when we talk about functional margins, okay? 858 00:48:33,209 --> 00:48:40,209 We'll define this later. The second 859 00:48:43,539 --> 00:48:48,119 intuition that I want to 860 00:48:48,119 --> 00:48:55,119 convey, 861 00:48:56,769 --> 00:49:00,209 and it turns out for the rest of today's lecture I'm going to assume 862 00:49:00,209 --> 00:49:04,049 that a training set is linearly separable, okay? So by that I mean for the rest of 863 00:49:04,049 --> 00:49:05,619 864 00:49:05,619 --> 00:49:08,690 today's lecture, I'm going to assume that 865 00:49:08,690 --> 00:49:10,900 there is indeed a straight line that can separate 866 00:49:10,900 --> 00:49:15,030 your training set, and we'll remove this assumption later, but 867 00:49:15,030 --> 00:49:17,089 just to develop the algorithm, let's assume a 868 00:49:17,089 --> 00:49:19,409 linearly separable training set. 
And so there's a sense in which, out of all the straight lines that separate the training set, you know, maybe that straight line isn't such a good one, and that one actually isn't such a great one either, but maybe that line in the middle is a much better linear separator than the others, right? And one reason that, when you and I look at it, this one seems best is because this line is just further from the data, all right? That is, it separates the data with a greater distance between your positive examples, your negative examples, and the decision boundary, okay? And this second intuition - we'll come back to this shortly - about this final line that I drew being, maybe, the best line, this notion of distance from the training examples, is the second intuition I want to convey, and we'll formalize it later when we talk about geometric margins of our classifiers, okay?
So in order to describe the support vector machine, unfortunately, I'm gonna have to pull a notation change, and, sort of, unfortunately, it was impossible to do logistic regression, and support vector machines, and all the other algorithms using one completely consistent notation, and so I'm actually gonna change notation slightly for linear classifiers, and that will actually make it much easier for us later today and in next week's lectures to talk about the support vector machine. The notation that I'm gonna use for the rest of today and for most of next week will be that my labels Y, instead of being zero and one, will be minus one and plus one, and in the development of the support vector machine we will have H, the hypothesis, output values that are either plus one or minus one, and so we'll let G of Z be equal to one if Z is greater than or equal to zero, and minus one otherwise, right? So just rather than zero and one, we change everything to plus one and minus one.
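As a tiny sketch of the new convention (Python, with hypothetical input values), the threshold function now outputs plus one or minus one rather than one or zero:

```python
def g(z):
    """SVM label convention: +1 if z >= 0, -1 otherwise."""
    return 1 if z >= 0 else -1

# Note that g(0) = +1, by the "greater than or equal to zero" clause.
labels = [g(2.7), g(0.0), g(-0.3)]
```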
And, finally, whereas previously I wrote H subscript theta of X equals G of theta transpose X, and we had the convention that X0 is equal to one, right - and so X was in RN plus one - I'm gonna drop this convention of letting X0 equal one and letting X be in RN plus one, and instead I'm going to parameterize my linear classifier as H subscript W, B of X equals G of W transpose X plus B, okay? And so B now just plays the role of theta zero, and W now plays the role of the rest of the parameters, theta one through theta N, okay? So just by separating out the intercept term B, rather than lumping it together, it'll make it easier for us to develop support vector machines. So - yes. Student:[Off mic]. Instructor (Andrew Ng): Oh, yes. Right, yes. So W is - right. So W is a vector in RN, and X is now a vector in RN rather than RN plus one, and lowercase B is a real number. Okay. Now, let's formalize the notions of functional margin and geometric margin.
Let me make a definition. I'm going to say that the functional margin of the hyperplane W, B with respect to a specific training example (XI, YI) - WRT stands for with respect to - the functional margin of a hyperplane W, B with respect to a certain training example (XI, YI) is defined as gamma hat I equals YI times (W transpose XI plus B), okay? And so a set of parameters W, B defines a classifier - it, sort of, defines a linear separating boundary - and so when I say hyperplane, I just mean the decision boundary that's defined by the parameters W, B. You know what, if you're confused by the hyperplane term, just ignore it. The functional margin of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And the interpretation of this is that if YI is equal to one, then for the example to have a large functional margin, you want W transpose XI plus B to be large, right?
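The definition on the board can be sketched directly in code - a minimal Python/NumPy version, with made-up W, B, and example values just to illustrate:

```python
import numpy as np

def functional_margin(w, b, x, y):
    """gamma_hat_i = y_i * (w^T x_i + b), with labels y in {-1, +1}."""
    return y * (np.dot(w, x) + b)

w, b = np.array([1.0, 2.0]), -1.0
# Correctly classified positive example: w^T x + b = 1 + 4 - 1 = 4, margin 4.
pos = functional_margin(w, b, np.array([1.0, 2.0]), +1)
# Correctly classified negative example: w^T x + b = -1 - 2 - 1 = -4, margin 4.
neg = functional_margin(w, b, np.array([-1.0, -1.0]), -1)
```

In both cases a correct, confident classification gives a large positive margin, which is the point of folding YI into the definition.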
And if YI is equal to minus one, then in order for the functional margin to be large - we, sort of, want the functional margins to be large, but in order for the functional margin to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay? So this captures the intuition that we had earlier about functional margins - the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small - and this, sort of, collapses the two cases into one statement: that we'd like the functional margin to be large. And notice also that so long as YI times (W transpose XI plus B) - so long as this is greater than zero - that means we classified it correctly, okay?
And one more definition: I'm going to say that the functional margin of a hyperplane with respect to an entire training set - I'm going to define gamma hat to be equal to the min, over all your training examples, of gamma hat I, right? So if you have a training set, if you have more than one training example, I'm going to define the functional margin with respect to the entire training set as the worst case of all of the functional margins over the entire training set. And so for now we have this first intuition of saying that we would like the functional margin to be large, and for our purposes, for now, let's just say we would like the worst-case functional margin to be large, okay? And we'll change this a little bit later as well. Now, it turns out that there's one little problem with this intuition, which will, sort of, trip us up later, which is that it actually turns out to be very easy to make the functional margin large, all right? So, for example, say I have a classifier with parameters W and B.
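The worst-case definition is a one-liner over the training set - a sketch in Python/NumPy with a toy training set (all values invented for illustration):

```python
import numpy as np

def functional_margin_set(w, b, X, y):
    """gamma_hat = min_i y_i (w^T x_i + b): the worst case over the whole set."""
    return np.min(y * (X @ w + b))

w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0], [-1.0, 5.0], [0.5, -3.0]])
y = np.array([+1, -1, +1])
# Per-example margins are 2, 1, and 0.5, so the set's functional margin is 0.5.
worst = functional_margin_set(w, b, X, y)
```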
If I take W and multiply it by two and take B and multiply it by two, then if you refer to the definition of the functional margin - I guess that was, what, gamma hat I equals YI times (W transpose XI plus B) - if I double W and B, then I can easily double my functional margin. So this goal of making the functional margin large, in and of itself, isn't so useful, because it's easy to make the functional margin arbitrarily large just by scaling up the parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that the norm - the L2 norm - of the parameter W is equal to one, and we'll come back to this in a second. All right. And then so - okay. Now, let's talk about - let's see how much time we have, 15 minutes. Well, see, I'm trying to decide how much to try to do in the last 15 minutes. Okay. So let's talk about the geometric margin, and so the geometric margin of a training example - [inaudible], right?
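The scaling problem is easy to see numerically - a quick sketch with invented values:

```python
import numpy as np

w, b = np.array([1.0, -1.0]), 0.5
x, y = np.array([2.0, 1.0]), +1

gamma_hat = y * (w @ x + b)               # w^T x + b = 2 - 1 + 0.5 = 1.5
gamma_hat_2x = y * ((2 * w) @ x + 2 * b)  # doubling (w, b) doubles the margin: 3.0
```

Nothing about the classifier changed - the decision boundary is identical - yet the functional margin doubled, which is why a normalization condition is needed.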
So the decision boundary of my classifier is going to be given by the plane W transpose X plus B equals zero, okay? Right, and these are the X1, X2 axes, say, and we're going to draw relatively few training examples here - I'm drawing deliberately few training examples so that I can add things to this, okay? And so, assuming we classify an example correctly, I'm going to define the geometric margin as just the geometric distance between the training example (XI, YI) and the separating line - the separating hyperplane, okay? That's what I'm going to define the geometric margin to be. And so I'm gonna do some algebra fairly quickly; in case it doesn't make sense, read through the lecture notes more carefully for the details.
Sort of by standard geometry, the normal - in other words, the vector that's 90 degrees to the separating hyperplane - is going to be given by W divided by the norm of W; that's just how planes in high dimensions work. If some of this is unfamiliar, take a look at the lecture notes on the website. And so let's say this distance is gamma I, okay? And I'm going to use the convention that I'll put a hat on top when I'm referring to functional margins, and no hat on top for geometric margins. So let's say the geometric margin, in this example, is gamma I. That means that this point here, right, is going to be XI minus gamma I times W over the norm of W, okay? Because W over the norm of W is the unit vector - the length-one vector - that is normal to the separating hyperplane, and so we subtract gamma I times that unit vector from this point XI - where this point here is XI.
So XI minus, you know, this little vector here is going to be this point that I've drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I times W over the norm of W, okay? And so because this heavy point is on the separating hyperplane, right, this point must satisfy W transpose times that point plus B equals zero, right? Because all points X on the separating hyperplane satisfy the equation W transpose X plus B equals zero, and this point is on the separating hyperplane; therefore it must satisfy W transpose this point - oh, excuse me - plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, I'm, sort of, being slightly fast with this geometry. So if you're not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes.
And so what I'm going to do is I'll just take this equation and solve for gamma, right? So this equation I just wrote down - solve this equation for gamma, or gamma I, and you find that - well, why don't I just do it? You have W transpose XI plus B equals gamma I times W transpose W over the norm of W; that's just equal to gamma I times the norm of W, because W transpose W is the norm of W squared, and, therefore, gamma I is just (W over the norm of W) transpose XI, plus B over the norm of W, okay? And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyperplane defined by the parameters W and B can be computed by this formula, okay? So the last thing I want to do is actually take into account the sign of the - the correct classification of the training example.
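That algebra can be checked numerically - a sketch in Python/NumPy with made-up W, B, and X, which also verifies that the foot of the perpendicular really lands on the hyperplane:

```python
import numpy as np

def geometric_distance(w, b, x):
    """Distance from x to the hyperplane w^T x + b = 0:
       (w/||w||)^T x + b/||w||  (positive on the side w points toward)."""
    return (w @ x + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0   # ||w|| = 5
x = np.array([3.0, 4.0])            # w^T x + b = 9 + 16 - 5 = 20
d = geometric_distance(w, b, x)     # 20 / 5 = 4.0

# The point x - d * w/||w|| should lie on the hyperplane, i.e. satisfy w^T p + b = 0.
foot = x - d * w / np.linalg.norm(w)
residual = w @ foot + b
```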
So I've been assuming that we've been classifying the example correctly. So, more generally, I'll define the geometric margin of a training example to be gamma I equals YI times that thing on top - YI times ((W over the norm of W) transpose XI, plus B over the norm of W), okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so, as before, you know, this says that we would like the geometric margin to be large, and all that means is that so long as we're classifying the example correctly, we would ideally hope for the example to be as far as possible from the separating hyperplane - so long as it's on the right side of the separating hyperplane - and that's what multiplying YI into this does. And so a couple of easy facts: one is, if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily; and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Let's see, okay.
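Both of those easy facts are one-line checks - a sketch with invented W, B, and example values:

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """gamma_i = y_i * ((w/||w||)^T x_i + b/||w||)."""
    return y * (w @ x + b) / np.linalg.norm(w)

w, b = np.array([0.0, 2.0]), -2.0   # ||w|| = 2
x, y = np.array([1.0, 3.0]), +1

geo = geometric_margin(w, b, x, y)  # (6 - 2) / 2 = 2.0
fun = y * (w @ x + b)               # functional margin: 4.0
# geometric margin = functional margin / ||w||
```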
And so one final definition: so far I've defined the geometric margin with respect to a single training example, and so, as before, I'll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right? And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and let me just write that down. The maximum margin classifier poses the following optimization problem: it says choose gamma, W, and B so as to maximize the geometric margin, subject to YI times (W transpose XI plus B) being at least gamma for every training example - well, this is just one way to write it. There are several ways to write this, and one of the things we'll do next time is - actually, I'm trying to figure out if I can do this in five minutes. I'm guessing this could be difficult.
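Written out (including the norm-of-W-equals-one normalization he imposes in a moment), the board's optimization problem is:

```latex
\begin{aligned}
\max_{\gamma,\,w,\,b} \quad & \gamma \\
\text{s.t.} \quad & y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \gamma, \quad i = 1,\dots,m, \\
& \|w\| = 1 .
\end{aligned}
```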
Well, so this maximum margin classifier is a maximization problem over parameters gamma, W, and B, and, for now, it turns out that the geometric margin doesn't change depending on the norm of W, right? Because in the definition of the geometric margin, notice that we're dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesn't change the geometric margin. This will actually be important, and we'll come back to this later. Notice that you can take the parameters W, B and impose any normalization constant on them, or you can change W and B by any scaling factor - replace them by ten W and ten B, whatever - and it does not change the geometric margin, okay?
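That scale invariance - the counterpart of the functional margin's scaling problem from earlier - can be checked with a quick sketch (invented values again):

```python
import numpy as np

w, b = np.array([1.0, 2.0]), -3.0
x, y = np.array([4.0, 1.0]), +1

def geo_margin(w, b):
    """Geometric margin of (x, y) under parameters (w, b)."""
    return y * (w @ x + b) / np.linalg.norm(w)

g_original = geo_margin(w, b)
g_rescaled = geo_margin(10 * w, 10 * b)  # the factor of 10 cancels in the division
```

Unlike the functional margin, rescaling (W, B) leaves the geometric margin untouched, which is what lets us impose a normalization like the norm of W equals one for free.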
And so in this first formulation, I'm just gonna impose a constraint and say that the norm of W is one, so that the functional and the geometric margins will be the same, and then we'll say: maximize the geometric margin - you maximize gamma subject to every training example having functional margin at least gamma, and this is a geometric margin because when the norm of W is equal to one, the functional and the geometric margins are identical, okay? So this is the maximum margin classifier, and it turns out that if you do this, it'll run, you know, maybe about comparably to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to change this algorithm to let it work in infinite-dimensional feature spaces and come up with very efficient non-linear classifiers. So there's a ways to go before we turn this into a support vector machine, but this is the first step. So are there questions about this? Yeah. Student:[Off mic].
Instructor (Andrew Ng): For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything I've said is for a fixed training set, so you can't change the X's, and you can't change the Y's. Are there other questions? Okay. So, all right. Next week we will take this, and we'll talk about optimization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms. And just a final reminder again: the next discussion session will be on Matlab and Octave, so show up for that if you want to see a tutorial. Okay. See you guys in the next class.