This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem a little bit like I'm jumping from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture we talked about linear regression, and today I want to talk about, sort of, an adaptation of that called locally weighted regression. It's a very popular algorithm that's actually one of my former mentor's probably favorite machine learning algorithms. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to, again, later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So, to recap what we were talking about in the previous lecture, remember the notation I defined: I used X superscript I, Y superscript I to denote the Ith training example. And when we're talking about linear regression, or linear least squares, we use H of X to denote the value predicted by my hypothesis H on the input XI. And my hypothesis was parameterized by the vector of parameters theta, and so we said that this was equal to the sum over J of theta J times XJ, or, more compactly, theta transpose X. And we had the convention that X subscript zero is equal to one, so this accounts for the intercept term in our linear regression model. And lowercase n here was the notation I was using for the number of features in my training set. Okay?
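In standard notation, the hypothesis recapped above can be written (a reconstruction of the board notation) as:

    h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x, \qquad x_0 = 1.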
So in the example of trying to predict housing prices, we had two features: the size of the house and the number of bedrooms. We had two features, and little n was equal to two.

So, just to finish recapping the previous lecture, we defined this quadratic cost function J of theta equals one-half times the sum from I equals one to m of H theta of XI minus YI, squared, where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture we derived the value of theta that minimizes this in closed form, which was X transpose X, inverse, times X transpose Y. Okay?

So as we move on in today's lecture I'll continue to use this notation and, again, I realize this is a fair amount of notation to remember, so if partway through this lecture you're having trouble remembering what lowercase m is or what lowercase n is or something, please raise your hand and ask.

When we talked about linear regression last time we used two features. One of the features was the size of the house in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how it actually does. So just for example, the choice we made last time was X1 equals the size. And let's leave aside the feature of the number of bedrooms for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually - let's draw this out. So say that axis is the size of the house and that's the price of the house. So if you use this as a feature maybe you get theta zero plus theta one X1, this, sort of, linear model.
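To make that recap concrete, here is a minimal sketch of fitting that straight line with the closed-form solution above, theta equals X transpose X, inverse, times X transpose Y. The tiny data set and the NumPy code are purely illustrative assumptions, not the Portland data from the lecture:

    # Minimal sketch: closed-form (normal equation) fit of price ~ theta0 + theta1 * size.
    # The five (size, price) pairs below are made up for illustration only.
    import numpy as np

    sizes = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])   # living area, sq ft
    prices = np.array([400.0, 330.0, 369.0, 232.0, 540.0])       # price, in $1000s

    X = np.column_stack([np.ones_like(sizes), sizes])  # x_0 = 1 handles the intercept term
    y = prices

    # theta = (X^T X)^{-1} X^T y, computed with solve() rather than an explicit inverse
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print("theta:", theta)
    print("prediction for an 1800 sq ft house:", np.array([1.0, 1800.0]) @ theta)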
If you choose - let me just copy the same data set over, right? You can define a set of features where X1 is equal to the size of the house and X2 is the square of the size of the house. Okay? So X1 is the size of the house in, say, square footage, and X2 is just whatever the square footage of the house is, squared. That would be another way to come up with a feature, and if you do that then the same algorithm will end up fitting a quadratic function for you: theta zero plus theta one X1 plus theta two X1 squared. Okay? Because this is actually X2. And depending on what the data looks like, maybe this is a slightly better fit to the data.

You can actually take this even further, right? Let's see - I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can fit a model theta zero plus theta one X plus theta two X squared plus, up to, theta six X to the power of six, a sixth-order polynomial, to these seven data points. And if you do that you find that you come up with a model that fits your data exactly. In this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly, and you'll probably find that the curve you get will look something like that.

And on the one hand, this is a great model in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model in the sense that none of us seriously thinks this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. Of the models we have here, I feel like maybe the quadratic model fits the data best.
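To illustrate the feature-choice point numerically, here is a rough sketch (again with made-up numbers, not the seven houses on the board) showing that the same least-squares machinery fits a line, a quadratic, or a sixth-order polynomial, depending only on which powers of x we hand it as features:

    # Sketch: least-squares fits with features x, x^2, ..., x^k for k = 1, 2, 6.
    # Seven invented (size, price) points; the degree-6 fit passes through all of them.
    import numpy as np

    sizes = np.array([1000., 1400., 1800., 2200., 2600., 3000., 3400.])
    prices = np.array([200., 260., 290., 350., 360., 420., 430.])

    def fit_poly(x, y, degree):
        x = x / x.max()                                   # rescale for numerical conditioning
        X = np.vander(x, degree + 1, increasing=True)     # columns 1, x, x^2, ..., x^degree
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares solution for theta
        return theta

    for degree in (1, 2, 6):
        print(degree, np.round(fit_poly(sizes, prices, degree), 2))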
The linear model, on the other hand - it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing. So we'll actually come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features.

Just to give these a name, we call this the problem of underfitting and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit. And this problem here we refer to as overfitting and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of the specific data set, right? It just so happens that, of the seven houses we sampled in Portland, or wherever you collect data from, that house happens to be a bit more expensive, that house happened to be a bit less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set rather than the true underlying trend of how housing prices vary as a function of the size of the house. Okay? So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features. So if you want to address this in a learning problem there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate, somewhat, the need for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.
And just to define the term: linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is defined as an algorithm that has a fixed number of parameters that are fit to the data. Okay? So in linear regression we have a fixed set of parameters theta, right, that must be fit to the data.

In contrast, what I'm going to talk about now is our first non-parametric learning algorithm. The formal definition is not very intuitive, so I'll also give a second, say, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set; usually it's defined as the number of parameters growing linearly with the size of the training set. This is the formal definition. A slightly less formal definition is that the amount of stuff your learning algorithm needs to keep around will grow linearly with the training set or, another way of saying it, this is an algorithm that will need to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition. But what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression, which also goes by a couple of other names - which also goes by the name of Loess, for historical reasons. Loess is usually spelled L-O-E-S-S, sometimes spelled like that, too. I just call it locally weighted regression.

So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So for my motivating example, let's say that I have a training set that looks like this, okay? So this is X and that's Y.
If you run linear regression on this and you fit maybe a linear function to it, you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether the features are right. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as X plus X squared plus maybe some function of sine of X or something. You can actually sit around and fiddle with features, and after a while you can probably come up with a set of features for which the model is okay, but let's talk about an algorithm that you can use without needing to do that.

So suppose you want to evaluate your hypothesis H at a certain query point, lowercase x. Okay? And let's say you want to know what's the predicted value of Y at this position of X, right? So for linear regression, what we were doing was we would fit theta to minimize the sum over I of YI minus theta transpose XI, squared, and return theta transpose X. Okay? So that was linear regression.

In contrast, in locally weighted linear regression you're going to do things slightly differently. You're going to look at this point X, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of X. Okay? So I'll look at where I want to evaluate my hypothesis - I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis - and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using the term "subset" loosely here - well, let's come back to that later. So I take this data set and I fit a straight line to it and maybe I get a straight line like that.
And what I'll do is then evaluate this straight line at that particular value, and that will be the value my algorithm returns. This will be the predicted value - this will be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize the sum over I of WI times YI minus theta transpose XI, squared, where these terms W superscript I are called weights. There are many possible choices for the weights; I'm just going to write one down. So this is E to the minus, XI minus X, squared, over two.

So let's look at what these weights really are, right? So notice that - suppose you have a training example XI that is very close to X, so that this is small. Then if XI minus X is small, so if XI minus X is close to zero, this is E to the minus zero, and E to the zero is one. So if XI is close to X, then WI will be close to one. In other words, the weight associated with the Ith training example will be close to one if XI and X are close to each other. Conversely, if XI minus X is large then - I don't know, what would WI be? Zero. Zero, right. Close to zero. Right. So if XI is very far from X then this is E to the minus of some large number, and E to the minus some large number will be close to zero. Okay?

So the picture is: if I'm querying at a certain point X, shown on the X axis, and if my data set, say, looks like that, then I'm going to give the points close to this a large weight and give the points far away a small weight. So for the points that are far away, WI will be close to zero, and so the points that are far away will not contribute much at all to this summation, right?
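Written out (reconstructing what is on the board; the bandwidth parameter comes a little later), locally weighted regression fits

    \min_\theta \; \sum_{i=1}^{m} w^{(i)} \big( y^{(i)} - \theta^T x^{(i)} \big)^2,
    \qquad
    w^{(i)} = \exp\!\left( -\frac{\big(x^{(i)} - x\big)^2}{2} \right),

and then returns theta transpose x at the query point x.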
So in effect this is the sum over I of roughly one times the squared term for the nearby points, plus roughly zero times the squared term for the faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters theta, paying much more attention to fitting the points close by accurately, while ignoring the contribution from faraway points. Okay? Yeah?

Student: Why is the weight exponential [inaudible]?

Instructor: Yeah. Let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to choose different choices by default, and there is somewhat of a literature debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible] is?

For those of you who are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution, okay? But this actually has absolutely nothing to do with the Gaussian distribution. So it's not that we're assuming XI is Gaussian or anything like that; there is no such interpretation. This is just a convenient function that happens to be bell-shaped, so don't endow it with any Gaussian semantics. Okay?

So, in fact - well, if you remember the familiar bell-shaped Gaussian - the way of assigning weights to these points is that, if you imagine putting a bell-shaped bump centered around the position where you want to evaluate your hypothesis H, then the weight I'll give to this point here is proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point.
And the weight given to this point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.

One last small generalization of this is that normally there's one other parameter to this algorithm, which I'll denote as tau. Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; it's just a convenient form of function. This parameter tau is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Okay? So let me just copy my diagram from the other side, I guess. So if tau is very small, and that's a query X, then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points that are far away fall off rapidly. Whereas if tau is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?

So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis outputs at a point like this, you end up fitting a straight line and making that prediction. To ask what the hypothesis outputs at that other value, you fit a straight line there and you predict that value. It turns out that every time you vary your query point - every time you ask your learning algorithm to make a prediction for how much a new house costs or whatever - you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of X, the position of the query where you're trying to make a prediction. Okay?
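Putting the pieces together, here is a minimal sketch of that procedure for one-dimensional inputs, with the bell-shaped weights written with the bandwidth tau in the exponent's denominator. The data, the value of tau, and the code itself are illustrative assumptions, not the lecture's own implementation:

    # Sketch: locally weighted linear regression, refit from scratch at each query point.
    import numpy as np

    def lwr_predict(x_query, x_train, y_train, tau=0.8):
        # w^(i) = exp(-(x^(i) - x_query)^2 / (2 * tau^2)); tau controls how fast weights fall off
        w = np.exp(-(x_train - x_query) ** 2 / (2.0 * tau ** 2))
        X = np.column_stack([np.ones_like(x_train), x_train])   # intercept term plus x
        W = np.diag(w)
        # weighted normal equations: theta = (X^T W X)^{-1} X^T W y
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
        return np.array([1.0, x_query]) @ theta                 # evaluate the local line at the query

    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 10.0, 60)
    y_train = np.sin(x_train) + 0.1 * rng.standard_normal(60)   # made-up non-linear data

    print(lwr_predict(2.0, x_train, y_train))   # roughly sin(2.0)
    print(lwr_predict(7.5, x_train, y_train))   # roughly sin(7.5)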
If you do this for every point along the X axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay? So in the problem set we're actually going to let you play around more with this algorithm, so I won't say too much more about it here. But to finally move on to the next topic, let me check what questions you have. Yeah?

Student: It seems like you still have the same problem of overfitting and underfitting, like when you choose tau. Like, if you make it too small in your -

Instructor: Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. What you just said about - some of these things I'll leave you to discover for yourself in the homework problem. You'll actually see what you just mentioned. Yeah?

Student: It almost seems like you're not even thoroughly [inaudible] with this locally weighted - you had all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points.

Instructor: Right. So the question is, sort of - it's almost as if you're not building a model, because you need the entire data set. And the other way of saying that is that this is a non-parametric learning algorithm. So this - I don't know, I won't debate whether, you know, we're really building a model or not. But this is a perfectly fine - so I think, when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model. So it actually - we've actually used this quite successfully to model, sort of, the dynamics of this autonomous helicopter. Yeah?

Student: Does this algorithm learn the weights based on the data?

Instructor: Learn what weights?

Student: Oh, the weights WI, instead of using [inaudible].

Instructor: I see, yes.
So it turns out there are a few things you can do. One thing that is quite common is how to choose this bandwidth parameter tau, right, using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question.

Student: [inaudible] Gaussian sometimes if you [inaudible] Gaussian and then -

Instructor: Oh, I guess. Let's see. Boy. The weights are not random variables and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights; it's actually fine to choose a function that doesn't even integrate to one, that integrates to infinity, say, as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay? That's the last question, and let's move on.

Student: Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict [inaudible] for each house, and so should the end result for each input - I'm seeing this very constantly for -

Instructor: Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again. So you're actually right: if you have a very large training set, then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets as well, but that's not something I want to talk about today. If you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on.
Let's take more questions later. So, okay, so that's locally weighted regression.

Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And in particular it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm.

So let's put aside locally weighted regression for now; we'll just talk about ordinary, unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the observed values Y? So why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay? It turns out that there are many sets of assumptions that are sufficient to justify why we do least squares, and this is just one of them. So it's not that, just because I present one set of assumptions under which least squares regression makes sense, this is the only set of assumptions; even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will, sort of, help give one rationalization, like, one reason, for doing least squares regression.

And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics. So let's assume, in our example of predicting housing prices, that the price the house is sold for, YI, is equal to some linear function of the features, theta transpose XI, plus some term epsilon I. Okay? And epsilon I will be our error term.
You can think of the error term as capturing unmodeled effects - like maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever, additional features that we just failed to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both of these: unmodeled effects, things we forgot to model, maybe the function isn't quite linear or something, as well as random noise - like maybe that day the seller was in a really bad mood and just refused to sell at a reasonable price or something.

And now I will assume that the errors have a probability distribution. I'll assume that the errors epsilon I are distributed - I'll write this as epsilon I is distributed according to a Gaussian distribution with mean zero and variance sigma squared. Okay? So the script N here stands for normal, right, to denote a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared.

Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you. Great, almost everyone. So, in other words, the density for the Gaussian is what you've seen before: the density for epsilon I would be one over root two pi sigma, times E to the negative epsilon I squared over two sigma squared, right? And the density of our epsilon I will be this bell-shaped curve, with one standard deviation being, sort of, sigma. Okay? This is the formula for that bell-shaped curve. So, let's see, I can erase that. Can I erase the board?
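For reference, the density just written on the board, and the conditional density it implies for the next step, are (in standard notation):

    \varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2), \qquad
    p\big(\varepsilon^{(i)}\big) = \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left( -\frac{\big(\varepsilon^{(i)}\big)^2}{2\sigma^2} \right),

    p\big( y^{(i)} \mid x^{(i)}; \theta \big) = \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left( -\frac{\big( y^{(i)} - \theta^T x^{(i)} \big)^2}{2\sigma^2} \right).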
So this implies that the probability distribution of the price of a house, given XI and the parameters theta, is going to be Gaussian with that density. Okay? In other words, this is saying that the price of a house, given the features of the house and my parameters theta, is going to be a random variable that's distributed Gaussian with mean theta transpose XI and variance sigma squared, right? Because we imagine that the way the housing prices are generated is that the price of a house is equal to theta transpose XI plus some random Gaussian noise with variance sigma squared. So the price of a house is going to have mean theta transpose XI and variance sigma squared, right? Does this make sense? Raise your hand if this makes sense. Yeah, okay, lots of you.

One point of notation - oh, yes?

Student: Assuming we don't know anything about the error, why do you assume here that the error is Gaussian?

Instructor: Right. So, boy, why do I say the error is Gaussian? Two reasons, right? One is that it turns out to be mathematically convenient to do so, and the other is, I don't know, I can also mumble about justifications such as the central limit theorem. It turns out that, for the vast majority of problems, if you apply a linear regression model like this and try to measure the distribution of the errors, not all the time but very often you find that the errors really are Gaussian - that this Gaussian model is a good assumption for the error in regression problems like these. Some of you may have heard of the central limit theorem, which says that the sum of many independent random variables will tend towards a Gaussian.
So if the error is caused by many effects, like the mood of the seller, the mood of the buyer, some other features that we missed, whether the place has a garden or not, and if all of these effects are independent, then by the central limit theorem you might be inclined to believe that the sum of all these effects will be approximately Gaussian. In practice, I guess, the two real answers are that, one, in practice this is actually a reasonably accurate assumption, and, two, it turns out to be mathematically convenient to do so. Okay? Yeah?

Student: It seems like we're saying that if we assume the error around our model has zero mean, then the error is centered around our model. It seems almost like we're trying to assume what we're trying to prove.

Instructor: That's the [inaudible], but, yes, you are assuming that the error has zero mean. Which is, yeah, right. I think later this quarter we'll get to some of the other things, but for now just think of this as a mathematically - it's actually not an unreasonable assumption. I guess, in machine learning, almost none of the assumptions we make are true in an absolute sense, right? Because, for instance, housing prices are priced in dollars and cents, so the errors in prices are not continuous-valued random variables, because houses can only be priced at a certain number of dollars and a certain number of cents and you never have fractions of cents in housing prices, whereas a Gaussian random variable would. So in that sense the assumptions we make are never "absolutely true," but for practical purposes this is an accurate enough assumption that it'll be useful to make. Okay? I think in a week or two we'll actually come back and say a bit more about the assumptions we make and when they help our learning algorithms and when they hurt them. We'll say a bit more about it when we talk about generative and discriminative learning algorithms, like, in a week or two. Okay?
So let me point out one bit of notation, which is that when I wrote this down I actually wrote P of YI given XI and then semicolon theta, and I'm going to use this notation when we are not thinking of theta as a random variable. So in statistics this is sometimes called the frequentist point of view, where you think of there as being some, sort of, true value of theta that's out there that's generating the data, but we don't know what theta is. Theta is not a random variable, right? It's not like there's some random value of theta out there; it's that there's some true value of theta out there, it's just that we don't know what the true value of theta is. So if theta is not a random variable, then I'm going to avoid writing P of YI given XI comma theta, because that would mean the probability of YI conditioned on X and theta, and you can only condition on random variables. So in this part of the class, where we're taking, sort of, the frequentist viewpoint rather than the Bayesian viewpoint, we're thinking of theta not as a random variable but just as something we're trying to estimate, and we use the semicolon notation. So the way to read this is: this is the probability of YI given XI and parameterized by theta. Okay? So you read the semicolon as "parameterized by." And in the same way here, I'll say YI given XI, parameterized by theta, is distributed Gaussian with that density. All right.

So we're going to make one more assumption. Let's assume that the error terms are IID, okay? Which stands for independently and identically distributed. So I'm going to assume that the error terms are independent of each other, right? The identically distributed part just means that they all come from the same Gaussian distribution with the same variance, but the more important part is that I'm assuming the epsilon I's are independent of each other.

Now, let's talk about how to fit a model.
The probability of Y given X, parameterized by theta - I'm actually going to give this another name. I'm going to write this down and call this the likelihood of theta: the likelihood of theta is the probability of Y given X, parameterized by theta. And so this is going to be the product over my training set, like that, which is, in turn, going to be a product of those Gaussian densities that I wrote down just now, right? Okay?

So in terms of notation, I guess, I define this term here to be the likelihood of theta, and the likelihood of theta is just the probability of the data Y, right, given X and parameterized by theta. The terms likelihood and probability are often confused. So the likelihood of theta is the same thing as the probability of the data you saw, so likelihood and probability are, sort of, the same thing, except that when I use the term likelihood I'm trying to emphasize that I'm taking this thing and viewing it as a function of theta. Okay? So likelihood and probability are really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, I then call it the likelihood. Okay? So hopefully you'll hear me say "the likelihood of the parameters" and "the probability of the data," right, rather than "the likelihood of the data" or "the probability of the parameters." So try to be consistent in that terminology.

So given that the probability of the data is this, and this is also the likelihood of the parameters, how do you estimate the parameters theta? So given a training set, what parameters theta do you want to choose for your model? Well, the principle of maximum likelihood estimation says, right, that you choose the value of theta that makes the data as probable as possible. So choose theta to maximize the likelihood, or, in other words, choose the parameters that make the data as probable as possible, right? So this is maximum likelihood estimation.
So we choose the parameters that make it as likely, as probable, as possible for us to have seen the data we actually did. For mathematical convenience, let me define lowercase l of theta; this is called the log likelihood function, and it's just the log of capital L of theta. So this is the log of that product over i of one over root two pi sigma times e to that exponent; I won't bother to write out what's in the exponent for now, it's just the expression from the previous board. The log of a product is the same as the sum of the logs, so this is a sum of logs, which simplifies to m times the log of one over root two pi sigma, plus - and then the log and the exponential cancel each other, right? The log of e to something is just whatever's inside the exponent. So, you know what, let me write this on the next board.

Okay. So maximizing the likelihood, or maximizing the log likelihood, is the same as minimizing that term over there. You see it, right? Because of the minus sign, maximizing this is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta. So what we've just shown is that the ordinary least squares algorithm that we worked out in the previous lecture is just maximum likelihood, assuming this probabilistic model, assuming IID Gaussian errors on our data. Okay?
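(For reference, the simplification just walked through works out to the following, again a reconstruction in the same notation:)

$$
\ell(\theta) \;=\; \log L(\theta)
\;=\; m\,\log\frac{1}{\sqrt{2\pi}\,\sigma}
\;-\; \frac{1}{\sigma^{2}}\cdot\frac{1}{2}\sum_{i=1}^{m}\bigl(y^{(i)}-\theta^{T}x^{(i)}\bigr)^{2}.
$$

The first term does not depend on theta, so maximizing the log likelihood over theta is exactly minimizing the quadratic cost J of theta, one-half the sum of (theta transpose x(i) minus y(i)) squared.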
One thing that we'll actually use in the next lecture: notice that the value of sigma squared doesn't matter. No matter what sigma squared is - it has to be a positive number, since it's the variance of a Gaussian - the value of theta we end up with will be the same, because minimizing this gives you the same theta regardless of sigma squared. So it's as if, in this model, the value of sigma squared doesn't really matter. Just remember that for the next lecture; we'll come back to this again.

Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah? [A student asks whether judging theta by the likelihood alone is enough, since it also depends on the family of models and the number of parameters, so couldn't it overfit?] Yeah, I mean, you're asking about overfitting and whether this is a good model. I think the things you're mentioning are deeper questions about learning algorithms that we'll come back to later, so I don't really want to get into that right now. Any more questions? Okay.

So this endows linear regression with a probabilistic interpretation, and I'm actually going to use this probabilistic interpretation to derive our next learning algorithm, which will be our first classification algorithm. You'll recall that I said regression problems are ones where the variable Y that you're trying to predict is continuous-valued. Now I'm going to talk about our first classification problem, where the value Y you're trying to predict is discrete: it can take on only a small number of discrete values. In this case I'll talk about binary classification, where Y takes on only two values. You run into classification problems if you're trying to do, say, medical diagnosis, and want to decide, based on some features, whether a patient has a disease or does not have a disease. Or, in the housing example, maybe you're trying to decide whether this house will sell in the next six months or not, and the answer is either yes or no.
It'll either be sold in the next six months or it won't be. Other standard examples: if you want to build a spam filter, is this e-mail spam or not? It's yes or no. Or, you know, some of my colleagues work on predicting whether a computer system will crash, so you have a learning algorithm to predict: will this computing cluster crash over the next 24 hours? And, again, it's a yes-or-no answer.

So there's X, there's Y, and in a classification problem Y takes on two values, zero and one; that's binary classification. So what can you do? Well, one thing you could do is take linear regression, as we've described it so far, and apply it to this problem. Given this data set, you can fit a straight line to it; maybe you get that straight line there. But this data set I've drawn is an amazingly easy classification problem: it's pretty obvious to all of us what the relationship between X and Y is. You just look at a value around here; to the right of it Y is one, to the left of it Y is zero. So you apply linear regression to this data set, you get a reasonable fit, and you can then take your linear regression hypothesis, this straight line, and threshold it at 0.5. If you do that, you'll certainly get the right answer: you predict that if X is to the right of, sort of, the midpoint here then Y is one, and if it's to the left of that midpoint then Y is zero. So some people actually do this, apply linear regression to classification problems, and sometimes it'll work okay, but in general it's actually a pretty bad idea to apply linear regression to classification problems like these, and here's why. Let's say I change my training set by giving you just one more training example, all the way up there. Given this training set, it's actually still entirely obvious what the relationship between X and Y is: above some value Y is one, and below it Y is zero.
Giving you this additional training example really shouldn't change anything; I didn't really convey much new information, and there's no surprise that this point corresponds to Y equals one. But if you now fit linear regression to this data set, you end up with a line that, I don't know, maybe looks like that, and now the predictions of your hypothesis have changed completely if you threshold your hypothesis at 0.5. Okay? [A student asks whether, with that far-off point, there might now be an interval in between where the prediction is zero.] Oh, you mean like that? Right, yeah, sure, with a theta set like that. Yes, you're right, but this is an example, and for this example it works. [Another student asks whether adding more far-off points would change it even more.] Yeah, I think that would actually make it even worse; you'd get a line that's pulled out even further. So this is my example, and I get to make it whatever I want, right? But there's no deep meaning to this. The point is just that it can be a really bad idea to apply linear regression to a classification problem. Sometimes it works fine, but usually I wouldn't do it.
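(To illustrate that point numerically, here is a small sketch, my own rather than the lecture's, that fits least squares to binary labels, thresholds at 0.5, and shows how a single far-out positive example shifts the decision boundary. The data are made up.)

```python
import numpy as np

def fit_least_squares(x, y):
    """Closed-form least squares with intercept: theta = (X^T X)^{-1} X^T y."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

def boundary(theta):
    """The x where theta_0 + theta_1 * x = 0.5, i.e. the 0.5-threshold boundary."""
    return (0.5 - theta[0]) / theta[1]

# An easy 1-D classification problem: y = 0 on the left, y = 1 on the right.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1  ], dtype=float)
print("boundary without outlier:", boundary(fit_least_squares(x, y)))   # 5.0

# Add one more positive example far to the right; the label is no surprise,
# but the fitted line tilts and the 0.5 crossing moves noticeably (~6.2),
# so a point like x = 6 would now be classified as 0.
x2 = np.append(x, 30.0)
y2 = np.append(y, 1.0)
print("boundary with outlier:   ", boundary(fit_least_squares(x2, y2)))
```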
So what do you want to do for classification? If you know the value of Y lies between zero and one, then to kind of fix this problem let's start by changing the form of our hypothesis so that my hypothesis always lies in the unit interval between zero and one. Okay? If I know Y is either zero or one, then let's at least not have my hypothesis predict values much larger than one or much smaller than zero. So instead of choosing a linear function for my hypothesis, I'm going to choose something slightly different. In particular, I'm going to choose this function: H subscript theta of X is going to equal G of theta transpose X, where G of Z is one over one plus e to the negative Z, so this becomes one over one plus e to the negative theta transpose X. G of Z is called the sigmoid function, and it is often also called the logistic function; it goes by either of these names. And what G of Z looks like is the following: on the horizontal axis I'm going to plot Z, and G of Z will look like this. (I didn't draw that very well.) So G of Z tends toward zero as Z becomes very negative, and G of Z tends toward one as Z becomes large, and it crosses the vertical axis at 0.5. So that's the sigmoid function, also called the logistic function. Yeah, question? [A student asks why not just use a step function at 0.5 instead, since the output is binary.] Let me come back to that later. So where did I get this function from? I just wrote this function down, but it actually turns out that there are two reasons for using it that we'll come to. One is that when we talk about generalized linear models, we'll see that this falls out naturally as part of that broader class of models. And another reason, which we'll talk about next week, is that there are a couple of, I think, very beautiful reasons for choosing the logistic function; we'll see that in a little bit. For now, let me just define it, and take my word for it that this is a reasonable choice. Okay? But notice that the values output by my hypothesis will now always be between zero and one.
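(A minimal sketch of the hypothesis just defined; the function names are mine and the example values are arbitrary.)

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); the output always lies in (0, 1)."""
    return sigmoid(theta @ x)

# g(0) = 0.5, g(z) -> 0 as z -> -inf, g(z) -> 1 as z -> +inf
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[0.0000454, 0.5, 0.99995]
```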
Furthermore, just like we did for linear regression, I'm going to endow the outputs of my hypothesis with a probabilistic interpretation. So I'm going to assume that the probability that Y is equal to one, given X and parameterized by theta, is equal to H subscript theta of X. In other words, I'm going to imagine that my hypothesis is outputting all these numbers that lie between zero and one, and I'm going to think of my hypothesis as trying to estimate the probability that Y is equal to one. Okay? And because Y has to be either zero or one, the probability that Y equals zero is going to be one minus that. All right? It turns out you can actually take these two equations and write them more compactly: P of Y given X, parameterized by theta, is H subscript theta of X to the power of Y, times one minus H of X to the power of one minus Y. Okay? I know this looks somewhat bizarre, but it actually makes the derivation much nicer. If Y is equal to one, then this equation is H of X to the power of one times something to the power of zero, and anything to the power of zero is just one; so if Y equals one, this is just saying P of Y equals one is equal to H subscript theta of X. And in the same way, if Y is equal to zero, then this is P of Y equals zero equals the first thing to the power of zero, which is just one, times the second thing to the power of one. So this is a compact way of writing both of those equations together in one line.

So let's do our parameter fitting. Again, you can ask: given this model of my data, how do I fit the parameters theta of my model? The likelihood of the parameters is, as before, just the probability of the data: the product over i of P of Y(i) given X(i), parameterized by theta.
Which is just plugging that compact expression in. Okay? I dropped the theta subscript on H just so I could write a little bit less. Oh, excuse me, and these should be X(i)'s and Y(i)'s. Okay?

So, as before, let's say we want to find the maximum likelihood estimate of the parameters theta; we want to find the setting of the parameters theta that maximizes the likelihood L of theta. It turns out, very often, when you work through the derivations, that it's much easier to maximize the log of the likelihood rather than the likelihood itself. So the log likelihood, lowercase l of theta, is just the log of capital L, which will therefore be a sum over the training examples. Okay? [A student points out a term that should be Y(i).] Oh, yes. Thanks. And so, to fit the parameters theta of our model, we'll find the value of theta that maximizes this log likelihood.

So, to maximize this function, it turns out we can actually apply the same gradient descent algorithm that we learned before. Remember, when we talked about least squares, the first algorithm we used to minimize the quadratic error function was gradient descent. We can use essentially the same algorithm to maximize the log likelihood. And you remember, that algorithm was just: repeatedly take the value of theta and replace it with the previous value of theta plus a learning rate alpha times the gradient of the objective, here the log likelihood, with respect to theta. Okay? One small change: previously we were trying to minimize the quadratic error term, and today we're trying to maximize rather than minimize, so rather than having a minus sign we have a plus sign.
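(Written out, as a reconstruction of the board work using the compact expression above, the quantity being maximized and the update just described are:)

$$
\ell(\theta) \;=\; \sum_{i=1}^{m}\Bigl[\,y^{(i)}\log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1-y^{(i)}\bigr)\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr],
\qquad
\theta := \theta + \alpha\,\nabla_\theta\,\ell(\theta).
$$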
So this is just gradient ascent, but for maximization rather than minimization; we actually call this gradient ascent, and it's really the same algorithm. So to figure out what this gradient is - in order to derive the gradient ascent update, what you need to do is compute the partial derivatives of your objective function with respect to each of your parameters theta J, right? It turns out that if you actually compute this partial derivative - so you take this lowercase l of theta, the log likelihood of theta, and take its partial derivative with respect to theta J - you find that it's equal to this. The derivation isn't terribly complicated, but in the interest of saving you from watching me write down a couple of blackboards full of math, I'll just write down the final answer. The way you get it is you plug in the definition of H subscript theta as a function of X(i), take derivatives, and work through the algebra; it simplifies down to this formula. Okay? And so what that gives you is that gradient ascent is the following rule: theta J gets updated as theta J plus alpha times the sum over i of Y(i) minus H theta of X(i), times X(i) subscript J. Okay?

Does this look familiar to anyone? Do you remember seeing this formula in the last lecture? Right. When I worked out batch gradient descent for least squares regression, I actually wrote down exactly the same thing, or maybe with a minus sign depending on how it's arranged; I had exactly the same learning rule last time for least squares regression. So is this the same learning algorithm, then? What's different?
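(Here is a minimal sketch of batch gradient ascent with that update rule, my own illustration; the learning rate, iteration count, and data are arbitrary.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.05, n_iters=5000):
    """Batch gradient ascent on the log likelihood of logistic regression.

    Update: theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij.
    X is (m, n+1) with a leading column of ones; y is (m,) of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)            # predictions in (0, 1)
        theta += alpha * X.T @ (y - h)    # gradient of the log likelihood
    return theta

# Tiny made-up 1-D example: label is 1 for larger x.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
theta = logistic_gradient_ascent(X, y)
print(theta, sigmoid(X @ theta).round(2))
```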
How come I was making all that noise earlier about least squares regression being a bad idea for classification problems, and then I did a bunch of math, skipped some steps, and am now, sort of, claiming at the end that they're really the same learning algorithm? [A student suggests that the difference is in the hypothesis itself.] Say that again. Oh, right. Okay, cool. No, exactly, right: this is not the same, and the reason is that in logistic regression this is different from before. The definition of H subscript theta of X(i) is not the same as the definition I was using in the previous lecture. In particular, this is no longer theta transpose X(i); it's not a linear function anymore. It's the logistic function applied to theta transpose X(i). Okay? So even though this looks cosmetically similar, even though this is similar on the surface to the batch gradient descent rule I derived last time for least squares regression, this is actually a totally different learning algorithm. Okay? And it turns out that it's actually no coincidence that you ended up with the same learning rule; we'll talk a bit more about this later when we discuss generalized linear models. This is one of the most elegant things about generalized linear models that we'll see later: even though we're using a different model, you end up with what looks like the same learning algorithm, and it's actually no coincidence. Cool.

One last comment, as part of, sort of, the learning process: over here I said I take the derivatives and I ended up with this line. I didn't want to make you sit through a long algebraic derivation, but later today or later this week, please do go home and look at the lecture notes, where I wrote out the entirety of this derivation in full, and make sure you can follow every single step of how we take partial derivatives of this log likelihood to get this formula over here. Okay?
By the way, for those of you who are interested in seriously mastering the machine learning material: when you go home and look at the lecture notes, it will actually be very easy for most of you to read through every line and go "yep, that makes sense, that makes sense," and, sort of, say "cool, I see how you get this line." But you want to make sure you really understand the material. My concrete suggestion to you would be to go home, read through the lecture notes, check every line, and then cover up the derivation and see if you can derive it yourself. In general, that's usually good advice for studying technical material like machine learning: if you work through a proof and you think you understood every line, the way to make sure you really understood it is to cover it up and see if you can rederive the entire thing yourself. I did this a lot when I was trying to study various pieces of machine learning theory and various proofs, and it's actually a great way to study, because you cover up the derivations and see if you can do them yourself without looking at the original. All right.

I probably won't get to Newton's method today. I just want to take one quick digression to talk about one more algorithm, which the discussion earlier was sort of alluding to: the perceptron algorithm. I'm not going to say a whole lot about the perceptron algorithm, but it's something we'll come back to later this quarter when we talk about learning theory. So in logistic regression we said that G of Z, and hence my hypothesis, outputs values that are numbers between zero and one. The question is: what if you want to force G of Z to output a value that is either zero or one? So the perceptron algorithm defines G of Z to be this.
So the picture, or the cartoon rather, is that instead of the sigmoid function, G of Z now looks like the step function you were asking about earlier: zero when Z is less than zero, and one when Z is greater than or equal to zero. As before, we use H subscript theta of X equals G of theta transpose X. Okay? So everything is exactly the same as before, except that G of Z is now the step function. It turns out there's a learning rule called the perceptron learning rule whose update looks exactly the same as the gradient ascent rule for logistic regression: theta J gets updated as theta J plus alpha times Y(i) minus H theta of X(i), times X(i) subscript J. So it looks just like the gradient ascent rule for logistic regression. But this is a very different flavor of algorithm than least squares regression and logistic regression, and, in particular, because it outputs only values that are either zero or one, it turns out to be very difficult to endow this algorithm with probabilistic semantics. And even though this learning rule looks, again, cosmetically very similar to what we have in logistic regression, it is actually a very different type of learning rule than the others we've seen in this class. Because this is such a simple learning algorithm - it just computes theta transpose X, thresholds it, and outputs zero or one - it's a simpler algorithm than logistic regression, I think. When we talk about learning theory later in this class, the simplicity of this algorithm will let us come back and use it as a building block. Okay? But that's all I want to say about this algorithm for now.
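(A minimal sketch of the perceptron as just described, my own illustration; the convention of updating one example at a time and the made-up data are mine.)

```python
import numpy as np

def step(z):
    """g(z) for the perceptron: hard threshold, outputting 0 or 1."""
    return (z >= 0).astype(float)

def perceptron(X, y, alpha=1.0, n_epochs=10):
    """Perceptron learning rule, applied one example at a time.

    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij,
    where h_theta(x) = step(theta^T x). X is (m, n+1) with x_0 = 1.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            h = step(x_i @ theta)             # prediction is exactly 0 or 1
            theta += alpha * (y_i - h) * x_i  # no change if the prediction is correct
    return theta

# Same tiny 1-D example as before: label 1 for larger x.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
theta = perceptron(X, y)
print(theta, step(X @ theta))   # predictions match the labels on this separable data
```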