This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is wrap up our discussion of learning theory, and I'm going to start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be about advice for applying machine learning algorithms to problems like the class project, or other problems you may go work on after you graduate from this class.

But let's start by talking about Bayesian statistics and regularization. So you remember from last week, we started to talk about learning theory, and we learned about bias and variance. And in the previous lecture we spent most of the time talking about algorithms for model selection and for feature selection; we talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. For example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and may be less likely to overfit.

What I want to do today is talk about a different way to prevent overfitting: a method called regularization, which lets you keep all the parameters. So here's the idea, and I'm going to illustrate it with, say, linear regression. You take the linear regression model, the very first model we learned about, and we said that we would choose the parameters via maximum likelihood. Right?
And that meant that you would choose the parameters theta that maximize the probability of the data we observe:

theta_ML = arg max_theta ∏_i p(y(i) | x(i); theta).

And so, to give this sort of procedure a name, this is one example of a frequentist procedure, and the frequentist school you can think of as maybe one school of statistics. The philosophical view behind writing this down is that we envision there is some true parameter theta out there that generated the x's and the y's. There's some true parameter theta that governs housing prices, y as a function of x; we don't know what the value of theta is, and we'd like to come up with some procedure for estimating it. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta. And the way we formulated this, theta was not a random variable. Right? That's what we said: theta is just some true value out there, it's not random or anything, we just don't know what it is, and we have a procedure called maximum likelihood for estimating its value. So this is one example of what's called a frequentist procedure.

The alternative to the frequentist school of statistics is the Bayesian school, in which we're going to say that we don't know what theta is, and so we will put a prior on theta. Okay? So people in the Bayesian school would say: well, we don't know the value of theta, so let's represent our uncertainty over theta with a prior. For example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared times the identity:

theta ~ N(0, tau² I).

And so, if I use S to denote my training set, the prior p(theta) represents my beliefs about what the parameters are in the absence of any data.
So not having seen any data, the prior represents what I think theta is most likely to be. And given the training set S, in the Bayesian procedure, we would calculate the posterior probability of the parameters given my training set. Let's write this on the next board. So my posterior on my parameters given my training set, by Bayes' rule, will be proportional to the likelihood times the prior:

p(theta | S) ∝ (∏_i p(y(i) | x(i), theta)) · p(theta).

Let's call this the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a prediction on the price of a new house with input x, I would say that the distribution over the possible housing prices for this new house I'm trying to estimate, given the features x of the house and the training set S I had previously, is given by an integral over my parameters theta of the probability of y given x comma theta, times the posterior distribution of theta given the training set:

p(y | x, S) = ∫ p(y | x, theta) p(theta | S) dtheta.

And in particular, if you want your prediction to be the expected value of y given the input x and the training set, you would integrate y against that density:

E[y | x, S] = ∫ y · p(y | x, S) dy.

You would take the expectation of y with respect to your posterior distribution. Okay?

And you notice that when I was writing this down with the Bayesian formulation, I started to write p(y | x, theta), because this formula now is the probability of y conditioned on the values of the random variables x and theta. So I'm no longer writing semicolon theta, I'm writing comma theta, because I'm now treating theta as a random variable. So all of this is somewhat abstract.
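As a minimal sketch of what that predictive integral means (my own illustration, not code from the lecture), here is a toy one-dimensional model, y = theta·x plus Gaussian noise with a Gaussian prior on theta, where the posterior and the integral are approximated on a grid; the grid range, noise model, and scales are all assumptions:

```python
import numpy as np

def predictive_mean(x_new, X, Y, tau=1.0, sigma=1.0):
    # Grid over a 1-D theta; fine for illustration only.
    thetas = np.linspace(-10.0, 10.0, 2001)
    # log p(theta): Gaussian prior N(0, tau^2), up to a constant.
    log_post = -thetas**2 / (2.0 * tau**2)
    # log p(S | theta): assumed noise model y = theta * x + N(0, sigma^2).
    for x, y in zip(X, Y):
        log_post += -((y - thetas * x) ** 2) / (2.0 * sigma**2)
    post = np.exp(log_post - log_post.max())  # stabilize before exponentiating
    post /= post.sum()                        # normalized posterior on the grid
    # E[y | x_new, S] = integral of (theta * x_new) * p(theta | S) dtheta.
    return np.sum(thetas * x_new * post)
```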
Are there questions about this? No? Okay. Let's try to make this more concrete. It turns out that for many problems, both of these steps in the computation are difficult, because if theta is an n-plus-one-dimensional parameter vector, then this is an integral over R^(n+1), and numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral, if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms, say Bayesian logistic regression, this is hard to do.

And so what's commonly done is, instead of actually computing the full posterior distribution p(theta | S), we take the quantity on the right-hand side and just maximize it. So let me write this down. Commonly, instead of computing the full posterior distribution, we will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most probable value of theta under your posterior distribution:

theta_MAP = arg max_theta (∏_i p(y(i) | x(i), theta)) · p(theta).

And then when you need to make a prediction, you would just predict using your usual hypothesis, with this MAP value of theta in place of the parameter vector you'd otherwise choose. Okay? And notice, the only difference between this and standard maximum likelihood estimation is that instead of choosing the maximum likelihood value of theta, you're maximizing the likelihood, which is what you have for maximum likelihood estimation, times this other quantity, which is the prior. Right?
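For linear regression with the Gaussian prior above, the MAP estimate actually has a closed form. A minimal sketch, assuming Gaussian noise with variance sigma² (the noise model is an assumption I'm adding to make the formula concrete):

```python
import numpy as np

def map_linear_regression(X, y, tau=1.0, sigma=1.0):
    # MAP for linear regression with prior theta ~ N(0, tau^2 I) and
    # assumed noise y = theta^T x + N(0, sigma^2). The maximization has
    # the closed form of ridge regression with lambda = sigma^2 / tau^2:
    #   theta_MAP = (X^T X + lambda I)^(-1) X^T y
    n = X.shape[1]
    lam = (sigma / tau) ** 2
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```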
And let's see, the intuition is that if your prior is theta being Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? There's a Gaussian centered around the point zero, and so most of the mass is close to zero. And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection, eliminating a feature from consideration is the same as setting the corresponding value of theta equal to zero. All right? So if you set theta-five equal to zero, that's the same as eliminating feature five from your hypothesis. And so this is a prior that drives most of the parameter values to values close to zero, and you can think of this as doing something reminiscent of feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures: if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial (well, I think that one had too many bumps in it, but never mind), if you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice (it's sort of hard to see, but you can take my word for it), as tau becomes smaller and smaller, the curves you fit to your data also become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters.
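To illustrate that picture, here is a hypothetical demo (the data points, polynomial degree, and tau values are made up) reusing the map_linear_regression sketch from above: a ninth-order polynomial fit to five points, where shrinking tau shrinks the norm of the coefficients and hence the size of the oscillations:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)
y = np.array([0.1, 0.9, 0.3, 0.8, 0.2])  # made-up data
X = np.vander(x, 10)                     # columns x^9, x^8, ..., x^0

for tau in [10.0, 1.0, 0.1]:
    theta = map_linear_regression(X, y, tau=tau)
    print(f"tau={tau}: ||theta|| = {np.linalg.norm(theta):.3f}")  # norm falls with tau
```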
Okay? Let's see, one last piece of intuition that I'll just toss out there, and you'll get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess. Whereas maximum likelihood for, say, linear regression turns out to be minimizing

Σ_i (y(i) − theta^T x(i))²,

it turns out that if you add this prior term, the optimization objective you end up with is

Σ_i (y(i) − theta^T x(i))² + lambda ‖theta‖²,

where you add an extra term that penalizes your parameters theta for being large. And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect (again, it's kind of hard to see, but just take my word for it) that shrinking the parameters keeps the functions you fit smoother and less likely to overfit. Okay? Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check questions about all this.

Student: The smoothing behavior, is it because [inaudible] actually get different [inaudible]?

Instructor (Andrew Ng): Let's see. Yeah. It depends. Well, most priors with most of the mass close to zero will give this effect, I guess. And just by convention, the Gaussian prior is the most common for models like logistic regression and linear regression, generalized linear models. There are a few other priors I sometimes use, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.
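As a sketch of what this regularized objective looks like for Bayesian logistic regression (again my own illustration: the batch gradient-ascent loop, learning rate, iteration count, and hold-out split are all assumptions, and features are assumed roughly unit-scale), with lambda, equivalently tau, chosen by cross-validation as comes up in the question a bit further below:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_logistic_regression(X, y, lam=1.0, lr=0.01, iters=2000):
    # Gradient ascent on  sum_i log p(y_i | x_i, theta) - lam * ||theta||^2,
    # i.e. the log-likelihood plus the log of a Gaussian prior (up to constants).
    # Labels y are assumed to be in {0, 1}.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (y - sigmoid(X @ theta)) - 2.0 * lam * theta
        theta += lr * grad
    return theta

def choose_lambda(X, y, lambdas, train_frac=0.7):
    # Simple hold-out cross-validation over candidate lambdas.
    m = int(train_frac * len(y))
    best = (np.inf, None)
    for lam in lambdas:
        theta = map_logistic_regression(X[:m], y[:m], lam=lam)
        err = np.mean((sigmoid(X[m:] @ theta) >= 0.5) != y[m:])
        best = min(best, (err, lam))
    return best[1]
```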
And so it turns out that for problems like text classification, where text classification has like 30,000 features or 50,000 features, it seems like an algorithm like logistic regression would be very prone to overfitting. Right? So imagine trying to build a spam classifier where maybe you have 100 training examples but you have 30,000 features or 50,000 features; that seems clearly to be prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with a Gaussian prior, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor (Andrew Ng): Yeah, right. And to pick either tau squared or lambda (I think the relation is lambda equals one over tau squared), you could use cross-validation, yeah. All right? Okay, cool.

So all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. When you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what are called batch learning algorithms, where you're given a training set, you get to run your learning algorithm on it, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning. So here's how the problem goes. I'm first going to give you x1. Let's say it's a classification problem, so I'm first going to give you x1 and then ask you: can you make a prediction on x1? Is the label one or zero? And you've not seen any data yet, so you make a guess. We'll call your guess ŷ1.
And after you've made your prediction, I will then reveal to you the true label y1. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent, right, if you guess randomly. And then I show you x2, and I ask you: can you make a prediction on x2? And so you now maybe make a slightly more educated guess, and call that ŷ2. And after you've made your guess, I reveal the true label to you. And then I show you x3, and then you make your guess, and learning proceeds like this. So this models settings where you have to keep learning even as you're making predictions. Okay? Say you're running a website and you have users coming in; as the first user comes in, you need to start making predictions already about what the user likes or dislikes, and as you keep making predictions you get to see more and more training examples.

So in online learning, what you care about is the total online error, which, if you get a sequence of m examples all together, is

Σ_{i=1}^m 1{ŷ(i) ≠ y(i)}.

Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that many of the learning algorithms you've learned about can be applied to this setting. One thing you could do, when you're asked to make a prediction, say ŷ3, is: well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on the examples leading up to that point. So just run the learning algorithm on all the examples you've seen prior to being asked to make a prediction on a certain example, and then use the resulting hypothesis to make a prediction on the next example.
And it turns out that there are also algorithms, especially the algorithms that we saw that use stochastic gradient descent, that can be adapted very nicely to this setting. So as a concrete example, if you remember the perceptron algorithm, say, you would initialize the parameters theta to be equal to zero, and then after seeing the i-th training example, you'd update the parameters using the standard perceptron learning rule; you've seen this rule a lot of times now. And the same thing if you were using logistic regression: after seeing each training example, you just run essentially one step of stochastic gradient descent on just the example you saw. Okay?

And so the reason I've put this into the learning theory section of this class is that it turns out you can sometimes prove fairly amazing results on your total online error using algorithms like these. I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features x(i) are maybe infinite-dimensional feature vectors, like we saw for support vector machines (and sometimes infinite-dimensional feature vectors may use kernel representations), even when the data is extremely high-dimensional and it seems like you'd be prone to overfitting, you can prove that so long as the positive and negative examples are separated by a margin in this infinite-dimensional space, the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after seeing only a finite number of examples, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay?
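A minimal sketch of the online perceptron loop (using the common mistake-driven variant with labels in {−1, +1}; the lecture notes' version may differ in conventions, e.g. labels in {0, 1}):

```python
import numpy as np

def online_perceptron(stream):
    # stream yields (x, y) pairs one at a time, with labels y in {-1, +1}.
    theta = None
    mistakes = 0                              # total online error
    for x, y in stream:
        if theta is None:
            theta = np.zeros_like(x, dtype=float)
        y_hat = 1 if theta @ x >= 0 else -1   # predict before seeing y
        if y_hat != y:
            mistakes += 1
            theta = theta + y * x             # perceptron update on a mistake
    return theta, mistakes
```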
So let's see. The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that, I mean it won't appear on the midterm and you won't be asked about this specifically in the problem sets. But I know some of you were curious after the previous lecture about how you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay?

But regardless of the theoretical results, the online learning setting is something that comes up reasonably often, and these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today, which will actually be the majority of today's lecture (can I switch to the PowerPoint slides, please?), is talk about advice for applying different machine learning algorithms.
And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully, because it turns out you can take the same machine learning tool, say logistic regression, and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person won't really get it to work, even though it was exactly the same algorithm. Right? And so what I want to do, in the rest of the time I have today, is try to convey to you some of the methods for making sure you're someone who really knows how to get these learning algorithms to work well on problems.

So just some caveats on what I'm going to talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, conceptually most difficult material in this class to understand. All right? So this is not mathematical, but this is not easy. And I want to say this caveat: some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything I say. And some of what I'll say is also not good advice for doing machine learning research; I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work. If you work at a company and you want to deliver a product, or you're building a system and you just want your machine learning system to work. Okay? Some of what I'm about to say today isn't great advice if your goal is to invent a new machine learning algorithm; this is advice for how to make a machine learning algorithm work and deploy a working system.

So, three key areas I'm going to talk about.
One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis. And third, I want to talk about advice for how to get started on a machine learning problem.

And one theme that'll come up later: you've heard about premature optimization in writing software, right? This is when someone over-designs from the start, when someone is writing a piece of code and they choose a subroutine to optimize heavily, and maybe they write that subroutine in assembly or something. And many of us have been guilty of premature optimization, where we're trying to get a piece of code to run faster, and we pick a piece of code, implement it in assembly, and really tune it to run quickly, and it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature optimization. And in undergraduate programming classes, we warn people all the time not to do premature optimization, and people still do it all the time. Right? And it turns out a very similar thing happens in building machine learning systems: many people are often guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system, and that turns out not to be the important piece. Okay? So I'll talk about that later as well.

So let's first talk about debugging learning algorithms. As a motivating example, let's say you want to build an anti-spam system, and let's say you've carefully chosen a small set of 100 words to use as features. All right? So instead of using 50,000 words, you've chosen a small set of 100 features to use for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and you get 20 percent test error, which is unacceptably high. Right? So this is Bayesian logistic regression, and so it's just like maximum likelihood but, you know, with that additional lambda ‖theta‖² term.
And we're maximizing rather than minimizing as well, so there's a minus lambda ‖theta‖² instead of a plus lambda ‖theta‖². So the question is: you implemented your Bayesian logistic regression algorithm, and you tested it on your test set and got unacceptably high error, so what do you do next? Right?

So, you know, one thing you could do is think about the ways you could improve this algorithm. And this is probably what most people will do, instead of saying: well, let's sit down and think about what could've gone wrong, and then try to improve the algorithm. Well, obviously having more training data could only help, so one thing you can do is try to get more training examples. Maybe you suspect that even 100 features was too many, so you might try to get a smaller set of features. What's more common is you might suspect your features aren't good enough, so you might spend some time looking at the email headers and seeing if you can figure out better features for finding spam emails or whatever; just sit around and come up with better features, such as ones based on the email headers. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer to see if that works; and clearly that can't hurt, right, just run gradient descent longer. Or maybe you remember hearing in class that Newton's method converges better, so let's try that instead. You may want to tune the value of lambda, because you're not sure that was right, or maybe you even want to try an SVM, because maybe you think an SVM might work better than logistic regression. So I've only listed eight things here, but you can imagine that if you were actually sitting down building a machine learning system, the options available to you are endless. You can think of hundreds of ways to improve a learning system. And some of these things, like getting more training examples (surely that's going to help), seem like a good use of your time. Right?
And it turns out that this approach of picking ways to improve the learning algorithm, picking one, and going for it might work in the sense that it may eventually get you to a working system, but often it's very time-consuming. And I think it's often largely a matter of luck whether you end up fixing what the problem actually is. In particular, these eight improvements all fix very different problems, and some of them will be fixing problems that you don't have. And if you can rule out six of the eight, say (if by somehow looking at the problem more deeply, you can figure out which one of these eight things is actually the right thing to do), you can save yourself a lot of time. So let's see how we can go about doing that.

The people in industry and in research that I see who are really good would not go and try to change a learning algorithm randomly. There are lots of things that might obviously improve your learning algorithm, but the problem is there are so many of them that it's hard to know what to do. So all the really good people run various diagnostics to figure out where the problem is. Okay?

So for our motivating story: we said Bayesian logistic regression test error was 20 percent, which let's say is unacceptably high. And let's suppose you suspect the problem is either overfitting, so it's high bias, or that maybe you have too few features to classify spam. Oh, excuse me; I think I wrote that wrong. Let's forget the labels on the slide for a second. Suppose you suspect the problem is either high bias or high variance, and some of the text on this slide doesn't make sense. You want to know if you're overfitting, which would be high variance, or if you have too few features to classify spam, which would be high bias. I had those two reversed, sorry. Okay? So how can you figure out whether the problem is one of high bias or high variance? Right?
So it turns out there's a simple diagnostic you can look at that will tell you whether the problem is high bias or high variance. If you remember the cartoon we'd seen previously for high variance problems: when you have high variance, the training error will be much lower than the test error. All right? When you have a high variance problem, that's when you're fitting your training set very well. That's when you're fitting, you know, a tenth-order polynomial to 11 data points. All right? That's when you're just fitting the dataset very well, and so your training error will be much lower than your test error. And in contrast, if you have high bias, that's when your training error will also be high. Right? That's when your data is quadratic, say, but you're fitting a linear function to it, and so you aren't even fitting your training set well.

So just in cartoons, I guess, this is what a typical learning curve for high variance looks like. On the horizontal axis, I'm plotting the training set size m, and on the vertical axis, I'm plotting the error. And so, if you have a high variance problem, you'll notice that as the training set size m increases, your test set error will keep on decreasing. And so this sort of suggests that, well, if you can increase the training set size even further, maybe if you extrapolate the green curve out, that test set error will decrease even further. All right? Another useful thing to plot here, where, let's say, the red horizontal line is the desired performance you're trying to reach, is actually the training error. Right? And it turns out that your training error will actually grow as a function of the training set size, because the larger your training set, the harder it is to fit your training set perfectly. Right? So this is just a cartoon, don't take it too seriously, but in general, your training error will actually grow as a function of your training set size.
Because with small training sets, if you have one data point, it's really easy to fit that perfectly, but if you have 10,000 data points, it's much harder to fit them all perfectly. All right? And so another diagnostic for high variance, and the one that I tend to use more, is to just look at training versus test error. And if there's a large gap between them, then this suggests that getting more training data may help you close that gap. Okay? So this is what the cartoon looks like in the case of high variance.

This is what the cartoon looks like for high bias. Right? If you look at the learning curve, you see that the curve for test error has flattened out already. And so this is a sign that if you get more training examples, if you extrapolate this curve further to the right, it's maybe not likely to go down much further. And this is a property of high bias: getting more training data won't necessarily help.

But again, to me the more useful diagnostic is to plot the training error as well as your held-out test set error. If you find that even your training error is high, then that's a sign that getting more training data is not going to help. Right? In fact, think about it: training error grows as a function of your training set size. And so if your training error is already above your level of desired performance, then getting even more training data is not going to reduce your training error down to the desired level of performance. Right? Because your training error only gets worse as you get more and more training examples. So if you extrapolate further to the right, it's not like this blue line will come back down to the level of desired performance; it will stay up there. Okay?
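A minimal sketch of computing these learning curves (my own illustration; fit and error are placeholder callables for whatever learning algorithm and error metric you're diagnosing):

```python
def learning_curve(X_train, y_train, X_test, y_test, fit, error, sizes):
    # For each training set size m, train on the first m examples and
    # record training and test error. A persistently large gap between
    # the two curves suggests high variance; a training error that is
    # itself above the target performance suggests high bias.
    train_err, test_err = [], []
    for m in sizes:
        theta = fit(X_train[:m], y_train[:m])
        train_err.append(error(theta, X_train[:m], y_train[:m]))
        test_err.append(error(theta, X_test, y_test))
    return train_err, test_err
```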
So for me personally, when looking at a curve like the green curve for test error, I actually tend to find it very difficult to tell whether the curve is still going down or whether it has flattened out. Sometimes you can tell, but very often it's somewhat ambiguous. So for me personally, the diagnostic I use most often to tell whether I have a bias problem or a variance problem is to look at training and test error and see if they're very close together or relatively far apart. Okay?

And so, going back to the list of fixes, look at the first fix: getting more training examples is a way to fix high variance. Right? If you have a high variance problem, getting more training examples will help. Trying a smaller set of features: that also fixes high variance. All right? Trying a larger set of features, or adding email features: these are solutions that fix high bias. Right? High bias being that your hypothesis was too simple, you didn't have enough features. Okay?

And so quite often you see people working on machine learning problems, and they'll remember that getting more training examples helps. And so they'll build a learning system, build an anti-spam system, and it doesn't work, and then they go off and spend lots of time and money and effort collecting more training data, because they'll say, oh well, getting more data's obviously got to help. But if they had a high bias problem in the first place, and not a high variance problem, it's entirely possible to spend three months or six months collecting more and more training data, not realizing that it couldn't possibly help. Right? And so this actually happens a lot in Silicon Valley and in companies. There will often be people building various machine learning systems, and you often see people spending six months working on fixing a learning algorithm when you could've told them six months ago that, you know, that couldn't possibly have helped.
But because they didn't know what the problem was, they could easily spend six months trying to invent new features or something. You see this surprisingly often, and it's somewhat depressing. You could've gone to them and said, I could've told you six months ago that this was not going to help. And the six months is not a joke; you actually see this. In contrast, if you actually figure out whether the problem is one of high bias or high variance, then you can rule out two of these solutions and save yourself many months of fruitless effort. Okay? I actually want to talk about these four fixes at the bottom as well, but before I move on, let me just check if there are questions about what I've talked about so far. No? Okay, great.

So bias versus variance is one thing that comes up often; it's one common diagnostic. For other machine learning problems, it's often up to your own ingenuity to figure out your own diagnostics for what's wrong. All right? If a machine learning algorithm isn't working, very often it's up to you to construct your own tests: do you look at the difference between training and test errors, or do you look at something else? It's often up to your own ingenuity to construct your own diagnostics to figure out what's going on.

What I want to do next is go through another example. This one is slightly more contrived, but it illustrates another of the most common issues that come up in applying learning algorithms. So in this example, let's say you implement Bayesian logistic regression and you get 2 percent error on spam mail and 2 percent error on non-spam mail. So it's rejecting 98 percent of your spam, which is fine; 2 percent of all spam gets through, which is fine; but it's also rejecting 2 percent of your good e-mail, 2 percent of the e-mail from your friends, and that's unacceptably high, let's say.
And let's say that a support vector machine using a linear kernel gets 10 percent error on spam and 0.01 percent error on non-spam, which is closer to the performance you want. And let's say, for the sake of this example, that you're trying to build an anti-spam system, and that you really want to deploy logistic regression to your customers, because of computational efficiency, or because you need to retrain overnight every day and logistic regression just runs more easily and more quickly, or something. Okay? So let's say you want to deploy logistic regression, but it's just not working out well. The question is: what do you do next?

So it turns out that the one other common question that comes up here is whether the algorithm is converging. You might suspect that the problem with logistic regression is that it's just not converging; maybe you need to run more iterations. And it turns out that if you look at the optimization objective, say logistic regression is optimizing J of theta, and you plot that objective as a function of the number of iterations, the curve sort of looks like it's going up, but it also sort of looks like there are asymptotes. When you look at these curves, it's often very hard to tell if the curve has already flattened out. All right? You look at these curves a lot, and you can ask: has the algorithm converged? When you look at J of theta like this, it's often hard to tell. You can run it ten times as long, and it'll often still look like maybe it's going up very slowly, or something. Right? So you'd like a better diagnostic for whether logistic regression has converged than looking at this curve.
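One simple thing you can do is record the objective at every iteration and look at the whole trace. Here is a minimal sketch for logistic regression trained by gradient ascent; this is my own illustration, with an arbitrary step size and synthetic data standing in for a real problem.

    import numpy as np

    def log_likelihood(theta, X, y):
        # Logistic regression log-likelihood, for y in {0, 1}.
        z = X @ theta
        return np.sum(y * z - np.logaddexp(0.0, z))

    def fit(X, y, lr=0.01, iters=500):
        theta = np.zeros(X.shape[1])
        history = []
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(X @ theta)))
            theta += lr * (X.T @ (y - p))        # gradient ascent step
            history.append(log_likelihood(theta, X, y))
        return theta, history

    rng = np.random.RandomState(0)
    X = rng.randn(500, 5)
    y = (X[:, 0] + 0.3 * rng.randn(500) > 0).astype(float)
    theta, history = fit(X, y)
    # Plotting `history` against the iteration number gives exactly the kind
    # of curve described above, and it is often genuinely hard to tell by eye
    # whether it has flattened out.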
The other thing you might suspect is a problem is whether you're optimizing the right function. So what you care about in spam, say, is a weighted accuracy function: A of theta is a sum over your examples of some weights times whether you got each example right. And the weight may be higher for non-spam than for spam mail, because you care about getting your predictions correct on good e-mail much more than on spam, say. So let's say A of theta is the optimization objective that you really care about, whereas what Bayesian logistic regression optimizes is a quantity like that: this sort of maximum likelihood thing, together with the two-norm penalty term that we saw previously. And you might be wondering: is this the right optimization objective to be maximizing? Or do I maybe need to change the value of lambda, to change this parameter? Or should I really be switching to the support vector machine's optimization objective? Okay? Does that make sense?

So the second diagnostic I'm going to talk about is for figuring out: is the optimization algorithm converging, or is the problem with the optimization objective I chose in the first place? Okay?

So here's the diagnostic you can use. To reiterate the story: let's say an SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian logistic regression for your problem. Let me let theta subscript SVM be the parameters learned by the SVM, and let theta subscript BLR be the parameters learned by Bayesian logistic regression. The optimization objective you care about is this weighted accuracy criterion that I talked about just now, and the support vector machine outperforms Bayesian logistic regression: the weighted accuracy on the support vector machine's parameters is better than the weighted accuracy on Bayesian logistic regression's. Further, Bayesian logistic regression tries to optimize an optimization objective like that, which I denoted J of theta.
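Written out, the two quantities being contrasted look roughly like this. This is my reconstruction from the verbal description; the weights w^(i) and the exact form of the penalty are assumptions consistent with the lecture's earlier notation.

    a(\theta) = \sum_i w^{(i)} \, 1\{ h_\theta(x^{(i)}) = y^{(i)} \}

    J(\theta) = \sum_i \log p(y^{(i)} \mid x^{(i)}; \theta) - \lambda \|\theta\|_2^2

The observed fact is that a(\theta_{\mathrm{SVM}}) > a(\theta_{\mathrm{BLR}}), and the diagnostic asks how J(\theta_{\mathrm{SVM}}) compares to J(\theta_{\mathrm{BLR}}).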
And so the diagnostic I choose to use is to check whether J of theta SVM is bigger than or less than J of theta BLR. Okay? Let me explain this on the next slide. We know one fact: the weighted accuracy of the support vector machine is bigger than the weighted accuracy of Bayesian logistic regression. So in order to figure out whether Bayesian logistic regression is failing to converge, or whether I'm just optimizing the wrong objective function, the diagnostic I'm going to use is to check whether this inequality holds. Okay?

So let me explain this. Case 1 is just those two equations copied over: let's say that J of theta SVM is indeed greater than J of theta BLR. But we know that Bayesian logistic regression was trying to maximize J of theta; that's the definition of Bayesian logistic regression. So this means that the value of theta output by Bayesian logistic regression actually fails to maximize J, because the support vector machine returned a value of theta that does a better job of maximizing J. And so this tells me that Bayesian logistic regression didn't actually maximize J correctly, and the problem is with the optimization algorithm: the optimization algorithm hasn't converged.

The other case is as follows: J of theta SVM is less than or equal to J of theta BLR. Okay? In this case, what does that mean? It means that Bayesian logistic regression actually attains the higher value for the optimization objective J than the support vector machine does. The support vector machine, which does worse on your optimization objective, actually does better on the weighted accuracy measure.
So what this means is that something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion. Therefore, this tells you that J of theta is maybe the wrong optimization objective to be maximizing: maximizing J of theta just wasn't a good objective to choose if you care about the weighted accuracy. Okay? Can you raise your hand if this made sense? Cool, good.

So that tells us whether the problem is with the optimization algorithm or with the optimization objective. And going back to this slide, to the eight fixes we had: you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value for lambda, in that lambda times the norm of theta squared term in your objective, fixes the optimization objective. And changing to an SVM is another way of trying to fix the optimization objective. Okay?

And once again, you see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm. That's another very common pattern. The problem is in the formula for your J of theta, and you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, then trying more and more crazy optimization algorithms, whereas optimizing J of theta better wasn't going to fix the problem at all. Okay?
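As a minimal sketch of this diagnostic in code (my own, not from the lecture; J and weighted_accuracy are hypothetical callables you would implement for your own models):

    def diagnose_objective(J, weighted_accuracy, theta_svm, theta_blr):
        """Decide where the problem lies, given that the SVM wins on accuracy."""
        assert weighted_accuracy(theta_svm) > weighted_accuracy(theta_blr)
        if J(theta_svm) > J(theta_blr):
            # Something else maximizes J better than our optimizer managed to:
            # the optimization ALGORITHM hasn't converged.
            return "fix the optimization algorithm (more iterations, Newton's method)"
        else:
            # We maximize J at least as well, yet do worse on what we care about:
            # J is the wrong OBJECTIVE.
            return "fix the optimization objective (change lambda, or change J itself)"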
So that's another example of how these sorts of diagnostics help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective. Okay? Let me think about how much time I have. Hmm, let's see. Well okay, we have time; let's do this. Let me show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is the most complex of the three I'm going to do today, and it draws on reinforcement learning, which is something I'm not going to talk about until close to the end of the course. So I'll go over this fairly quickly, and then after we've talked about reinforcement learning in class, I'll probably come back and redo this exact same example, because you'll understand it more deeply. Okay?

So some of you know that my students and I fly autonomous helicopters. How do you get a machine learning algorithm to design the controller for a helicopter? This is what we do. The first step is to build a simulator for the helicopter; there's a screenshot of our simulator. It's like a joystick simulator; you can fly a helicopter in simulation. Then you choose a cost function (in reinforcement learning it's usually called a reward function, but here I'll call it a cost function): say J of theta is the expected squared error in your helicopter's position. Okay? So J of theta is maybe the expected squared error, or just the squared error. And then you run a reinforcement learning algorithm (you'll learn about RL algorithms in a few weeks) in your simulator, to try to minimize this cost function, to minimize the squared error in how well you're controlling your helicopter's position. Okay?
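Concretely, the cost being minimized is something like the following. This is my reconstruction of "expected squared error in position"; x_t is the helicopter's position under the controller with parameters theta, and x_t^* is the desired position.

    J(\theta) = \mathbb{E}\left[ \sum_t \| x_t - x_t^* \|^2 \right]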
The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you use those to fly your helicopter. So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. What do you do next? In particular, corresponding to the three steps above, there are three natural things you can try. (Oh, the bottom of the slide got chopped off.) You can try to improve the simulator: maybe you think your simulator isn't accurate enough, and you need to capture the aerodynamic effects more accurately, to capture the airflow and turbulence effects around the helicopter more accurately. You can try to modify the cost function: maybe squared error isn't cutting it, and what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging, or something. Okay? So these are the diagnostics that my students and I actually use to figure out what's going on.

Actually, why don't you think about this for a second, think about what you'd do, and then I'll go on and tell you what we do.

All right, so let me tell you how we do this, and you can see whether it's the same as what you came up with. And if you have a better idea than I do, let me know and I'll let you try it on my helicopter. So here's the reasoning. Let's say the controller output by our reinforcement learning algorithm does poorly. Well, suppose the contrary: suppose the following three things hold true. First, suppose the helicopter simulator is accurate, so let's assume we have an accurate model of our helicopter.
And second, let's suppose that the reinforcement learning algorithm correctly controls the helicopter in simulation (we tend to run the learning algorithm in simulation so that the learning algorithm can crash the helicopter and it's fine, right?), so let's assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And third, let's suppose that minimizing J of theta does indeed correspond to correct autonomous flight. If all of these things held true, then the parameters theta RL should actually fly well on my real helicopter. Right? And so the fact that the learned control parameters theta RL do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which one. Okay?

So these are the diagnostics we use. The first one is that we look at the controller and see if it even flies well in simulation, in the simulator of the helicopter that we did the learning on. If the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator. Right? My simulator predicts that this controller will fly well, but it doesn't actually fly well in real life, so the problem could well be in the simulator, and we should spend our effort improving the accuracy of our simulator.

Otherwise, let me write theta subscript human for the human control policy. Let's go ahead and ask a human to fly the helicopter (it could be in the simulator, it could be in real life), and let's measure the mean squared error of the human pilot's flight, and see whether the human pilot does better or worse than the learned controller in terms of optimizing this objective function J of theta. Okay?
So if even a very good human pilot attains a better value on my optimization objective, on my cost function, than my learning algorithm does, then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but the human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm is not managing to minimize J of theta, and that tells me the problem is in the reinforcement learning algorithm.

And finally, if the human actually attains a larger value for J of theta (the human actually has a larger mean squared error for the helicopter's position than my reinforcement learning algorithm does), but I like the way the human flies much better than the way my reinforcement learning algorithm flies, then clearly the problem is in the cost function. Right? Because the human does worse on my cost function but flies much better than my learning algorithm. (Oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide.) That means my learning algorithm does a better job of minimizing the cost function, but it doesn't fly as well as a human pilot. So that tells you that minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay?

And so for these reinforcement learning problems, often reinforcement learning algorithms just work, but when they don't work, these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm.
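Putting the three checks together as a small sketch in code (again mine, not the lecture's; flies_well_in_sim, J_rl, J_human, and human_flies_better are hypothetical measurements you would make on your own system):

    def helicopter_diagnostic(flies_well_in_sim, J_rl, J_human, human_flies_better):
        """Decide which of the three assumptions is broken; lower J is better."""
        if flies_well_in_sim:
            # Works in simulation but not on the real helicopter.
            return "improve the simulator"
        if J_human < J_rl:
            # RL was supposed to minimize J, but a human attains lower cost.
            return "fix the reinforcement learning algorithm"
        if human_flies_better:
            # RL minimizes J better than the human yet flies worse, so
            # minimizing J does not correspond to good flight.
            return "change the cost function J"
        return "diagnostics inconclusive; invent another test"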
And again, if you don't know which of these three problems it is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. It turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on it still. So it's entirely possible to go out, spend six years, write a PhD thesis, and build a much better helicopter simulator; but if you're fixing the wrong problem, it's not going to help.

So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. What I've described are maybe some of the most common diagnostics, ones that I've used and that I've seen be useful for many problems, but very often you'll need to come up with your own for your own specific learning problem.

And I just want to point out that even when a learning algorithm is working well, it's often a good idea to run diagnostics like the ones I've talked about, to make sure you really understand what's going on. This is useful for a couple of reasons. One is that diagnostics like these will often help you understand your application problem better. Some of you will graduate from Stanford and go on to some amazingly high-paying job applying machine learning algorithms to an application problem of significant economic interest, and you'll be working on one specific important machine learning application for many months, or even years. One of the most valuable things for you personally will be to gain an intuitive understanding of what works and what doesn't work on your problem. Right now in industry, in Silicon Valley and around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or years on end.
And when you're doing that, solving a really important problem using learning algorithms, one of the most valuable things is just your own personal intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, are one way for you to get a better and better understanding of these problems.

It turns out, by the way, that there are some Silicon Valley companies that outsource their machine learning. So there will be a company in Silicon Valley that hires a firm in New York to run all their learning algorithms for them. I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed over to an outsourcing agency, you don't maintain that expertise yourself. If there's a problem you really care about, it's your own understanding of the problem, built up over months, that will be really valuable; and if that knowledge is outsourced, you don't get to keep it. I personally think that's a terrible idea; but then, I'm not a businessman, I just see people do this a lot.

Let's see. Another reason for running diagnostics like these comes up in writing research papers. Diagnostics and error analyses, which I'll talk about in a minute, often help convey insight about the problem and help justify your research claims. For example, rather than writing a research paper that says, oh well, here's an algorithm that works, I built this helicopter and it flies, or whatever, it's often much more interesting to say: here's an algorithm that works, and it works because of a specific component X; and moreover, here's the diagnostic that justifies the claim that X was the thing that fixed the problem and made it work. Okay?

So that leads me into a discussion of error analysis, which is another piece of good machine learning practice: a way of understanding what your sources of error are.
So, what I call error analysis. And let me check for questions about this first. Yeah?

Student: What ended up being wrong with the helicopter?

Instructor (Andrew Ng): Oh, I don't know. Let's see. We've flown so many times. The thing that is most difficult with a helicopter is actually building a very... I don't know; it changes all the time. Quite often, it's actually the simulator. Building an accurate simulator of a helicopter is very hard. Yeah. Okay.

So, error analysis is a way of figuring out what is working in your algorithm and what isn't working. We're going to talk about two specific examples. There are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example, though not dissimilar in many ways from actual machine learning systems you see. Let's say you want to recognize people from images; this is a picture of one of my friends. So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. Then you run a face detection algorithm, a machine learning algorithm to detect people's faces. And then, since you want to recognize the identity of the person (that's your application), you segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth, and so on. (I know; she might not want to be my friend after she sees this.) And then, having found all these features, based on what the nose looks like, what the eyes look like, and so on, you feed all the features into a logistic regression algorithm, and your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay?

So this is where error analysis comes in. You have a long, complicated pipeline combining many machine learning components; many of these components are themselves learned.
And so it's often very useful to figure out how much of your error can be attributed to each of these components. What we do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that is this: in the figure on the bottom right, let's say the overall accuracy of the system is 85 percent. Then I want to know where my 15 percent of error comes from. So I'll go to my test set and, instead of running my own background removal, give the algorithm the correct background-versus-foreground segmentation. If I do that (let's color that blue, to denote that I'm supplying ground-truth data on the test set), let's say our accuracy increases to 85.1 percent. Okay? Next I'll give my algorithm the ground-truth face detection output; that is, on my test set I'll just tell the algorithm where the face is. If I do that, let's say my algorithm's accuracy increases to 91 percent, and so on. Then I'll go through each of these components in turn and give each one the ground-truth label; for instance, the nose segmentation algorithm is trying to figure out where the nose is, so I just go in and tell it where the nose is, and it doesn't have to figure that out. As I do this, one component after another, I end up supplying all the correct output labels and end up with 100 percent accuracy. And now you can look at this table (I'm sorry it's cut off on the bottom; the last row says logistic regression, 100 percent) and see how much giving the ground-truth labels for each of these components could help boost your final performance.
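A minimal sketch of this plug-in-the-ground-truth loop (mine, not from the lecture; run_pipeline and the component names are hypothetical, and the accuracies in the final comment are the lecture's example numbers):

    # Components in pipeline order; for each prefix we substitute ground truth.
    COMPONENTS = ["background_removal", "face_detection", "eye_segmentation",
                  "nose_segmentation", "mouth_segmentation", "logistic_regression"]

    def error_analysis(test_set, run_pipeline):
        """run_pipeline(test_set, ground_truth_upto=k) is assumed to run the
        pipeline but replace the outputs of the first k components with
        ground-truth annotations, returning test-set accuracy."""
        baseline = run_pipeline(test_set, ground_truth_upto=0)
        print(f"overall system: {baseline:.1%}")
        for k, name in enumerate(COMPONENTS, start=1):
            acc = run_pipeline(test_set, ground_truth_upto=k)
            print(f"+ ground-truth {name}: {acc:.1%}")
        # e.g. 85.0% -> 85.1% (background) -> 91.0% (face detection) -> ... -> 100%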
In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. Right? So this tells me that if only I could get better face detection, maybe I could boost my accuracy by about 6 percent. Whereas in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent. And so this sort of diagnostic tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction: even if you got the ground truth, that gives you at most 0.1 percent additional accuracy, whereas if you do better face detection, there's a much larger potential for gains. Okay? So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on, and choosing the right piece is critical. This sort of diagnostic tells you which piece may actually be worth your time to work on.

There's another type of analysis that's sort of the opposite of what I just talked about. The error analysis I just described tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline, some really bad performance, and your current performance.

For this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm: features for spelling correction, sender host features, e-mail header features, e-mail text parser features, JavaScript parser features, features for embedded images, and so on. So now that you've built the system, you want to figure out how much each of these components actually contributed.
Maybe you want to write a research paper and claim that a particular piece was what made the big difference; can you actually document that claim and justify it? So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for the improvement from 94 to 99.9 percent. So in ablative analysis, instead of adding components one at a time, we remove components one at a time and see how performance degrades. You start with your overall system, which has 99.9 percent accuracy. Then you remove spelling correction and see how much performance drops; then you remove the sender host features and see how much performance drops; and so on.

In this contrived example, you see that the biggest drop occurred when you removed the text parser features, and so you can make a credible case that the text parser features were what really made the biggest difference here. Okay? And you can also tell, for instance, that when the sender host features were removed, performance dropped only from 99.9 to 98.9 percent. So in case you want to get rid of the sender host features to speed up computation, that would be a good candidate for elimination. Okay?

Student: Are there any guarantees that if you shuffle around the order in which you drop those features, you'll get the same...

Instructor (Andrew Ng): Let me address the question: what if you shuffle the order in which you remove things? The answer is no, there's no guarantee you'd get a similar result. In practice, for both types of analyses, error analysis and ablative analysis, sometimes there's a fairly natural ordering in which you add things or remove things, and sometimes there isn't, and quite often you just choose one ordering and go with it. And don't think of these analyses as fixed formulas; feel free to invent your own as well.
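An analogous sketch for the ablative version, removing feature groups cumulatively in one chosen order (again mine; train_and_score and the feature-group names are hypothetical stand-ins, and as just discussed, a different removal order may give different numbers):

    FEATURE_GROUPS = ["spelling_correction", "sender_host", "email_header",
                      "text_parser", "javascript_parser", "embedded_images"]

    def ablative_analysis(train_and_score):
        """train_and_score(removed) is assumed to retrain the classifier with
        the listed feature groups removed and return test accuracy."""
        removed = []
        acc = train_and_score(list(removed))
        print(f"full system: {acc:.1%}")          # e.g. 99.9%
        for name in FEATURE_GROUPS:
            removed.append(name)                  # remove cumulatively, in order
            acc = train_and_score(list(removed))
            print(f"without {', '.join(removed)}: {acc:.1%}")
        # The biggest single drop points at the component that mattered most.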
One of the things that's done quite often is to take the overall system, remove one component and then put it back, then remove a different one and put it back, until this has been done for every component. Okay.

So the very last thing I want to talk about is some general advice for how to get started on a learning problem. Here's a cartoon description of two broad ways to get started. The first one is to carefully design your system: you spend a long time designing exactly the right features, collecting the right data set, and designing the right algorithmic structure; then you implement it and hope it works. All right? The benefit of this sort of approach is that you get maybe nicer, more scalable algorithms, and maybe you come up with new, elegant learning algorithms. If your goal is to contribute to basic research in machine learning, to invent new machine learning algorithms, this process of slowing down and thinking deeply about the problem is sort of the right way to go: think deeply about a problem and invent new solutions.

The second sort of approach is what I call build-and-fix: you implement something quick and dirty, and then you run error analyses and diagnostics to figure out what's wrong, and you fix those errors. The benefit of this second approach is that it will often get your application working much more quickly. And especially for those of you who end up working in a company, very often it's not the best product that wins; it's the first product to market that wins. Especially in industry, there's really something to be said for building a system quickly and getting it deployed quickly. And this second approach, building a quick-and-dirty, I'm going to say, hack, and then fixing its problems, will actually get you to a system that works well much more quickly.
1381 01:13:30,969 --> 01:13:32,649 And the reason is 1382 01:13:32,649 --> 01:13:36,149 very often it's really not clear what parts of a system are easier to think of to 1383 01:13:36,149 --> 01:13:37,590 build and therefore what 1384 01:13:37,590 --> 01:13:40,179 you need to spends lot of time focusing on. 1385 01:13:40,179 --> 01:13:43,420 So there's that example I talked about just now. Right? 1386 01:13:43,420 --> 01:13:46,929 For identifying 1387 01:13:46,929 --> 01:13:48,710 people, say. 1388 01:13:48,710 --> 01:13:53,199 And with a big complicated learning system like this, a big complicated pipeline like this, 1389 01:13:53,199 --> 01:13:55,590 it's really not obvious at the outset 1390 01:13:55,590 --> 01:13:59,130 which of these components you should spend lots of time working on. Right? And if 1391 01:13:59,130 --> 01:14:00,960 you didn't know that 1392 01:14:00,960 --> 01:14:03,800 preprocessing wasn't the right component, you could easily have 1393 01:14:03,800 --> 01:14:07,269 spent three months working on better background subtraction, not knowing that it's 1394 01:14:07,269 --> 01:14:09,880 just not gonna ultimately matter. 1395 01:14:09,880 --> 01:14:10,769 And so 1396 01:14:10,769 --> 01:14:13,690 the only way to find out what really works was inputting something quickly and 1397 01:14:13,690 --> 01:14:15,350 you find out what parts " 1398 01:14:15,350 --> 01:14:16,889 and find out 1399 01:14:16,889 --> 01:14:17,889 what parts 1400 01:14:17,889 --> 01:14:21,359 are really the hard parts to implement, or what parts are hard parts that could make a 1401 01:14:21,359 --> 01:14:23,079 difference in performance. 1402 01:14:23,079 --> 01:14:26,579 In fact, say that if your goal is to build a 1403 01:14:26,579 --> 01:14:29,309 people recognition system, a system like this is actually far too 1404 01:14:29,309 --> 01:14:31,639 complicated as your initial system. 1405 01:14:31,639 --> 01:14:35,560 Maybe after you're prototyped a few systems, and you converged a system like this. But if this 1406 01:14:35,560 --> 01:14:42,560 is your first system you're designing, this is much too complicated. Also, this is a 1407 01:14:43,570 --> 01:14:48,059 very concrete piece of advice, and this applies to your projects as well. 1408 01:14:48,059 --> 01:14:51,229 If your goal is to build a working application, 1409 01:14:51,229 --> 01:14:55,260 Step 1 is actually probably not to design a system like this. Step 1 is where you would plot your 1410 01:14:55,260 --> 01:14:57,279 data. 1411 01:14:57,279 --> 01:15:01,219 And very often, and if you just take the data you're trying to predict and just plot your 1412 01:15:01,219 --> 01:15:05,729 data, plot X, plot Y, plot your data everywhere you can think of, 1413 01:15:05,729 --> 01:15:10,309 you know, half the time you look at it and go, Gee, how come all those numbers are negative? I thought they 1414 01:15:10,309 --> 01:15:13,899 should be positive. Something's wrong with this dataset. And it's about 1415 01:15:13,899 --> 01:15:18,389 half the time you find something obviously wrong with your data or something very surprising. 1416 01:15:18,389 --> 01:15:21,570 And this is something you find out just by plotting your data, and that you 1417 01:15:21,570 --> 01:15:28,179 won't find out be implementing these big complicated learning algorithms on it. 
Plotting the data sounds so simple, but it's one of those pieces of advice that lots of us give and hardly anyone follows, so take that for what it's worth.

Let me just reiterate that what I just said may be bad advice if your goal is to come up with new machine learning algorithms. For me personally, the learning algorithm I use most often is probably logistic regression, because I have code lying around. So give me a learning problem and I probably won't try anything more complicated than logistic regression on it first. It's only after trying something really simple, and figuring out what's easy and what's hard, that you know where to focus your efforts. But again, if your goal is to invent new machine learning algorithms, then you don't want to hack something up, add another hack to fix it, and then hack it even more to fix that. If your goal is to do novel machine learning research, it pays to think more deeply about the problem, and not to follow this advice too literally.

Shoot, you know what? All right, sorry, I'm running late, but I just have two more slides, so I'm going to go through these quickly. This is what I think of as premature statistical optimization: just like premature optimization of code, quite often people will prematurely optimize one component of a big, complicated machine learning system. Okay?

This is a sort of cartoon that highly influenced my own thinking. It's based on a paper written by Christos Papadimitriou, and it shows how the developmental progress of research often happens. Let's say you want to build a mail delivery robot, so I've drawn a circle that says "mail delivery robot." It seems like a useful thing to have, right? You free people up from having to deliver mail. Now, to deliver mail, you obviously need a robot that can wander around indoor environments, and you need a robot that can manipulate objects and pick up envelopes. So you need to build those two components in order to get a mail delivery robot.
So I've drawn those two components, with little arrows to denote that, say, obstacle avoidance is needed, or would help, to build your mail delivery robot. Well, for obstacle avoidance, clearly you need a robot that can navigate, and you need to detect objects so you can avoid the obstacles. Say we're going to use computer vision to detect the objects. We know that lighting sometimes changes, depending on whether it's morning or noon or evening, and lighting causes the apparent color of things to change; so you need an object detection system that's invariant to the specific colors of an object, because the lighting changes. Well, color, as RGB values, is represented by three-dimensional vectors, and so you need to learn when two colors might be the same thing, when the visual appearances of two colors might come from the same object under a lighting change. To understand that properly, we can go out and study the differential geometry of 3D manifolds, because that helps us build a sound theory on which to develop our 3D similarity learning algorithms. And to really understand the fundamental aspects of this problem, we have to study the complexity of non-Riemannian geometries. And on and on it goes, until eventually you're proving convergence bounds for sampled non-monotonic logic. I don't even know what that is, because I just made it up.

Whereas in reality, chances are that link isn't real. Maybe color invariance just barely helped object recognition (I'm making this up), and maybe differential geometry was hardly going to help 3D similarity learning, so that link fails too. Okay? Each of these circles can represent a person, or a research community, or a thought in your head. And there's a very real chance that all these papers get written on the differential geometry of 3D manifolds because some guy once told someone else that it would help 3D similarity learning.
It's like: a friend of mine told me that color invariance would help in object recognition, so I'm working on color invariance. And now I'm going to tell a friend of mine that his thing will help my problem, and he'll tell a friend of his that his thing will help with his problem. Pretty soon, you're working on convergence bounds for sampled non-monotonic logic, when in reality none of it will ever see the light of day in your mail delivery robot. Okay?

I'm not criticizing the role of theory. There are very powerful theories, like the theory of VC dimension, which sits far, far to the right of this diagram; VC dimension is about as theoretical as it gets, and it has clearly had a huge impact on many applications and dramatically advanced machine learning. Another example is the theory of NP-hardness, which again is about as theoretical as it gets, and which has had a huge impact on all of computer science. But when you're off working on highly theoretical things, to me personally it's important to keep in mind: are you working on something like VC dimension, which is high impact, or on something like convergence bounds for sampled non-monotonic logic, which you're only hoping has some peripheral relevance to some application? Okay?

For me personally, and this is a personal choice, I tend to trust a piece of theory only if I can personally see a link from the theory I'm working on all the way back to an application. If I don't personally see a direct link from what I'm doing to an application, that's fine; I can still choose to work on the theory, but I wouldn't necessarily trust that it will relate to an application if I can't see that link all the way back.

Just to summarize.
One lesson to take away from today is that time spent coming up with diagnostics for learning algorithms is often time well spent; it's often up to your own ingenuity to come up with great diagnostics. When I personally work on a machine learning algorithm, it's not uncommon for me to spend between a third and half of my time just writing diagnostics and trying to figure out what's going right and what's going wrong. Sometimes it's tempting not to, because you want to be implementing learning algorithms and making progress; you don't want to spend all this time implementing tests on your learning algorithms, and it doesn't feel like you're getting anything done. But when I implement learning algorithms, at least a third, and quite often half, of my time really is spent implementing those tests, so that I can figure out what to work on, and I think it's one of the best uses of your time. We also talked about error analyses and ablative analyses, and lastly about the different approaches to getting started and the risks of premature statistical optimization. Okay. Sorry I ran over. I'll be here for a few more minutes for your questions.
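To make the diagnostics point concrete, here is one example of the kind of simple diagnostic meant above: a learning curve comparing training and hold-out error, which helps separate bias from variance. The data is synthetic purely so the sketch runs end to end, and scikit-learn and matplotlib are assumed; it's an illustration, not code from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data so the sketch is runnable as-is.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

sizes, train_err, test_err = [], [], []
for m in range(50, len(X_tr) + 1, 50):
    clf = LogisticRegression().fit(X_tr[:m], y_tr[:m])
    sizes.append(m)
    train_err.append(1 - clf.score(X_tr[:m], y_tr[:m]))
    test_err.append(1 - clf.score(X_te, y_te))

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, test_err, label="hold-out error")
plt.xlabel("training set size m")
plt.ylabel("error")
plt.legend()
plt.show()
# A large, persistent gap between the two curves suggests variance (more data
# or regularization may help); both curves high and close together suggests
# bias (richer features or a bigger model may help).
```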