This presentation is delivered by the Stanford Center for Professional Development.

Okay. So welcome back. What I want to do today is start a new chapter. In particular, I want to talk about learning theory. So in the previous, I guess, eight lectures so far, you've learned about a lot of learning algorithms, and I hope you now understand a little about some of the best and most powerful tools of machine learning. And all of you are now sort of well qualified to go into industry and apply these powerful learning algorithms, really the most powerful learning algorithms we know, to all sorts of problems, and in fact, I hope you start to do that on your projects right away as well.

You might remember, I think it was in the very first lecture, that I made an analogy to learning to be a carpenter: if you imagine you're going to carpentry school to learn to be a carpenter, then only a small part of what you need to do is to acquire a set of tools.
If you learn to be a carpenter, you don't just walk in and pick up a toolbox and [inaudible]. Knowing, when you need to cut a piece of wood, whether to use a rip saw, or a jigsaw, or a keyhole saw, whatever, really mastering the tools, is also an essential part of becoming a good carpenter. And what I want to do in the next few lectures is actually give you a sense of mastery of the machine learning tools all of you have. Okay?

And so in particular, in the next few lectures what I want to do is talk more deeply about the properties of different machine learning algorithms, so that you can get a sense of when it's most appropriate to use each one. And it turns out that one of the most common scenarios in machine learning is that someday you'll be doing research, or working at a company, and you'll apply one of the learning algorithms you learned about. You may apply logistic regression, or support vector machines, or Naive Bayes or something, and for whatever bizarre reason, it won't work as well as you were hoping, or it won't quite do what you were hoping it to.
To me, what really separates the people that really understand and really get machine learning, compared to people that maybe read the textbook and work through the math, will be what you do next. It will be in your decisions: when you apply a support vector machine and it doesn't quite do what you wanted, do you really understand enough about support vector machines to know what to do next and how to modify the algorithm? And to me that's often what really separates the great people in machine learning from the people that just read the textbook and work through the math. Okay?

So today's lecture will mainly be on learning theory, and we'll start to talk about some of the theoretical results of machine learning. The next lecture, later this week, will be on algorithms for addressing, or fixing, some of the problems that the learning theory will point out to us and help us understand. And then two lectures from now, that lecture will be almost entirely focused on practical advice for how to apply learning algorithms. Okay?
So, do you have any questions about this before I start? Okay.

So the very first thing we're gonna talk about is something that you've probably already seen on the first homework, and something that I alluded to previously, which is the bias-variance trade-off. So take ordinary least squares, the first learning algorithm we learned about. If you fit a straight line through this data, this is not a very good model. Right. And if this happens, we say it has underfit the data, or we say that this is a learning algorithm with very high bias, because it is failing to fit the evident quadratic structure in the data. And for the purposes of this discussion, you can informally think of the bias of the learning algorithm as representing the fact that even if you had an infinite amount of training data, even if you had tons of training data, this algorithm would still fail to fit the quadratic function, the quadratic structure in the data. And so we think of this as a learning algorithm with high bias.

Then there's the opposite problem. So that's the same dataset.
If you fit a fourth-order polynomial to this dataset, then you'll be able to interpolate the five data points exactly, but clearly this is also not a great model for the structure that you and I probably see in the data. And we say that this algorithm, excuse me, is overfitting the data, or alternatively that this algorithm has high variance. Okay? And the intuition behind overfitting and high variance is that the algorithm is fitting spurious patterns in the data, or is fitting idiosyncratic properties of this specific dataset, be it a dataset of housing prices or whatever. And quite often, there will be some happy medium of fitting a quadratic function that maybe won't interpolate your data points perfectly, but also captures more structure in your data than a simple model which underfits.
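As an aside, the under- versus overfitting picture above can be reproduced numerically. The sketch below fits polynomials of degree one, two, and four by least squares and compares their training errors; the five data points are made up for illustration, since the lecture's dataset exists only on the board.

```python
import numpy as np

# Hypothetical 5-point dataset with roughly quadratic structure
# (these numbers are invented; roughly y = x^2 + 1 plus small noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 5.1, 9.8, 17.2])

def fit_and_training_error(degree):
    """Least-squares polynomial fit; returns the sum of squared residuals."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))

err_line = fit_and_training_error(1)     # high bias: a line misses the curvature
err_quad = fit_and_training_error(2)     # the "happy medium"
err_quartic = fit_and_training_error(4)  # interpolates all 5 points exactly

print(err_line, err_quad, err_quartic)
```

Note that the quartic's training error is essentially zero, yet that is exactly the overfitting case: zero training error says nothing about how the model behaves between or beyond the five points.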
And I'd say that you can have exactly the same picture for classification problems as well. So let's say this is my training set, right, of positive and negative examples, and so you can fit logistic regression with very high order polynomial features, so h of x equals the sigmoid function applied to a tenth-order polynomial. And if you do that, maybe you get a decision boundary like this. Right. That does indeed perfectly separate the positive and negative classes, but this is another example of overfitting. And in contrast, if you fit logistic regression to this data with just the linear features, with none of the quadratic features, then maybe you get a decision boundary like that, which can also underfit. Okay.
So what I want to do now is understand this problem of overfitting versus underfitting, of high bias versus high variance, more explicitly. I will do that by posing a more formal model of machine learning and then trying to prove when each of these two twin problems comes up. And as the motivating example for our initial foray into learning theory, I want to talk about learning a linear classifier, in which h of x is equal to g of theta transpose x. Okay? So a linear classifier. And for this class I'm going to use, for g of z, the indicator that z is greater than or equal to zero.

With apologies in advance for changing the notation yet again: for the support vector machine lectures we used y equals minus one or plus one. For the learning theory lectures, it turns out it'll be a bit cleaner if I switch back to y equals zero or one again, so I'm gonna switch back to my original notation.
And so you can think of this as being similar to logistic regression, except that now we're going to force the logistic regression algorithm to output labels that are either zero or one. Okay? So you can think of this as a classifier that outputs the labels zero or one instead of probabilities.

And so as usual, let's say we're given a training set of m examples. That's just my notation for writing a set of m examples ranging over i equals one through m. And I'm going to assume that the training examples (x(i), y(i)) are drawn IID, independently and identically distributed, from some distribution, script D. Okay? And if you're running a classification problem on houses, like features of the house comma whether the house will be sold in the next six months, then this is just the joint distribution over features of houses and whether or not they'll be sold. Okay? So I'm gonna assume that training examples are drawn IID from some probability distribution, script D.
Well, same thing for spam: if you're trying to build a spam classifier, then this would be the distribution over what emails look like, comma, whether they are spam or not.

And in particular, to understand, or to simplify, to understand the phenomena of bias and variance, I'm actually going to use a simplified model of machine learning. Logistic regression fits its parameters to a model like this by maximizing the log likelihood, but in order to understand learning algorithms more deeply, I'm just going to assume a simplified model of machine learning. Let me just write that down.

So I'm going to define training error as, so this is the training error of a hypothesis h subscript theta. I write this epsilon hat of h subscript theta. If I want to make the dependence on a training set explicit, I'll write this with a subscript S there, where S is the training set. And I'll define this as, let's see, epsilon hat of h equals one over m, times the sum from i equals one to m, of the indicator that h of x(i) is not equal to y(i). Okay. I hope the notation is clear. This is a sum of indicator functions for whether your hypothesis misclassifies the i-th example.
And so when you divide by m, this is just, in your training set, the fraction of training examples your hypothesis misclassifies. So that's defined as the training error. And training error is also called empirical risk.

The simplified model of machine learning I'm gonna talk about is called empirical risk minimization. And in particular, I'm going to assume that the way my learning algorithm works is it will choose parameters theta that minimize my training error. Okay? And it will be this learning algorithm that we'll prove properties about. And it turns out that you can think of this as the most basic learning algorithm, the algorithm that minimizes your training error. It turns out that logistic regression and support vector machines can be formally viewed as approximations to it, because it turns out that if you actually want to do this, this is a non-convex optimization problem. It's actually NP-hard to solve this optimization problem.
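As a minimal sketch of empirical risk minimization for the classifier h(x) = 1{theta transpose x >= 0}, the code below computes training error and then brute-forces the minimization over a small grid of parameter vectors. The dataset and the grid are made up for illustration, and grid search is only feasible here because the example is tiny; in general, as noted, minimizing 0-1 training error is a hard non-convex problem.

```python
import numpy as np

def training_error(theta, X, y):
    """Empirical risk: fraction of training examples that the linear
    classifier h(x) = 1{theta^T x >= 0} gets wrong."""
    predictions = (X @ theta >= 0).astype(int)
    return float(np.mean(predictions != y))

# Tiny invented 1-D dataset with an intercept feature [1, x]:
# positives for x >= 2, negatives below.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])

# Brute-force ERM over a small grid of candidate parameter vectors.
candidates = [np.array([b, 1.0]) for b in np.linspace(-4, 4, 81)]
best = min(candidates, key=lambda th: training_error(th, X, y))
print(training_error(best, X, y))
```

Here the grid contains a separating classifier, so the minimized training error comes out to zero; that is precisely why, in what follows, we need to ask what low training error implies about generalization.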
And logistic regression and support vector machines can both be viewed as approximations to this non-convex optimization problem, obtained by finding a convex approximation to it. So think of this as similar to what algorithms like logistic regression are doing.

So let me take that definition of empirical risk minimization and actually just rewrite it in a different, equivalent way. For the results I want to prove today, it turns out that it will be useful to think of our learning algorithm as not choosing a set of parameters, but as choosing a function. So let me say what I mean by that. Let me define the hypothesis class, script H, as the class of all hypotheses, in other words, as the class of all linear classifiers, that your learning algorithm is choosing from. Okay? So h subscript theta is a specific linear classifier, and each of these is a function mapping from the input domain X to the class labels zero and one.
Each of these is a function, and as you vary the parameters theta, you get different functions. And so let me define the hypothesis class script H to be the class of all functions that, say, logistic regression can choose from. Okay. So this is the class of all linear classifiers. And so I'm going to define, or maybe redefine, empirical risk minimization as, instead of writing it as choosing a set of parameters, I want to think of it as choosing a function h hat in the hypothesis class script H that minimizes my training error. Okay?

So actually, can you raise your hand if it makes sense to you why this is equivalent to the previous formulation? Okay, cool. Thanks. So the reason it's useful to think of algorithms as choosing a function from the class instead is that, in the more general case, this set, script H, can be some other class of functions. Maybe it's the class of all functions represented by a neural network, or some other class of functions the learning algorithm wants to choose from.
And this definition of empirical risk minimization will still apply. Okay?

So what we'd like to do is understand whether empirical risk minimization is a reasonable algorithm. Alex?

Student: [Inaudible] a function that's defined by g of theta transpose x, or is it now more general?

Instructor (Andrew Ng): I see, right. So let's see, I guess the question is: is h theta still defined by g of theta transpose x, or is this more general? So, there are really two answers to that. One is, this framework is general. So for the purpose of this lecture, it may be useful to you to keep in mind the example where script H is the class of all linear classifiers, such as those used by, like, the perceptron algorithm or logistic regression. Everything on this board, however, is actually more general. H can be any set of functions mapping from the input domain X to the set of class labels zero and one, and then you can perform empirical risk minimization over any hypothesis class.
For the purpose of today's lecture, I am going to restrict myself to talking about binary classification, but it turns out everything I say generalizes to regression and other problems as well. Does that answer your question?

Student: Yes.

Instructor (Andrew Ng): Cool. All right. So I wanna understand whether empirical risk minimization is a reasonable algorithm. In particular, what are the things we can prove about it? So clearly we don't actually care about training error; we don't really care about making accurate predictions on the training set, or at least that's not the ultimate goal. The ultimate goal is generalization: how well it makes predictions on examples that we haven't seen before. How well it predicts prices, or sale or no-sale outcomes, of houses you haven't seen before. So what we really care about is generalization error, which I write as epsilon of h. And this is defined as the probability that if I sample a new example, x comma y, from that distribution script D, my hypothesis mislabels that example.
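To make the distinction between training error and generalization error concrete, here is a small Monte Carlo sketch under an assumed distribution D (a noisy threshold rule; all the specifics are invented for illustration). We pick the empirical-risk-minimizing threshold on a small training set, then estimate epsilon of h hat on a large fresh sample from D.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n (x, y) pairs from an invented distribution D:
    x uniform on [0, 4], label y = 1{x >= 2}, flipped with probability 0.1."""
    x = rng.uniform(0, 4, size=n)
    y = (x >= 2).astype(int)
    flip = rng.uniform(size=n) < 0.1
    return x, np.where(flip, 1 - y, y)

def error(threshold, x, y):
    """Fraction of examples that h(x) = 1{x >= threshold} mislabels."""
    return float(np.mean((x >= threshold).astype(int) != y))

x_train, y_train = sample(20)
# ERM over the class of 1-D threshold classifiers.
thresholds = np.linspace(0, 4, 401)
h_hat = min(thresholds, key=lambda t: error(t, x_train, y_train))

train_err = error(h_hat, x_train, y_train)   # epsilon hat: measured on S
x_test, y_test = sample(100_000)
gen_err = error(h_hat, x_test, y_test)       # Monte Carlo estimate of epsilon
print(train_err, gen_err)
```

Because of the 10 percent label noise, the generalization error of any threshold classifier under this D is at least about 0.1, whatever the training error happens to be; that gap between epsilon hat and epsilon is exactly what the theory below is about.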
And in terms of notational convention, usually if I place a hat on top of something, it usually means, not always, but it usually means, that it is an attempt to estimate the thing without the hat. So for example, think of epsilon hat, the training error, as an attempt to approximate the generalization error. Okay, so the notational convention is that things with hats on top are usually things we're using to estimate other quantities. And h hat is the hypothesis output by the learning algorithm, an attempt to estimate the best function from X to Y.

So let's actually prove some things about when empirical risk minimization will do well, in the sense of giving us low generalization error, which is what we really care about.

In order to prove our first learning theory result, I'm going to have to state two lemmas. The first is the union bound, which is the following: let A1 through Ak be k events. And when I say events, I mean events in the sense of a probabilistic event that either happens or not. And these are not necessarily independent.
So there's some joint distribution over the events A1 through Ak, and maybe they're independent, maybe not; no assumption on that. Then the probability of A1 or A2 or dot, dot, dot, up to Ak, this union symbol, this is just set notation for probability, it just means "or". So the probability of at least one of these events occurring, of A1 or A2 or up to Ak, is less than or equal to the probability of A1, plus the probability of A2, plus dot, dot, dot, plus the probability of Ak. Okay?

So the intuition behind this is just, I'm not sure if you've seen Venn diagram depictions of probability before; if you haven't, what I'm about to do may be a little cryptic, so just ignore what I'm about to do if you haven't seen it before. But if you have seen it before, then this is really great: the probability of A1 union A2 union A3 is less than or equal to P of A1, plus P of A2, plus P of A3. Right.
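The union bound can be checked numerically. In this sketch the three events are deliberately overlapping (and therefore dependent) functions of one uniform draw; the particular events are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three dependent events defined on a common uniform draw U:
#   A1 = {U < 0.3},  A2 = {U > 0.8},  A3 = {0.2 < U < 0.4}  (A1 and A3 overlap)
u = rng.uniform(size=100_000)
a1, a2, a3 = u < 0.3, u > 0.8, (u > 0.2) & (u < 0.4)

p_union = float(np.mean(a1 | a2 | a3))                     # P(A1 or A2 or A3)
p_sum = float(np.mean(a1) + np.mean(a2) + np.mean(a3))     # P(A1)+P(A2)+P(A3)
print(p_union, p_sum)
```

The inequality holds sample by sample, since the indicator of a union never exceeds the sum of the individual indicators, so the Monte Carlo estimate of the union probability is guaranteed not to exceed the sum; the slack (here the overlap of A1 and A3) is exactly the double-counted mass in the Venn diagram picture.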
So the total mass in the union of these three sets is at most the sum of the masses in the three individual sets; it's not very surprising. It turns out that depending on how you define your axioms of probability, this is actually one of the axioms, so I won't actually try to prove this. This is usually taken as axiomatic; subadditivity of probability measures is what this is sometimes called as well. But in learning theory it's commonly called the union bound, so I'll just call it that.

The other lemma I need is called the Hoeffding inequality. And again, I won't actually prove this, I'll just state it. Let Z1 up to Zm be m IID Bernoulli random variables with mean Phi. So the probability that Zi equals one is equal to Phi. So let's say you observe m IID Bernoulli random variables and you want to estimate their mean. So let me define Phi hat, and this is again that notational convention: Phi hat denotes an attempt to estimate something else.
So we define Phi hat to be one over m, times the sum from i equals one through m of Zi. Okay? So this is our attempt to estimate the mean of these Bernoulli random variables by sort of taking their average. And let any gamma be fixed. Then the Hoeffding inequality is that the probability that your estimate of Phi is more than gamma away from the true value of Phi is bounded by two e to the negative two gamma squared m. Okay?

So just in pictures: this theorem holds, this lemma, the Hoeffding inequality, is just a statement of fact, it just holds true. But let me now draw a cartoon to describe some of the intuition behind this, I guess. So let's say this is the real number line from zero to one, and Phi is the mean of your Bernoulli random variables. You will remember from, you know, some undergraduate probability or statistics class, the central limit theorem, which says that when you average a lot of things together, you tend to get a Gaussian distribution.
And so when you toss M coins with bias Phi, we observe these M Bernoulli random variables, and we average them; then the probability distribution of Phi hat will roughly be a Gaussian, let's say. Okay? It turns out, if you haven't seen this before, that it's actually the cumulative distribution function of Phi hat that converges to that of a Gaussian. Technically, Phi hat can only take on a discrete set of values, because these are fractions of one over M, so it doesn't really have a density; but just as a cartoon, think of it as converging roughly to a Gaussian. So what the Hoeffding inequality says is that if you pick a value of gamma, so there's one interval gamma and another interval gamma, then it's saying that the probability mass in the tails, in other words the probability that my value of Phi hat is more than gamma away from the true value, that the total probability mass in these tails is at most two E to the negative two gamma squared M. Okay? That's what the Hoeffding inequality says; if you can't read that, this is just the right hand side of the bound, two E to the negative two gamma squared M.
So it bounds the probability that you make a mistake in estimating the mean of a Bernoulli random variable. And the cool thing about this bound, the interesting thing behind this bound, is that it [inaudible] exponentially in M. So it says that for a fixed value of gamma, as you increase the size of your training set, as you toss a coin more and more, then the width of this Gaussian will shrink. The width of this Gaussian will actually shrink like one over root M. And that will cause the probability mass left in the tails to decrease exponentially quickly, as a function of M. And this will be important later. Yeah? Student: Does this come from the central limit theorem [inaudible]? Instructor (Andrew Ng): No, it doesn't. So this is proved by a different... no, so there may be a version of the central limit theorem, but the versions I'm familiar with tend to be sort of asymptotic, whereas this works for any finite value of M. Oh, and this bound holds even if M is equal to two, or if M is [inaudible]; if M is very small, the central limit theorem approximation is not gonna hold, but this theorem holds regardless. Okay?
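To make the lemma concrete, here is a small simulation sketch comparing the empirical tail probability against the Hoeffding bound. The helper names and the specific values of phi, m, and gamma below are illustrative assumptions, not anything from the lecture.

```python
import math
import random

def hoeffding_bound(gamma, m):
    """Right-hand side of the Hoeffding inequality: 2 * exp(-2 * gamma^2 * m)."""
    return 2.0 * math.exp(-2.0 * gamma ** 2 * m)

def tail_probability(phi, m, gamma, trials=1000, seed=0):
    """Monte Carlo estimate of P(|phi_hat - phi| > gamma), where phi_hat is
    the average of m IID Bernoulli(phi) draws."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        phi_hat = sum(1 if rng.random() < phi else 0 for _ in range(m)) / m
        if abs(phi_hat - phi) > gamma:
            bad += 1
    return bad / trials

# The empirical tail mass should sit below the bound (here 2 * e^{-2}, about 0.27).
empirical = tail_probability(phi=0.5, m=100, gamma=0.1)
bound = hoeffding_bound(gamma=0.1, m=100)
```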
I'm drawing this just as a cartoon to help explain the intuition, but this theorem just holds true, without reference to the central limit theorem. All right. So let's start to understand empirical risk minimization, and what I want to do is begin with studying empirical risk minimization for a [inaudible] case than logistic regression, and in particular I want to start with studying the case of finite hypothesis classes. So let's say script H is a class of K hypotheses. Right. So this is K functions, where, you know, each of these is just a function mapping from inputs to outputs; there are no parameters in this. And so what empirical risk minimization would do is it would take the training set, and it'll then look at each of these K functions, and it'll pick whichever of these functions has the lowest training error. Okay?
So now, logistic regression uses a continuous, infinitely large class of hypotheses, script H; but to prove our first learning theorem I actually want to just describe the case of when you have a finite hypothesis class, and then we'll later generalize that to infinite hypothesis classes. So empirical risk minimization takes the hypothesis with the lowest training error, and what I'd like to do is prove a bound on the generalization error of H hat. All right. So in other words, I'm gonna prove that somehow minimizing training error allows me to do well on generalization error. And here's the strategy. The first step in this proof: I'm going to show that training error is a good approximation to generalization error, and then I'm going to show that this implies a bound on the generalization error of the hypothesis output by empirical risk minimization.
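The procedure just described, picking whichever of the K functions has the lowest training error, can be sketched in a few lines. Representing hypotheses as plain Python functions and using threshold classifiers as the toy class are illustrative assumptions of mine, not the lecture's notation.

```python
def training_error(h, examples):
    """Empirical error: fraction of (x, y) pairs that hypothesis h misclassifies."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

def erm(hypotheses, examples):
    """Empirical risk minimization over a finite class: return the hypothesis
    in the class with the lowest training error."""
    return min(hypotheses, key=lambda h: training_error(h, examples))

# A toy finite class of K = 3 threshold classifiers on the real line.
H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.25, 0.5, 0.75)]
data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
h_hat = erm(H, data)  # picks the threshold at 0.5, which fits every example
```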
And I just realized, this class, I guess, is also maybe a slightly notation-heavy class, in that I'm introducing a reasonably large set of new symbols. So again, if in the course of today's lecture you're looking at some symbol and you don't quite remember what it is, please raise your hand and ask. [Inaudible] what's that, what was that, was that a generalization error or was it something else? So raise your hand and ask if you don't understand the notation I was defining. Okay. So let me introduce this in two steps. The strategy is, I'm gonna show that training error is a good approximation to generalization error, and this will imply that minimizing training error will also do pretty well in terms of minimizing generalization error. And this will give us a bound on the generalization error of the hypothesis output by empirical risk minimization. Okay? So here's the idea. Let's not even consider all the hypotheses at once; let's pick any hypothesis, HJ, in the class script H, and until further notice let's just consider that one fixed hypothesis. So pick any one hypothesis and let's talk about that one.
Let me define ZI to be the indicator function for whether this hypothesis misclassifies the i-th example; excuse me, Z subscript I. Okay? So ZI would be zero or one, depending on whether this one hypothesis, which is the only one I'm gonna even consider for now, whether this hypothesis misclassified that example. And so my training set is drawn randomly from some distribution script D, and depending on what training examples I've got, these ZIs would be either zero or one. So let's figure out what the probability distribution of ZI is. Well, ZI takes on the value of either zero or one, so clearly it's a Bernoulli random variable; it can only take on these values. Well, what's the probability that ZI is equal to one? In other words, for a fixed hypothesis HJ, when I sample my training set IID from distribution D, what is the chance that my hypothesis will misclassify it? Well, by definition, that's just the generalization error of my hypothesis HJ.
So ZI is a Bernoulli random variable with mean given by the generalization error of this hypothesis. Raise your hand if that made sense. Oh, cool. Great. And moreover, all the ZIs have the same probability of being one, and all my training examples were drawn IID, and so the ZIs are also independent, and therefore the ZIs themselves are IID random variables. Okay? Because my training examples were drawn independently of each other, by assumption. If you read this as the definition of training error, the training error of my hypothesis HJ is just that: the average of my ZIs, which, well, I previously defined like this. Okay? And so epsilon hat of HJ is exactly the average of M IID Bernoulli random variables, drawn from a Bernoulli distribution with mean given by the generalization error. So, well, this is the average of M IID Bernoulli random variables, each of which has mean given by the generalization error of HJ.
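The argument above can be checked numerically: for a fixed hypothesis, the misclassification indicators are IID Bernoulli variables whose mean is the generalization error, so their average, the training error, should concentrate around it. The toy distribution D below, uniform inputs with ten percent label noise, is an assumption of mine chosen so the generalization error is known to be exactly 0.1; it is not from the lecture.

```python
import random

def misclassification_indicators(h, examples):
    """Z_i = 1 if h misclassifies the i-th example, else 0."""
    return [1 if h(x) != y else 0 for x, y in examples]

def sample_example(rng):
    """Toy distribution D (hypothetical, for illustration only): x uniform on
    [0, 1], label 1{x >= 0.5}, flipped with probability 0.1."""
    x = rng.random()
    y = 1 if x >= 0.5 else 0
    if rng.random() < 0.1:
        y = 1 - y
    return (x, y)

# For h(x) = 1{x >= 0.5}, the generalization error under D is the noise rate,
# 0.1, so epsilon_hat = (1/m) * sum(Z_i) should concentrate near 0.1.
rng = random.Random(0)
h = lambda x: 1 if x >= 0.5 else 0
examples = [sample_example(rng) for _ in range(5000)]
z = misclassification_indicators(h, examples)
epsilon_hat = sum(z) / len(z)
```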
And therefore, by the Hoeffding inequality, we have that the probability that the difference between training and generalization error is greater than gamma is less than or equal to two E to the negative two gamma squared M. Okay? Exactly by the Hoeffding inequality. And what this proves is that, for my fixed hypothesis HJ, my training error epsilon hat will, with high probability, be close to the generalization error, assuming M is large: if M is large, then this thing on the right hand side will be small, because this is two E to the negative two gamma squared M. So this says that if my training set is large enough, then the probability that my training error is far from generalization error, meaning that it is more than gamma away, will be small, will be bounded by this thing on the right hand side. Okay? Now, here's the [inaudible] tricky part: what we've done is prove this bound for one fixed hypothesis, for HJ.
What I want to prove is that training error will be a good estimate for generalization error not just for this one hypothesis HJ, but actually for all K hypotheses in my hypothesis class script H. So let's do it; well, better do it on a new board. So in order to show that, let me define a random event. Let me define AJ to be the event that, you know, the difference between training and generalization error is more than gamma on the hypothesis HJ. And so what we put on the previous board was that the probability of AJ is less than or equal to two E to the negative two gamma squared M, and this is pretty small. Now, what I want to bound is the probability that there exists some hypothesis in my class script H such that I make a large error in my estimate of generalization error. Okay? Such that this holds true. So this is really just the probability that there exists a hypothesis for which this holds.
This is really the probability that A one, or A two, or up to AK, holds. The chance that there exists such a hypothesis is just, well, the probability that for hypothesis one I make a large error in estimating the generalization error, or for hypothesis two I make a large error in estimating generalization error, and so on. And so by the union bound, this is less than or equal to that, which is therefore less than or equal to that, which is equal to that. Okay? So let me just take one minus both sides of the equation on the previous board. Let me take one minus both sides, so the probability that there does not exist a hypothesis such that that holds, the probability that there does not exist a hypothesis on which I make a large error in this estimate, well, this is equal to the probability that, for all hypotheses, I make an error of at most gamma in my estimate of generalization error. And taking one minus on the right hand side, I get one minus two KE to the negative two gamma squared M. Okay?
And so, and the direction of the inequality flipped because I took one minus both sides; the minus sign flips the direction of the inequality. So what we've shown is that with probability, which I'll abbreviate to WP, with probability one minus two KE to the negative two gamma squared M, we have that epsilon hat of H will be within gamma of epsilon of H, simultaneously for all hypotheses in our class script H. And so, just to give this result a name, this is one instance of what's called a uniform convergence result, and the term uniform convergence sort of alludes to the fact that this shows that as M becomes large, these epsilon hats will all simultaneously converge to epsilon of H. That training error will become very close to generalization error simultaneously for all hypotheses H. That's what the term uniform refers to: the fact that this converges for all hypotheses H, and not just for one hypothesis.
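In code, the quantity just derived, the probability that uniform convergence fails, is a one-line formula. A sketch, with example values of my own choosing:

```python
import math

def uniform_convergence_failure_bound(k, gamma, m):
    """Union bound over k hypotheses:
    P(exists h in H with |eps_hat(h) - eps(h)| > gamma) <= 2k * exp(-2 gamma^2 m).
    So with probability at least 1 minus this value, training error is within
    gamma of generalization error simultaneously for all k hypotheses."""
    return 2.0 * k * math.exp(-2.0 * gamma ** 2 * m)

# Example: even with k = 1000 hypotheses, m = 5000 examples make the failure
# probability for gamma = 0.05 tiny.
failure = uniform_convergence_failure_bound(k=1000, gamma=0.05, m=5000)
```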
And so what we've shown is one example of a uniform convergence result. Okay? So let me clean a couple more boards. I'll come back and ask what questions you have about this. We should take another look at this and make sure it all makes sense. Yeah, okay. What questions do you have about this? Student: How is the value of gamma computed [inaudible]? Instructor (Andrew Ng): Right. Yeah. So let's see, the question is, how is the value of gamma computed? So for these purposes, for the purposes of this, gamma is a constant. Imagine gamma is some constant that we chose in advance, and this is a bound that holds true for any fixed value of gamma. Later on, as we take this bound and sort of develop this result further, we'll choose specific values of gamma as a [inaudible] of this bound. For now, we'll just imagine that when we prove this, it holds true for any value of gamma. Any questions? Yeah? Student: [Inaudible] hypothesis class is infinite [inaudible]? Instructor (Andrew Ng): Yes, if the hypothesis class is infinite, then this simple result won't work in its present form, but we'll generalize this, probably won't get to it today, but we'll generalize this at the beginning of the next lecture, to infinite hypothesis classes. Student: How do we use this theory [inaudible]?
Instructor (Andrew Ng): How do you use this theory? So let me, I might get to a little of that later today; we'll talk concretely about algorithms, about the consequences of the understanding of these things, in the next lecture as well. Yeah, okay? Cool. Can you just raise your hand if the things I've proved so far make sense? Okay. Cool. Great. Thanks. All right. Let me just take this uniform convergence bound and rewrite it in a couple of other forms. So this is sort of a bound on probability. This is saying: suppose I fix my training set, and then fix my error threshold gamma; what is the probability that uniform convergence holds? And, well, that's my formula that gives the answer. This is the probability of something happening. So there are actually three parameters of interest. One is, what is this probability? The other parameter is, what's the training set size M? And the third parameter is, what is the value of this error threshold gamma? I'm not gonna vary K for these purposes.
So there are two other equivalent forms of the bound. So you can ask: given gamma, so what we proved was, given gamma and given M, what is the probability of uniform convergence? The other equivalent form is: given gamma and the probability delta of making a large error, how large a training set size do you need in order to give a uniform convergence bound with parameters gamma and delta? And so, well, if you set delta to be two KE to the negative two gamma squared M, this is that form that I had on the left, and if you solve for M, what you find is that there's an equivalent form of this result that says that, so long as your training set size M is greater than this. And this is the formula that I get by solving for M. Okay?
So long as M is greater than or equal to this, then with probability, which I'm abbreviating to WP again, with probability at least one minus delta, we have this for all hypotheses. Okay? So this says how large a training set size I need to guarantee that, with probability at least one minus delta, the training error is within gamma of generalization error for all my hypotheses, and this gives the answer. And just to give this another name, this is an example of a sample complexity bound. So from undergrad computer science classes you may have heard of computational complexity, which is how much computation you need to achieve something. So sample complexity just means how large a training set, how large a sample of examples, you need in order to achieve a certain bound on the error. And it turns out that many of the theorems we write out can be posed in the form of a probability bound, or a sample complexity bound, or in some other form. I personally often find the sample complexity bounds the easiest to interpret, because they say how large a training set you need to give a certain bound on the errors.
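The sample complexity form can be written out directly by solving delta equals two K E to the negative two gamma squared M for M. A sketch; the function name below is mine:

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest m (as a real number) for which the union bound guarantees
    that, with probability at least 1 - delta, every one of the k hypotheses
    has training error within gamma of its generalization error:
        m >= (1 / (2 * gamma^2)) * log(2k / delta)."""
    return math.log(2.0 * k / delta) / (2.0 * gamma ** 2)

# Doubling the number of hypotheses adds only log(2) / (2 gamma^2) examples,
# illustrating the logarithmic dependence on k.
m_100 = sample_complexity(k=100, gamma=0.1, delta=0.05)
m_200 = sample_complexity(k=200, gamma=0.1, delta=0.05)
```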
And in fact, well, we'll see this later, sample complexity bounds often sort of help to give guidance: really, if you're trying to achieve something on a machine learning problem, this is trying to give guidance on how much training data you need to prove something. The one thing I want to note here is that M grows like the log of K, right? And the log of K grows extremely slowly as a function of K. The log is one of the slowest growing functions, right. It's one of, well, some of you may have heard this, right? That for all values of K, right, I learned this from a colleague, Andrew Moore, at Carnegie Mellon, that in computer science, for all practical purposes, for all values of K, log K is less than [inaudible]; this is almost true. So the log is one of the slowest growing functions, and so the fact that M, the sample complexity, grows like the log of K means that you can increase the number of hypotheses in your hypothesis class quite a lot, and the number of training examples you need won't grow very much. [Inaudible].
This property will be important later when we talk about infinite hypothesis classes. The final form is, I guess, sometimes called the error bound, which is what you get when you hold M and delta fixed and solve for gamma. And what you get then is that, with probability at least one minus delta, we have that, for all hypotheses in my hypothesis class, the difference between training and generalization error will be less than or equal to that. Okay? And that's just solving for gamma and plugging in the value I get there. Okay? All right. So the second step of the overall proof I want to execute is the following. The result so far is essentially that uniform convergence will hold true with high probability. What I want to show now is, let's assume that uniform convergence holds. So let's assume that for all hypotheses H, we have that epsilon of H minus epsilon hat of H is less than or equal to gamma in absolute value. Okay?
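The error-bound form from a moment ago, solving for gamma with M and delta held fixed, is equally short to write down. Again a sketch, with a function name of my own choosing:

```python
import math

def error_bound(k, m, delta):
    """Solve delta = 2k * exp(-2 * gamma^2 * m) for gamma: with probability
    at least 1 - delta, every hypothesis in a class of size k satisfies
        |eps_hat(h) - eps(h)| <= sqrt(log(2k / delta) / (2m))."""
    return math.sqrt(math.log(2.0 * k / delta) / (2.0 * m))

# Plugging the resulting gamma back into the failure probability recovers delta,
# and quadrupling m halves the bound (the one-over-root-m behavior from earlier).
gamma = error_bound(k=100, m=10_000, delta=0.05)
```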
What I want to do now is use this to see what we can prove about the bound, to see what we can prove about the generalization error. So I want to know: supposing this holds true, can we prove something about the generalization error of H hat, where again, H hat was the hypothesis selected by empirical risk minimization? Okay? So in order to show this, let me make one more definition. Let me define H star to be the hypothesis in my class script H that has the smallest generalization error. So this is: if I had an infinite amount of training data, or if I really could go in and find the best possible hypothesis, the best possible hypothesis in the sense of minimizing generalization error, what's the hypothesis I would get? Okay? So in some sense, it sort of makes sense to compare the performance of our learning algorithm to the performance of H star, because we clearly can't hope to do better than H star.
Another way of saying that is, if your hypothesis class is the class of all linear decision boundaries, and the data just can't be separated by any linear function, so even H star is really bad, then there's just not much hope that your learning algorithm could do better than H star.

So I'll actually prove this result in three steps. The generalization error of H hat, the hypothesis I chose, is going to be less than or equal to that. Actually, let me number these equations. This is because of equation one, because I see that epsilon of H hat and epsilon hat of H hat are within gamma of each other.

Now, by the definition of empirical risk minimization, H hat was chosen to minimize training error, and so there can't be any hypothesis with lower training error than H hat. So the training error of H hat must be less than or equal to the training error of H star. So this is by two, or by the definition of H hat as the hypothesis that minimizes training error.
And the final step is, I'm going to apply this uniform convergence result again. We know that epsilon hat of H star must be within gamma of epsilon of H star. And so this is at most plus gamma, and then I have my original gamma there. Okay? And so this is by equation one again, because I know the training error of H star must be within gamma of the generalization error of H star. And so, well, I'll just write this as plus two gamma. Okay? Yeah?

Student: [Inaudible] notation, is epsilon hat of H hat, that's not the training error, that's the generalization error with the estimate of the hypothesis?

Instructor (Andrew Ng): Oh, okay. Let me just, well, let me write that down on this board. So let me fit this in here. So epsilon hat of H is the training error of the hypothesis H.
In other words, given the hypothesis, and a hypothesis is just a function, right, mapping from Xs to Ys, so epsilon hat of H is: given the hypothesis H, what's the fraction of training examples it misclassifies? And the generalization error of H is: given the hypothesis H, if I sample another example from my distribution script D, what's the probability that H will misclassify that example? Does that make sense?

Student: [Inaudible]?

Instructor (Andrew Ng): Oh, okay. And H hat is the hypothesis that's chosen by empirical risk minimization. So when I talk about empirical risk minimization, it's the algorithm that minimizes training error, and so epsilon hat of H is the training error of H, and H hat is defined as the hypothesis that, out of all hypotheses in my class script H, minimizes the training error epsilon hat of H. Okay? All right. Yeah?

Student: [Inaudible] H hat is a member of script H, the [inaudible] family, right?

Instructor (Andrew Ng): Yes, it is.

Student: So what happens with the generalization error [inaudible]?

Instructor (Andrew Ng): I'll talk about that later.
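For reference, the three-step chain the board argument walks through, with H hat the ERM hypothesis and H star the best hypothesis in the class, can be written out as:

```latex
\varepsilon(\hat{h})
  \le \hat{\varepsilon}(\hat{h}) + \gamma  % (1) uniform convergence applied to \hat{h}
  \le \hat{\varepsilon}(h^{*}) + \gamma    % (2) \hat{h} minimizes training error
  \le \varepsilon(h^{*}) + 2\gamma         % (1) uniform convergence applied to h^{*}
```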
So let me tie all these things together into a theorem. Let there be a hypothesis class given with a finite set of K hypotheses, and let any M and delta be fixed. Then, so I fix M and delta, so this will be the error bound form of the theorem, right? Then, with probability at least one minus delta, we have that the generalization error of H hat is less than or equal to the minimum over all hypotheses in set H of epsilon of H, plus two times that. Okay?

So to prove this, well, this term of course is just epsilon of H star. And so to prove this we set gamma equal to that, this is two times the square root term. To prove this theorem we set gamma equal to that square root term. Say that again?

Student: [Inaudible].

Instructor (Andrew Ng): Wait. Say that again?

Student: [Inaudible].

Instructor (Andrew Ng): Oh, yes. Thank you. That didn't make sense at all. Thanks. Great.
So set gamma to that square root term, and so we know equation one, right, from the previous board, holds with probability at least one minus delta. Right, equation one was the uniform convergence result, i.e., this is equation one from the previous board. So with gamma set equal to this, we know that with probability at least one minus delta this uniform convergence holds, and whenever that holds, that implies, if we call this equation star, I guess, and whenever uniform convergence holds, we showed again on the previous boards that this result holds: that the generalization error of H hat is less than or equal to the generalization error of H star plus two times gamma. Okay? And so that proves this theorem.

So this result sort of helps us to quantify a little bit that bias-variance tradeoff that I talked about near the very start of this lecture.
And in particular, let's say I have some hypothesis class script H that I'm using, maybe the class of all linear functions, in linear regression, or logistic regression with just the linear features. And let's say I'm considering switching to some new class H prime by adding more features. So let's say this is linear and this is quadratic, so the class of all linear functions is a subset of the class of all quadratic functions, and so H is a subset of H prime. And let's say I'm considering, instead of using my linear hypothesis class, switching to a quadratic hypothesis class, or switching to a larger hypothesis class.

Then what are the tradeoffs involved? Well, I proved this only for finite hypothesis classes, but we'll see that something very similar holds for infinite hypothesis classes too. But the tradeoff is, what if I switch from H to H prime, or I switch from linear to quadratic functions? Then epsilon of H star will become better, because the best hypothesis in my hypothesis class will become better.
The best quadratic function, and by best I mean in the sense of generalization error, the quadratic function with the lowest generalization error, has to have equal or, more likely, lower generalization error than the best linear function. So by switching to a more complex hypothesis class you can get this first term to go down.

But what I pay for then is that K will increase. By switching to a larger hypothesis class, the first term will go down, but the second term will increase, because I now have a larger class of hypotheses, and so the second term, K, will increase. And so this is usually called the bias-variance tradeoff, whereby going to a larger hypothesis class, maybe I have the hope of finding a better function, but my risk of not fitting my model so accurately also increases, and that's illustrated by the second term going up when the size of your hypothesis class, when K, goes up.
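To put rough numbers on that second term, here is a small sketch, not from the lecture, assuming the square root form of gamma from earlier, so that the penalty in the theorem is 2 sqrt((1/(2M)) log(2K/delta)). The point is that the penalty grows with K, but only logarithmically:

```python
import math

def penalty(m, k, delta):
    """The 2*gamma term in eps(h_hat) <= eps(h_star) + 2*gamma,
    with gamma = sqrt(log(2k / delta) / (2m)). Growing the hypothesis
    class (larger k) inflates this term, but only logarithmically."""
    return 2 * math.sqrt(math.log(2 * k / delta) / (2 * m))

m, delta = 10_000, 0.05
for k in (10, 1_000, 100_000):
    print(f"k = {k:6d}: penalty = {penalty(m, k, delta):.4f}")
```

So multiplying the class size by ten thousand here costs well under a doubling of the penalty term, which is the tradeoff against the (data-dependent, and not computable here) improvement in epsilon of H star.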
And so, speaking very loosely, we can think of this first term as corresponding maybe to the bias of the learning algorithm, or the bias of the hypothesis class. And you can, again speaking very loosely, think of the second term as corresponding to the variance in your hypothesis, in other words how well you can actually fit the hypothesis class to the data. And by switching to a more complex hypothesis class, your variance increases and your bias decreases.

As a note of warning, it turns out that if you take, say, a statistics class, you've seen definitions of bias and variance, which are often defined in terms of squared error or something. It turns out that for classification problems, there actually is no universally accepted formal definition of bias and variance. For regression problems, there is this squared-error definition. For classification problems, it turns out there have been several competing proposals for definitions of bias and variance.
So when I say bias and variance here, think of these as very loose, informal, intuitive definitions, and not formal definitions. Okay.

The cartoon associated with the intuition I just described would be as follows. Let's say, and everything about the plot will be for a fixed value of M, for a fixed training set size M: on the vertical axis I'll plot error, and on the horizontal axis I'll plot model complexity. And by model complexity I mean, sort of, degree of polynomial, size of your hypothesis class script H, etc. It actually turns out, you remember the bandwidth parameter from locally weighted linear regression? That also has a similar effect in controlling how complex your model is. Model complexity [inaudible] degree of polynomial, I guess. So the more complex your model, the better your training error, and so your training error will tend toward zero as you increase the complexity of your model, because the more complex your model, the better you can fit your training set.
But because of this bias-variance tradeoff, you find that generalization error will come down for a while and then it will go back up. And this regime on the left is when you're underfitting the data, or when you have high bias. And this regime on the right is when you have high variance, or you're overfitting the data. Okay? And this is why a model of sort of intermediate complexity, somewhere here, is often preferable if your goal is to minimize generalization error. Okay?

So that's just a cartoon. In the next lecture we'll actually talk about a number of algorithms for trying to automatically select model complexity, say to get you as close as possible to this minimum, to this area of minimized generalization error.
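A quick way to see the cartoon numerically, a toy sketch, not from the lecture, with hypothetical data I made up: fit polynomials of increasing degree to noisy samples of a cubic, and compare error on the training half against error on a held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a cubic signal plus noise, split into train/validation.
x = rng.uniform(-1, 1, size=60)
y = 3 * x**3 + 0.1 * rng.standard_normal(60)
x_tr, y_tr = x[:30], y[:30]
x_va, y_va = x[30:], y[30:]

def errors(degree):
    """Least-squares polynomial fit on the training half; returns
    (training MSE, validation MSE)."""
    p = np.poly1d(np.polyfit(x_tr, y_tr, degree))
    return np.mean((p(x_tr) - y_tr) ** 2), np.mean((p(x_va) - y_va) ** 2)

for degree in (1, 3, 9):
    tr, va = errors(degree)
    print(f"degree {degree}: train MSE {tr:.4f}  validation MSE {va:.4f}")
```

Training error only goes down as the degree grows, since the polynomial classes are nested, while validation error is high at degree 1 (underfitting, high bias) and typically creeps back up at high degree (overfitting, high variance), tracing out the U-shaped cartoon.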
The last thing I want to do is actually go back to the theorem I wrote out. So the theorem I wrote out was an error bound theorem: this says, for fixed M and delta, with probability one minus delta, I get a bound on gamma, which is what this term is. So the very last thing I wanna do today is just come back to this theorem and write out a corollary where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta, and solve for M.

And if you do that, you get the following corollary. Let H be fixed with K hypotheses, and let any delta and gamma be fixed. Then, in order to guarantee that, let's say I want a guarantee that the generalization error of the hypothesis I choose with empirical risk minimization is at most two times gamma worse than the best possible error I could obtain with this hypothesis class.
Let's say I want this to hold true with probability at least one minus delta. Then it suffices that M is greater than or equal to that. Okay? And this is sort of solving the error bound for M. One thing to convince yourselves of, the easy part of this, is that if you set that term equal to gamma and solve for M, you will get this. One thing I want you to go home and sort of convince yourselves of is that this result really holds true, that it really logically follows from the theorem we've proved. In other words, you can take that formula we wrote and solve for M, and because this is the formula you get for M, that's just, that's the easy part. Then go back and convince yourselves that this theorem is a true fact and that it does indeed logically follow from the other one. In particular, make sure that if you solve for that you really get M greater than or equal to this, and why is it M greater than or equal to that and not M less than or equal to. I can write this down and it sounds plausible, so why don't you just go back and convince yourself it's really true. Okay?
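Solving the error bound for M as described gives the sample-complexity form. As a sketch, again assuming the same square root term as before, so that the corollary reads M greater than or equal to (1/(2 gamma squared)) log(2K/delta):

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest training set size m such that, with probability >= 1 - delta,
    the ERM hypothesis is within 2*gamma of the best of k hypotheses:
    m >= log(2k / delta) / (2 * gamma**2)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

# The required m grows only logarithmically in the number of hypotheses k.
for k in (10, 10_000):
    print(f"k = {k:5d}: m >= {sample_complexity(k, gamma=0.05, delta=0.01)}")
```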
And it turns out that when we prove these bounds in learning theory, very often the constants are sort of loose. So it turns out that when we prove these bounds, usually we're not very interested in the constants, and so I write this as big O of one over gamma squared, log K over delta. And again, the key step in this is that the dependence of M on the size of the hypothesis class is logarithmic. And this will be very important later when we talk about infinite hypothesis classes. Okay? Any questions about this? No? Okay, cool.

So next lecture we'll come back, and we'll actually start from this result again. Remember this; I'll write this down as the first thing I do in the next lecture, and we'll generalize these to infinite hypothesis classes and then talk about practical algorithms for model selection. So I'll see you guys in a couple of days.