This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of the bias-variance tradeoff, of underfitting and overfitting, that we saw in the previous lecture and will see again in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time there is, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class, if H is a set of k hypotheses, and suppose you have some fixed parameters gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation, just dropping constants, I can also write this as that; okay?

So just to quickly remind you of what all the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class script H. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h misclassifying a new example drawn from the same distribution as the training set.
And so this says: in order to guarantee that the generalization error of the hypothesis h-hat output by empirical risk minimization is less than or equal to the best possible generalization error achieved in your hypothesis class plus two times gamma, two times this error threshold, and to guarantee that this holds with probability at least one minus delta, we showed that it suffices for your training set size m to be greater than or equal to this: one over two gamma squared, times log of two k over delta, where again k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on the error; okay?
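Just so the recap is written down in one place, here is the board result in symbols, transcribing what was just said, with epsilon denoting generalization error and h-hat the ERM output:

```latex
% Finite hypothesis class with |H| = k, fixed gamma > 0 and delta > 0.
% With probability at least 1 - delta, the ERM output \hat{h} satisfies
\varepsilon(\hat{h}) \;\le\; \Bigl(\min_{h \in \mathcal{H}} \varepsilon(h)\Bigr) + 2\gamma
\quad\text{whenever}\quad
m \;\ge\; \frac{1}{2\gamma^{2}} \log\frac{2k}{\delta}
\;=\; O\!\Bigl(\frac{1}{\gamma^{2}} \log\frac{k}{\delta}\Bigr).
```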
So what I want to do now is take this result and try to generalize it to the case of infinite hypothesis classes. Here, we said that the set script H is just k specific functions, but you often want to use a model like logistic regression, which is actually parameterized by real numbers. So I'm first going to give an argument that's sort of formally broken, technically somewhat broken, but that conveys useful intuition; and then I'll give the more correct argument, but without proving it, because the full proof is somewhat involved.

So here's the somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression. Say your hypothesis class is the class of all linear decision boundaries; right? So script H is parameterized by d real numbers; okay? For example, if you're applying logistic regression over R^n, then d would be n plus one: logistic regression finds a linear decision boundary parameterized by n plus one real numbers.

When you think about how your hypothesis class is really represented in a computer: computers use zero-one bits to represent real numbers. And so if you use a normal, standard computer, it will usually represent real numbers by what's called double-precision floating-point numbers. And what that means is that each real number is represented by a 64-bit representation; right? You know what floating point is in a computer; 64-bit floating point is what almost all of us use routinely. And so this class, parameterized by d real numbers, is really, as represented in your computer, parameterized by 64 times d bits. Computers can't represent real numbers; they can only represent discrete things. And so for the size of your hypothesis class, in your computer representation you have 64 times d bits that you can flip, and so the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits.

And so this is why it's important that we had log k there; right? So k is therefore two to the 64 d. And if I plug that into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared times log of two to the 64 d over delta, which is that; okay?

So just to be clear, this is the same kind of sample complexity result as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this.
In other words, your number of training examples has to be on the order of d over gamma squared, times log one over delta; okay? And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class: m has to be on the order of something linear in d. That intuition is actually roughly right; I'll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating-point numbers.

So let me go ahead and outline the right way to show this more formally; all right? It turns out the right way to show this more formally involves a much longer argument; the proof is extremely involved, so I'm just going to state the result and not prove it. The full proof is the sort of thing you'd see in a course on learning theory bounds for infinite hypothesis classes.

Here's a definition. Given a set of d points, we say a hypothesis class H shatters the set S if H can realize any labeling on it; okay? And what I mean by realizing any labeling on it, the informal way of thinking about this, is: if a hypothesis class shatters the set S, that means I can take these d points and associate them with any set of labels y; right? So choose any labeling y for each of these d points. And if your hypothesis class shatters S, then that means there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means.

So let me just illustrate this with an example. Let's say H is the class of all linear classifiers in two dimensions, and let's say that S is this set comprising two points; okay?
So there are four possible labelings of these two points. You can choose to label both positive; one positive and one negative; one negative and one positive; or you can label both of them negative. And if the hypothesis class H is the class of all linear classifiers in two dimensions, then for each of these labelings I can find a linear classifier that attains zero training error, on every possible labeling of this set of two points. And so I'll say that the hypothesis class script H shatters this set S of two points; okay?

One more example, to show you a larger one. Suppose my set S is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And once again, for each of these labelings I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again I can say, by definition, that my hypothesis class also shatters this set S. And that's what the terminology means: H can realize any labeling on S. Give it any set of labels and you can find a hypothesis that perfectly separates the positive and negative examples; okay?

So how about this set? Suppose S is now this set of four points. Then there are now 16 labelings we can choose for it; right? That's one, for instance, and this is another one; right? And so I can realize some labelings, but there's no linear decision boundary that can realize this labeling, and so H does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay?
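If it helps to make shattering concrete, here is a small sketch, not from the lecture, that brute-forces the definition in Python: for every labeling of a point set it checks whether some linear classifier realizes it, using a linear-program feasibility test. It assumes numpy and scipy are available.

```python
# Sketch: does the class of linear classifiers shatter a given point set?
# A labeling is realizable iff some (w, b) satisfies y_i * (w . x_i + b) >= 1,
# which for a finite point set is equivalent to strict linear separability.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)          # labels in {-1, +1}
    n, d = X.shape
    # Variables z = [w_1, ..., w_d, b]; constraints -y_i*(x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shatters(points):
    """True if linear classifiers realize every +/-1 labeling of the points."""
    n = len(points)
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=n))

# Three non-collinear points in the plane are shattered; these four are not.
print(shatters([(0, 0), (1, 0), (0, 1)]))            # True
print(shatters([(0, 0), (1, 0), (0, 1), (1, 1)]))    # False (the XOR labeling)
```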
So here's another definition. It's called the VC dimension, after these two people, Vapnik and Chervonenkis. Given a hypothesis class, the Vapnik-Chervonenkis dimension of H, which we usually write as VC(H), is the size of the largest set that is shattered by H. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.

So just as a good example: if H is the class of all linear classifiers in two dimensions, then the VC dimension of H is equal to three, because we saw just now that there was a set S of size three that it could shatter, and, although I won't prove it, it turns out there is no set of size four that it can shatter. And therefore the VC dimension of this class is three. Yeah?

Student: But there are sets of size three that it cannot shatter, right? [Inaudible] was your point.

Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this, that's also a set S, then there are labelings on it that H cannot realize, and so H cannot shatter this set. But that's okay, because there definitely exists some other set of size three that is shattered. So the VC dimension is three. And there is no set of size four that can be shattered. Yeah?

Student: [Inaudible].

Instructor (Andrew Ng): Not according to this definition, no. Right. So again, I can choose my set S to be a set of three points that are all overlapping, three points in exactly the same place. And clearly I can't shatter this set, but that's okay. And I can't shatter this set either, but that's okay, because there are some other sets of size three that I can shatter.

And it turns out this result holds more generally: in any number of dimensions, the VC dimension of the class of linear classifiers in R^n is equal to n plus one.
Okay? So this is in R^n: if you have linear classifiers over an n-dimensional feature space, the VC dimension of that class is equal to n plus one.

So maybe you want to write down what is arguably the best-known result in all of learning theory, I guess, which is this. Let a hypothesis class H be given, and let the VC dimension of H be equal to d. Then, with probability one minus delta, we have that... The formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later. The key to this result is that if you have a hypothesis class with VC dimension d, and now this can be an infinite hypothesis class, what Vapnik and Chervonenkis showed is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that for all hypotheses h, for all the hypotheses in your hypothesis class, the generalization error of h minus the training error of h, the difference between those two things, is bounded above by some complicated formula like this; okay? And thus, with probability one minus delta, we also have the same kind of thing for the ERM hypothesis; okay?

And going from this step to this step is actually something you saw yourself, that we actually proved earlier. Because you remember, in the previous lecture we proved that if you have uniform convergence, then that implies, well, we showed that if generalization error and training error are within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really: the generalization error of h-hat is at most the best possible generalization error plus two times gamma.
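The complicated formula on the board isn't spelled out in the audio; in its usual statement, the Vapnik-Chervonenkis result looks roughly like the following, with epsilon the generalization error, epsilon-hat the training error, and h-star the best hypothesis in the class. Take this as a reconstruction of the standard statement, not a transcription:

```latex
% H with VC(H) = d, trained on m i.i.d. examples.
% With probability at least 1 - \delta, for all h \in H:
\bigl|\varepsilon(h) - \hat{\varepsilon}(h)\bigr|
  \;\le\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right),
% and consequently, for the ERM output \hat{h}:
\qquad
\varepsilon(\hat{h})
  \;\le\; \varepsilon(h^{*}) + O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right).
```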
There are just some constants in front here that I've absorbed into the big-O notation. So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that in order to guarantee that this holds with probability at least one minus delta, it suffices that, and I'm going to write it this way, m equals big-O of d, and I'm going to put gamma and delta in as a subscript there to denote that. If we treat gamma and delta as constants, so that I'm allowed to absorb terms that depend on gamma and delta into the big-O notation, then in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay?

So what we conclude from this is that, for empirical risk minimization algorithms, or less formally, for learning algorithms that try to minimize training error, the intuition to take away is that the number of training examples you need is roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay?

It turns out that for most reasonable hypothesis classes, the VC dimension is very similar to the number of parameters of your model. So, for example, take logistic regression for linear classification: logistic regression in n dimensions has n plus one parameters, and the VC dimension of the class of linear classifiers in n dimensions is also n plus one. So for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model, or in most cases at most some low-order polynomial in the number of parameters of your model.
And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be roughly linear in the number of parameters in your model; okay? There are some somewhat strange examples where what I just said is not true, examples where you have very few parameters but the VC dimension is enormous. But all of the examples I know of that fall into that regime are somewhat strange and degenerate, somewhat unusual, and not the sort of learning algorithms you usually use.

Let's see, a few other things. This result shows that sample complexity is upper bounded by the VC dimension. It also turns out that, in the worst case, sample complexity is lower bounded by the VC dimension. And what that means is that if you have a perfectly nasty learning problem, and the number of training examples you have is less than on the order of the VC dimension, then it is not possible to prove a bound like this. So I guess in the worst case, the sample complexity, the number of training examples you need, is both upper bounded and lower bounded by the VC dimension.

Let's see, questions about this?

Student: Does the proof of this assume any sort of finiteness, like a finite [inaudible], like you have to discretize the real numbers, or [inaudible]?

Instructor (Andrew Ng): Let's see. The proof does not, no. I've actually stated the entirety of the theorem; this is true as stated. It turns out that somewhere in the proof there's a construction called an epsilon net, which is a very clever [inaudible]. But that's part of the proof; it is not an assumption that you need. I'd like to say, back when I was a Ph.D.
student, when I was working through this proof, there was a solid week where I would wake up and go to the office at 9:00 a.m., then start reading the book that led up to this proof, and I'd read from 9:00 a.m. to 6:00 p.m. Then I'd go home, and the next day I'd pick up where I left off. It took me about a whole week that way to understand this proof, so I decided not to inflict that on you.

Just to tie up a couple of loose ends: what I'm about to do is mention a few things that may feel a little bit like random facts, but I'm just going to tie up a couple of loose ends. So, this bound was proved for an algorithm that uses empirical risk minimization, for an algorithm that minimizes 0-1 training error. So one question that some of you have asked is: how about support vector machines; right? How come SVMs don't overfit? Remember, in our discussion of support vector machines, we said that you use kernels and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: it's n plus one, and n is infinite.

It turns out that the class of linear separators with large margin actually has low VC dimension. I want to say this very quickly and informally; it's actually not very important for you to understand the details, but I'm going to say it very informally. Suppose I give you a set of points, and I ask you to consider only the class of lines that separate these points with a large margin. So my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin, say with a margin of at least gamma; okay? And so I won't allow a decision boundary that comes closer than that; like, I won't allow that line because it comes too close to one of my points.
It turns out that if my data points all lie within some sphere of radius R, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to R squared over four gamma squared, rounded up, plus one; okay? This funny symbol here just means rounding up; it's the ceiling symbol.

And it turns out, and there are some subtle things about this result that I'm deliberately not going to talk about, that you can prove that the VC dimension of the class of linear classifiers with large margin is actually bounded. The surprising thing is that this is a bound on the VC dimension that has no dependence on the dimension of the points x. So in other words, your data points x can live in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large margin separator, in trying to find the line that separates your positive and negative examples with a large margin, the support vector machine is automatically trying to find a hypothesis class with small VC dimension, and therefore it does not overfit. Alex?

Student: What is the [inaudible]?

Instructor (Andrew Ng): It's actually defined the same way as in finite-dimensional spaces. So suppose you have, say, countably infinite-dimensional vectors. Normally the norm of x squared is equal to the sum over i of x_i squared, and if x is infinite dimensional, you just define it like that. [Inaudible]. [Crosstalk]

Student: [Inaudible].

Instructor (Andrew Ng): Now, say that again.
Instructor (Andrew Ng): Yes. Although I'm assuming that this is bounded by R.

Student: Oh.

Instructor (Andrew Ng): Yeah, so that ensures that the sum converges. So that's just something people sometimes wonder about.

And lastly, let me tie empirical risk minimization back a little more strongly to the sort of algorithms we've talked about. The theory so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function that is zero here, jumps to one, and looks like that. For this one training example, this may be a plot of the indicator of h of x not equal to y, as a function of z equals theta transpose x; okay? Your training example will be positive or negative, and depending on what the value of theta transpose x is, you either get it right or wrong. So if you have a positive example, then when z is positive you get it right. Suppose you have a negative example, so y equals 0; right? Then if z, which is theta transpose x, is positive, you will get this example wrong; whereas if z is negative, you'd get this example right. And so this is a plot of the indicator of h subscript theta of x not equal to y, where h subscript theta of x is g of theta transpose x; okay?

And so what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification of your training example, and so you'd like this indicator of h of x not equal to y, this indicator function, to be zero. It turns out this step function is clearly a non-convex function.
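Written out, the step function being described is just this, for a classifier h_theta(x) = g(theta transpose x) that predicts one exactly when z = theta transpose x is nonnegative:

```latex
% 0-1 loss on a single example, as a function of z = \theta^{T} x:
\mathbf{1}\{h_{\theta}(x) \ne y\} \;=\;
\begin{cases}
\mathbf{1}\{z < 0\}, & y = 1 \quad (\text{wrong exactly when } z < 0),\\[2pt]
\mathbf{1}\{z \ge 0\}, & y = 0 \quad (\text{wrong exactly when } z \ge 0).
\end{cases}
```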
And so it turns out that, even for linear classifiers, minimizing the 0-1 training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation to this problem. In particular, let me draw a function like that: logistic regression is trying to maximize likelihood, and so it's trying to minimize the negative of the log likelihood. And if you plot the negative of the log likelihood, it turns out to be a function that looks like this. This curve I just drew, you can think of it as a rough approximation to the step function, which is maybe what you're really trying to minimize, because you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization, where instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve above it. By approximating it that way, you get a convex optimization problem, and you can find the maximum likelihood estimate of the parameters for logistic regression.

And it turns out the support vector machine can also be viewed as approximating this step function, with a slightly different approximation: it's flat, zero, over here, and then it goes up as a linear function there. That is called the hinge loss. And so you can think of logistic regression and the support vector machine as different convex approximations for trying to minimize this step function; okay? And that's why, I guess, even though SVMs and logistic regression aren't exactly doing empirical risk minimization, the theory we developed often gives completely appropriate intuitions for SVMs and logistic regression; okay.
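Here is a small sketch, not from the lecture, that computes the three losses being compared, the 0-1 step loss, the logistic loss, and the hinge loss, as functions of the margin. For convenience it uses labels in {-1, +1} rather than the {0, 1} convention used above, so the margin on one example is y times theta transpose x:

```python
# Sketch: the convex surrogates that logistic regression and the SVM
# effectively minimize, compared with the non-convex 0-1 "step" loss.
import numpy as np

def zero_one_loss(margin):
    """Step function: 1 if the example is misclassified, else 0."""
    return (margin <= 0).astype(float)

def logistic_loss(margin):
    """Negative log-likelihood for logistic regression: log(1 + e^{-m})."""
    return np.log1p(np.exp(-margin))

def hinge_loss(margin):
    """SVM hinge loss: zero once the margin exceeds 1, linear below that."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-3.0, 3.0, 7)
for name, loss in [("0-1", zero_one_loss),
                   ("logistic", logistic_loss),
                   ("hinge", hinge_loss)]:
    print(name, np.round(loss(margins), 3))
```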
So that was the last of the loose ends. And if you didn't get this, don't worry too much about it. It's a high-level message; it's just that SVMs and logistic regression are reasonable to think of as approximations to empirical risk minimization algorithms.

What I want to do next is move on to talk about model selection. Before I do that, let me just check for questions about this. Okay. Cool.

So in the theory that we started to develop in the previous lecture, and that we sort of wrapped up with a discussion of VC dimension, we saw that there's often a trade-off between bias and variance. In particular, it is important not to choose a hypothesis that's either too simple or too complex. If your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit, so you have a hypothesis with high bias. And conversely, if you choose a hypothesis that's too complex, you have high variance; you would overfit the data, and you'd fail to generalize well.

So model selection algorithms provide a class of methods to automatically make these tradeoffs between bias and variance; right? Remember the cartoon I drew last time of generalization error? On the x-axis was model complexity, meaning, say, the degree of the polynomial, or the complexity of the regression function, or whatever. If you have too simple a model, you have high generalization error: that's underfitting. And if you have too complex a model, like a 14th or 15th-degree polynomial fit to five data points, then you also have high generalization error, and you're overfitting.

So what I want to do now is just talk about model selection in the abstract; all right? Some examples of model selection problems include, well, I'll run with the example of, let's say, trying to choose the degree of a polynomial; right? What degree polynomial do you want to choose?
Another example of a model selection problem would be trying to choose the parameter tau, which was the bandwidth parameter in locally weighted linear regression, or in some other sort of locally weighted regression. Yet another model selection problem is trying to choose the parameter C in the SVM; right? In the soft margin SVM we had this optimization objective, and the parameter C controls the tradeoff between how much you want, say, a large margin versus how much you want to penalize misclassified examples. So these are three specific examples of model selection problems, and let's come up with a method for automatically choosing among them.

Let's say you have some finite set of models, and let's write these as M1, M2, M3, and so on. For example, this may be the linear classifier, or this may be the quadratic classifier, and so on; okay? Or you may also take the bandwidth parameter tau and discretize it into a range of values, and you're trying to choose from this discrete set of values.

So let's talk about how you would select an appropriate model; all right? Well, one thing you could do is take all of these models, train them on your training set, and then see which model has the lowest training error. So that's a terrible idea, and why is that?

Student: [Inaudible].

Instructor (Andrew Ng): Right. Cool. Because of overfitting; right. And some of you are laughing that I even asked that. It would be a terrible idea to choose a model by looking at your training error because, obviously, you'd end up choosing the most complex model; you'd choose a 10th-degree polynomial because that's what fits the training set best. So for model selection, given a training set, there are several standard procedures to do this.
One is hold-out cross validation. In hold-out cross validation, we take a training set and randomly split it into two subsets: take all the data you have and randomly split it into two subsets, and we'll call them the training subset and the hold-out cross validation subset. Then you train each model on just the training subset, and test it on your hold-out cross validation set, and you pick the model with the lowest error on the hold-out cross validation subset; okay? So this is a relatively straightforward procedure, and it's commonly used: you train on, say, 70 percent of the data, then test all of your models on the other 30 percent, and you take whichever has the smallest hold-out cross validation error.

And after this you actually have a choice. Having trained all of these hypotheses on 70 percent of the data, you can just output the hypothesis that has the lowest error on your hold-out cross validation set. Or, optionally, you can take the model that you selected and go back and retrain it on all 100 percent of the data; okay? Both versions are done and used really often. You can either just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or you can say, having chosen the degree of the polynomial you want to fit, go back and retrain the model on the entire 100 percent of your data. Both of these are commonly done.
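As a concrete sketch, not from the lecture, here is what simple hold-out cross validation might look like in Python for choosing the degree of a polynomial; the 70/30 split, the synthetic data, and the candidate degrees are all illustrative assumptions:

```python
# Sketch: hold-out cross validation for choosing a polynomial degree.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: a quadratic signal plus noise.
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.shape)

# Randomly split into 70% training and 30% hold-out cross validation sets.
perm = rng.permutation(len(x))
split = int(0.7 * len(x))
train, hold = perm[:split], perm[split:]

def cv_error(degree):
    """Train on the 70% split, return mean squared error on the 30% split."""
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    pred = np.polyval(coeffs, x[hold])
    return float(np.mean((pred - y[hold]) ** 2))

degrees = range(1, 11)
errors = {d: cv_error(d) for d in degrees}
best = min(errors, key=errors.get)
print({d: round(e, 3) for d, e in errors.items()})
print("selected degree:", best)

# Optionally retrain the selected model on 100% of the data, as discussed.
final_coeffs = np.polyfit(x, y, deg=best)
```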
So hold-out cross validation works in a fairly straightforward way. But sometimes we're working with a company or an application where, as in many machine learning applications, we have very little data, or where every training example you have was painfully acquired at great cost; right? Sometimes your data is acquired from medical experiments, and each training example represents a patient and some amount of physical human pain or something. So if you say, well, I'm going to hold out 30 percent of your data set just to select my model, that sometimes causes unhappiness, and maybe you don't want to have to leave out 30 percent of your data just to do model selection.

So there are a couple of other variations on hold-out cross validation that make somewhat more efficient use of the data. One is called k-fold cross validation. Here's the idea: I'm going to take all of my data S; so imagine I draw this box S to denote the entirety of all the data I have. I'll then divide it into k pieces; it's five pieces in what I've drawn. Then what I'll do is repeatedly train on k minus one pieces and test on the remaining piece, and then average over the k results.

In other words: I'll hold out, say, just one fifth of my data, train on the remaining four fifths, and test on the first piece. Then I'll hold out the second fifth, train on the remaining pieces, and test on that; then remove the third piece, train on the other four fifths, and so on; I'm going to do this five times. Then I'll take the five error measures I have and average them, and this gives me an estimate of the generalization error of my model; okay? And then again, when you do k-fold cross validation, usually you go back and retrain the model you selected on the entirety of your training set.

So I drew five pieces here because that was easier for me to draw, but k equals 10 is very common; okay? I should say, k equals 10 is the fairly common choice, doing 10-fold cross validation. And the advantage of this over hold-out cross validation is that you split the data into ten pieces.
The advantage of this over simple hold-out cross validation is that, having split the data into ten pieces, each time you're only holding out 1/10 of your data, rather than, say, 30 percent of your data. I should say, in standard, simple hold-out cross validation, a 70/30 split is fairly common, or sometimes a 2/3 versus 1/3 split. And if you use k-fold cross validation, k equals 5, or more commonly k equals 10, is the usual choice. The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model, you now need to train your model ten times instead of just once; so you run logistic regression ten times per model, rather than just once, and this is computationally more expensive. But k equals ten works great. And then, finally, there's a version of this that you can take even further, which is when you set k equal to m. That's when you take your training set and split it into as many pieces as you have training examples. This procedure is called leave-one-out cross validation. What you do is take out the first training example, train on the rest, and test on the first example. Then you take out the second training example, train on the rest, and test on the second example. Then you take out the third example, train on everything but the third example, and test on the third example, and so on. With this many pieces, you are making maybe even more effective use of your data than k-fold cross validation. But leave-one-out cross validation is computationally very expensive, because now you need to repeatedly leave one example out and run your learning algorithm on m minus one training examples; you need to do this m times, and so this is computationally very expensive. Typically, this is done only when you're extremely data scarce. So if you have a learning problem where you have, say, 15 training examples or something, so very few training examples, then leave-one-out cross validation is maybe preferred.
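Leave-one-out cross validation is just the k = m special case; here is a small sketch with a deliberately tiny, made-up data set (again assuming scikit-learn, and not the lecture's own code):

```python
# Leave-one-out cross validation: k-fold with k = m, the number of examples.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
y = np.tile([0, 1], 8)                           # deliberately tiny: 16 examples
X = rng.standard_normal((16, 3)) + y[:, None]    # class-dependent placeholder features

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(float(model.predict(X[test_idx])[0] != y[test_idx][0]))

# m training runs for a single candidate model, which is why this is usually
# reserved for very data-scarce problems.
loocv_error = np.mean(errors)
```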
Yeah?

Student: Earlier you proved a bound relating the generalization error to the number of examples in your training set and the VC dimension. So maybe by [inaudible] the examples into different groups, we could use that for [inaudible], compute the training error, and use that for estimating the generalization error.

Instructor (Andrew Ng): Yeah, that's done, but in practice I personally tend not to do that. The VC dimension bounds are somewhat loose bounds, and there are people in structural risk minimization who propose what you describe, but I personally tend not to do that. Other questions on cross validation? Yeah.

Student: This is kind of a tangent, but since we spent all this time [inaudible], how many data points do you need to get a certain margin of [inaudible]?

Instructor (Andrew Ng): Right.

Student: So it seems like I'd be able to use that instead of doing this; more analytically, I guess. [Inaudible].

Instructor (Andrew Ng): Okay. So it turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose, because you're proving a worst-case upper bound that holds true even for very bad distributions. The bounds I proved just now hold true for absolutely any probability distribution over training examples; right? We just assume the training examples are drawn iid from some distribution script D, and the bounds I proved hold true for absolutely any such distribution script D.
And chances are, whatever real-life distribution you get over, you know, houses and their prices or whatever, is probably not as bad as the very worst one you could have gotten; okay? And so it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers. Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability; how many training examples do you need? If you actually plug the actual constants into these learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need. You end up with some ridiculously large number, like needing 10,000 training examples to fit ten parameters. So a good way to think of these learning theory bounds (and this is also why, when I write papers on learning theory bounds, I quite often use big-O notation and just ignore the constant factors) is that the bounds tend to be very loose. There are some attempts to use these bounds to give guidelines as to what model to choose, and so on. But I personally tend to use the bounds for intuition: for example, whether the number of training examples you need grows linearly in the number of parameters, or grows linearly in the VC dimension, or grows quadratically in the number of parameters. So it's quite often the shape of the bounds that matters. The fact that the sample complexity is linear in the VC dimension, that's a useful intuition you can get from these theorems. But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you are working on. So did that answer your question?

Student: Uh-huh.

Instructor (Andrew Ng): Yeah.
And it turns out, by the way, for myself, a rule of thumb that I often use is: if you're trying to fit a logistic regression model with n parameters, or n plus one parameters, then if the number of training examples is ten times your number of parameters, you're probably in good shape; and if your number of training examples is many times the number of parameters, then you're probably perfectly fine fitting that model. So those are the sorts of intuitions you can get from these bounds.

Student: In cross validation, do we assume the examples are split randomly?

Instructor (Andrew Ng): Yes. So by convention, we usually split into the train and test sets randomly.

One more thing I want to talk about for model selection: there's actually a special case of model selection called the feature selection problem. And here's the intuition: for many machine learning problems, you may have a very high-dimensional feature space, with very high-dimensional feature vectors x. So for example, for text classification, and I want to use this text classification example of spam versus non-spam, you may easily have on the order of 30,000 or 50,000 features; I think I used 50,000 in my earlier examples. If you have so many features, say 50,000 features, then depending on what learning algorithm you use, there may be a real risk of overfitting. And so if you can reduce the number of features, maybe you can reduce the variance of your learning algorithm and reduce the risk of overfitting. For the specific case of text classification, you might imagine that there's only a small number of relevant features: there are all these English words, and many of them probably don't tell you anything at all about whether the email is spam or non-spam. English function words like "the", "of", "a", and "and" are probably words that don't tell you anything about whether the email is spam or non-spam. In contrast, there will be a much smaller number of features that are truly relevant to the learning problem. So for example, if you see the word "buy" or "Viagra", those are words that are very useful for telling you whether something is spam or non-spam.
And if you see the word "Stanford", or "machine learning", or your own personal name, those are other words that are useful for telling you whether something is spam or non-spam. So in feature selection, we would like to select a subset of the features, hopefully the ones most relevant to a specific learning problem, so as to give ourselves a simpler hypothesis class to choose from, and therefore reduce the risk of overfitting, even when we may have had 50,000 features originally. So how do you do this? Well, if you have n features, then there are two to the n possible subsets; right? Because each of your n features can either be included or excluded, there are two to the n possibilities, and this is a huge space. So in feature selection, what we most commonly do is use various search heuristics, sort of simple search algorithms, to try to search through this space of two to the n possible subsets of features and find a good subset; it's too large a number to enumerate all possible feature subsets. As a concrete example, here is the forward search algorithm; it's also called the forward selection algorithm. It's actually pretty simple, but I'll just write it out; my writing it out will make it look more complicated than it really is. It starts with: initialize the set script F to be the empty set. Then repeat: for i equals one to n, try adding feature i to the set script F, and evaluate the resulting model using cross validation. And by cross validation, I mean any of the three flavors, be it simple hold-out cross validation, k-fold cross validation, or leave-one-out cross validation.
And then, you know, set F to be equal to F union the best single feature found; okay? And finally, you output the best hypothesis found. So the forward selection procedure is: start with the empty set of features. Then on each iteration, take each of your features that isn't already in your set script F and try adding that feature to your set. You train a model with each of these candidate sets, evaluate them all using cross validation, and basically figure out what is the best single feature to add to your set script F. In step two here, you go ahead and add that feature to your set script F, and you iterate. And when I say best feature or best model here, by best I really mean the best model according to hold-out cross validation: the single feature addition that results in the lowest hold-out cross validation error, or the lowest cross validation error. So you do this, adding one feature at a time. When do you terminate? If you've added all the features to F, so F is now the entire set of features, you can terminate. Or, if by some rule of thumb you know that you probably don't ever want more than a certain number of features, you can also terminate once F exceeds some threshold number of features. So maybe if you have 100 training examples and you're fitting logistic regression, you probably know you won't want more than 100 features, and so you stop after you have 100 features added to the set F; okay? And then finally, having done this, output the best hypothesis found. Again, by best I mean: while running this algorithm, you'll be training and seeing lots of hypotheses and testing them using cross validation; of all of the hypotheses you've seen during this entire procedure, pick the one with the lowest cross validation error that you saw; okay? So that's forward selection.
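Here is a rough sketch of that forward-search loop, scoring each candidate subset with simple hold-out cross validation; the synthetic data, the stopping threshold, and the helper name `holdout_error` are illustrative assumptions, not the lecture's own code:

```python
# Sketch of forward-search (wrapper) feature selection.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_features = 8
X = rng.standard_normal((300, n_features))
y = (X[:, 0] - 2 * X[:, 3] + 0.5 * rng.standard_normal(300) > 0).astype(int)

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

def holdout_error(feature_idx):
    """Train on the chosen feature subset; return hold-out cross validation error."""
    cols = sorted(feature_idx)
    model = LogisticRegression().fit(X_tr[:, cols], y_tr)
    return np.mean(model.predict(X_ho[:, cols]) != y_ho)

F = set()                       # script F: currently selected features
best_seen = (np.inf, frozenset())
max_features = 4                # stop early rather than adding every feature

while len(F) < max_features:
    # Try adding each feature not already in F; keep the single best addition.
    trials = {i: holdout_error(F | {i}) for i in range(n_features) if i not in F}
    i_best = min(trials, key=trials.get)
    F.add(i_best)
    if trials[i_best] < best_seen[0]:
        best_seen = (trials[i_best], frozenset(F))

# Output the best feature subset (and hence hypothesis) seen during the search.
best_error, best_subset = best_seen
```

Backward search, mentioned below, would run the same kind of loop in reverse: start from all the features and repeatedly delete the one whose removal gives the lowest cross validation error.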
So, let's see, just to give this a name: this is an instance of what's called wrapper feature selection. And the term wrapper comes from the fact that this feature selection algorithm I just described, forward selection or forward search, is a piece of software that you write that wraps around your learning algorithm, in the sense that to perform forward selection, you need to repeatedly make calls to your learning algorithm to train your model using different subsets of features; okay? So this is called wrapper model feature selection. And it tends to be somewhat computationally expensive, because as you're performing the search process, you're repeatedly training your learning algorithm over and over on all of these different subsets of features. Let me also just mention that there is a variation of this called backward search, or backward selection, which is where you start with F equal to the entire set of features, and you delete features one at a time; okay? So that's backward search or backward selection, and this is another feature selection algorithm you might use. Part of whether this makes sense is that there will be problems where it really doesn't even make sense to initialize F to be the set of all features. So if you have 100 training examples and 10,000 features, which may well happen, say 100 emails with 10,000 features, then depending on the learning algorithm you're using, it may or may not make sense to initialize the set F to be all features and train a model using all of the features.
And if it doesn't make sense to train a model using all the features, then forward selection would be more commonly used. So, let's see. Wrapper model feature selection algorithms tend to work well, and in particular, they actually often work better than a different class of algorithms I'm going to talk about now. But their main disadvantage is that they're computationally very expensive. Do you have any questions about this before I talk about the other class? Yeah?

Student: [Inaudible].

Instructor (Andrew Ng): Yeah, yes, you're actually right. So forward search and backward search are both search heuristics, and for either of these, you cannot guarantee they'll find the best subset of features. It actually turns out that under many formalizations of the feature selection problem, it is an NP-hard problem to find the best subset of features. But in practice, forward selection and backward selection work fine, and you can also envision other search algorithms, other methods for searching through the space of two to the n possible feature subsets.

So, let's see. Wrapper feature selection tends to work well when you can afford to do it computationally. But for problems such as text classification, specifically because you have so many features, easily 50,000 features, forward selection would be very, very expensive. So there's a different class of algorithms that tends not to do as well in terms of generalization error, so you tend to learn a hypothesis that works less well, but is computationally much less expensive. These are called the filter feature selection methods. And the basic idea is that for each feature i, we will compute some measure of how informative x_i is about y; okay?
And to do this, we'll use some simple heuristics; for every feature, we'll just compute some rough estimate, some measure, of how informative x_i is about y. There are many ways you can do this. One way is to just compute the correlation between x_i and y: for each of your features, see how correlated it is with your class label y, and then pick the top k most correlated features. Another way to do this, for the case of text classification, is one other informativeness measure that's used very commonly, called mutual information. You'll get to play with some of these ideas in the problem sets, but I'll just say this very briefly. So the mutual information between feature x_i and y; I'll just write out the definition, I guess. Let's say this is text classification, so x_i can take on two values, 0 and 1. The mutual information between x_i and y is defined as the sum, over the possible values of x_i and the possible values of y, of p(x_i, y) log [ p(x_i, y) / ( p(x_i) p(y) ) ]. All of these distributions, the joint distribution over x_i and y, you would estimate from your training data; and all of these other quantities you would estimate from the training data as well: what's the probability that x_i is 0, what's the probability that x_i is 1, what's the probability that x_i is 0 and y is 0, that x_i is 1 and y is 0, and so on. So it turns out there's a standard information-theoretic measure of how different two probability distributions are, and I'm not going to prove this here, but it turns out that this mutual information is actually an instance of that standard measure of how different two distributions are, which is called the KL divergence.
When you take a class in information theory, you will have seen the concepts of mutual information and the KL divergence; but if you haven't, don't worry about it. The intuition is that there's something called the KL divergence that's a formal measure of how different two probability distributions are. And mutual information is a measure of how different the joint distribution of x and y is from the distribution you would get if you were to assume they were independent; okay? So if x and y were independent, then p(x, y) would be equal to p(x) times p(y), and so those two distributions would be identical, and the KL divergence would be 0. In contrast, if x and y are very non-independent, in other words, if x and y are very informative about each other, then this KL divergence will be large. And so mutual information is a formal measure of how non-independent x and y are. If x and y are highly non-independent, that means x presumably tells you something about y, so they'll have large mutual information, and this measure will tell you that x_i might be a good feature. You'll get to play with some of these ideas more in the problem sets, so I won't say much more about it. And what you do then is, having chosen some measure like correlation or mutual information or something else, you pick the top k features; meaning that you compute the correlation between x_i and y for all the features, or the mutual information between x_i and y for all the features, and then you include in your learning algorithm the k features with the largest correlation with the label, or the largest mutual information with the label, or whatever. And to choose k, you can actually use cross validation as well; okay? So you would take all your features, sort them in decreasing order of mutual information, and then try using just the top one feature, the top two features, the top three features, and so on, and decide how many features to include using cross validation; okay? Or sometimes you can just choose k by hand, as well.
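As a rough sketch of this filter approach with the mutual information score, estimated from empirical counts as in the definition above; the binary placeholder data and the choice k = 5 are illustrative assumptions (in practice k would be chosen by cross validation, as just described):

```python
# Sketch of filter feature selection: score each binary feature x_i by its
# empirical mutual information with the label y, then keep the top k.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.integers(0, 2, size=(n, d))                  # binary features, e.g. word presence
y = (X[:, 0] ^ (rng.random(n) < 0.1)).astype(int)    # label mostly driven by feature 0

def mutual_information(xi, y):
    """MI(x_i, y) = sum over a, b in {0,1} of p(a, b) * log( p(a, b) / (p(a) p(b)) ),
    with the probabilities estimated from the data and 0 log 0 treated as 0."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (y == b))
            p_a, p_b = np.mean(xi == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

scores = np.array([mutual_information(X[:, i], y) for i in range(d)])

k = 5                                    # in practice, choose k by cross validation
top_k = np.argsort(scores)[::-1][:k]     # indices of the k most informative features
X_reduced = X[:, top_k]
```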
Okay. Questions about this? Okay. Cool. Great. So next lecture I'll continue and wrap up with Bayesian model selection; but we're close to the end of time, so let's stop here.