This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. If you haven't given me the homework yet, you can just give it to me at the end of class. That's fine. Let's see. And also just a quick reminder: I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.

What I want to do today is wrap up our discussion on support vector machines, and in particular we'll talk about the idea of kernels, then talk about the L1 norm soft margin SVM, and then I'll talk about the SMO algorithm, which is an algorithm for solving the optimization problem that we posed last time.

To recap, we wrote down the following convex optimization problem:

    min_{w,b} (1/2)||w||^2
    s.t. y^(i)(w^T x^(i) + b) >= 1, i = 1, ..., m.

All this is assuming that the data is linearly separable, which is an assumption that I'll fix later. And so with this optimization problem, given a training set, this will find the optimal margin classifier for the data set, the classifier that maximizes the geometric margin on your training examples.

And so in the previous lecture, we also derived the dual of this problem, which was to maximize

    W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} y^(i) y^(j) alpha_i alpha_j <x^(i), x^(j)>
    s.t. alpha_i >= 0, sum_i alpha_i y^(i) = 0.

And this is the dual of our primal optimization problem. Here, I'm using these angle brackets to denote inner product, so <x^(i), x^(j)> is just x^(i) transpose x^(j) for vectors x^(i) and x^(j). We also worked out that w would be given by w = sum_i alpha_i y^(i) x^(i). Therefore, when you need to make a prediction at classification time, you need to compute the value of the hypothesis applied to an input x, which is g(w^T x + b), where g is that threshold function that outputs plus one and minus one. And so this is g(sum_i alpha_i y^(i) <x^(i), x> + b).
So that can also be written entirely in terms of inner products between input vectors x. So what I want to do now is talk about the idea of kernels, which will make use of this property, because it turns out that the only dependence of the algorithm on x is through these inner products. In fact, you can write the entire algorithm without ever explicitly referring to an x vector, referring only to inner products between input feature vectors.

And the idea of a kernel is as follows. Let's say that you have an input attribute x. Let's just say for now it's a real number; maybe this is the living area of a house that you're trying to make a prediction on, like whether it will be sold in the next six months. Quite often, we'll take this feature x and we'll map it to a richer set of features. So for example, we might take x and map it to these four polynomial features, x, x^2, x^3, and x^4, and let me call this mapping Phi. So we'll let Phi(x) denote the mapping from your original features to some higher dimensional set of features.

So if you do this and you want to use the features Phi(x), then all you need to do is go back to the learning algorithm, and everywhere you see <x^(i), x^(j)>, we'll replace it with the inner product between Phi(x^(i)) and Phi(x^(j)). So this corresponds to running a support vector machine with the features given by Phi(x) rather than with your original one-dimensional input feature x.

And in the scenario that I want to consider, sometimes Phi(x) will be very high dimensional. So for example, Phi(x) may contain very high degree polynomial features. Sometimes Phi(x) will actually even be an infinite dimensional vector of features, and the question is, if Phi(x) is extremely high dimensional, then it seems you can't compute these inner products very efficiently, because the computer would need to represent an extremely high dimensional feature vector and then take inner products in that space, which would be inefficient.
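Here is a minimal sketch, just for illustration (the two inputs are made up, and NumPy is assumed), of what using Phi explicitly looks like:

```python
import numpy as np

def phi(x):
    """Map a scalar input attribute (say, living area) to the four
    polynomial features from the example above."""
    return np.array([x, x**2, x**3, x**4])

# Wherever the algorithm used the inner product <x_i, x_j>,
# it would now use <phi(x_i), phi(x_j)> instead.
x_i, x_j = 2.0, 3.0                 # hypothetical inputs
print(phi(x_i) @ phi(x_j))          # 6 + 36 + 216 + 1296 = 1554.0
```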
It turns out that in many important special cases, we can write down what we'll call the kernel function, denoted by K, which will be

    K(x, z) = <Phi(x), Phi(z)>,

the inner product between those feature vectors. It turns out there will be important special cases where computing Phi(x) is computationally very expensive, or maybe impossible: it may be an infinite dimensional vector, and you can't represent an infinite dimensional vector in a computer. There will be important special cases where Phi(x) is very expensive to represent because it is so high dimensional, but nonetheless, you can actually compute the kernel between x^(i) and x^(j), that is, compute the inner product between those two feature vectors, very inexpensively.

And so the idea of the support vector machine is that everywhere in the algorithm that you see these inner products, we're going to replace them with a kernel function that you can compute efficiently, and that lets you work in feature spaces Phi(x) even if Phi(x) is very high dimensional.

Let me now say how that's done. A little bit later today, we'll actually see some concrete examples of Phi(x) and of kernels. For now, let's just think about constructing kernels explicitly, which is best illustrated by an example.

Let's say you have two inputs, x and z. Normally I should write those as x^(i) and x^(j), but I'm just going to write x and z to save on writing. Let's say my kernel is K(x, z) = (x^T z)^2. And so this is

    (x^T z)^2 = (sum_i x_i z_i)(sum_j x_j z_j) = sum_i sum_j (x_i x_j)(z_i z_j).

And so this kernel corresponds to the feature mapping where Phi(x) is the vector of all the products x_i x_j. Let me write this down for the case of n equals three, I guess.
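The board work at this point isn't in the transcript; for n = 3, the feature map being written down is presumably the vector of all nine pairwise products:

```latex
\phi(x) =
\begin{bmatrix}
x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\
x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\
x_3 x_1 \\ x_3 x_2 \\ x_3 x_3
\end{bmatrix}
```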
And so with this definition of Phi(x), you can verify for yourself that (x^T z)^2 becomes the inner product between Phi(x) and Phi(z), because to get an inner product between two vectors, you take the sum over the corresponding elements of the vectors multiplied together. So if this is Phi(x), then the inner product between Phi(x) and Phi(z) will be the sum over all the elements of this vector times the corresponding elements of Phi(z), and what you get is sum_i sum_j (x_i x_j)(z_i z_j), which is exactly (x^T z)^2.

And so the cool thing about this is the computational cost. If n is the dimension of x and z, then Phi(x) is a vector of all the pairs x_i x_j multiplied with each other, and so the length of Phi(x) is n^2; you need order n^2 time just to compute Phi(x). But to compute K, to compute the kernel function, all you need is order n time, because the kernel function is defined as (x^T z)^2: you just take the inner product between x and z, which is order n time, and you square that, and you've computed the kernel function. And so you've just computed the inner product between two vectors where each vector has n^2 elements, but you did it in order n time.

Student: For any kernel you find for x and z, does Phi exist for x and z?

Instructor (Andrew Ng): Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.
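A quick numerical check of that claim, as a sketch (arbitrary made-up vectors, NumPy assumed):

```python
import numpy as np

def phi(x):
    """Feature map for K(x, z) = (x^T z)^2: all n^2 pairwise products x_i x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = phi(x) @ phi(z)   # inner product of two n^2-element vectors
kernel   = (x @ z) ** 2      # the same number, in O(n) time
print(explicit, kernel)      # both are (0.5 - 2 + 6)^2 = 20.25
```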
So let me just describe a couple of quick generalizations to this. One is that if you define K(x, z) to be equal to (x^T z + c)^2, so again, you can compute this kernel in order n time, then that turns out to correspond to a feature vector where, compared to the one we just saw, you add a few more elements at the bottom: root 2 c x1, root 2 c x2, root 2 c x3, and c. And so this is a way of creating a feature vector with both the monomials, meaning the first order terms, as well as the quadratic terms, the products between x_i and x_j, and the parameter c here allows you to control the relative weighting between the monomial terms, the first order terms, and the quadratic terms. Again, this is still an inner product between vectors of length order n squared, computed in order n time.

More generally, here are some other examples of kernels. Actually, a generalization of the one I just derived would be the kernel K(x, z) = (x^T z + c)^d. And so this corresponds to using all of the n plus d choose d features of all monomials, where monomials just means the products x_i x_j x_k and so on, all the polynomial terms up to degree d. The dimension of this feature space is on the order of n plus d choose d, so this grows exponentially in d. This is a very high dimensional feature vector, but again, you can implicitly construct the feature vector and take inner products between such vectors. It's very computationally efficient, because you just compute the inner product between x and z, add c, and take that real number to the power of d, and by plugging this in as a kernel, you're implicitly working in an extremely high dimensional feature space.
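As a small sketch of that cost argument (NumPy assumed, values made up):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    """Polynomial kernel K(x, z) = (x^T z + c)^d.

    O(n) arithmetic, yet it implicitly corresponds to a feature space
    of all monomials up to degree d, roughly (n + d choose d) dimensions."""
    return (x @ z + c) ** d

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(poly_kernel(x, z))   # (4.5 + 1)^3 = 166.375
```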
So what I've given is just a few specific examples of how to create kernels. Let me now ask, more generally: if you're faced with a new machine learning problem, how do you come up with a kernel? There are many ways to think about it, but here's one intuition that's sort of useful.

Given a set of attributes x, you're going to use a feature vector Phi(x), and given a set of attributes z, you're going to use a feature vector Phi(z), and so the kernel is computing the inner product between Phi(x) and Phi(z).

And so one intuition, and this is a partial intuition, not a rigorous one, is that if x and z are very similar, then Phi(x) and Phi(z) will be pointing in roughly the same direction, and therefore the inner product will be large. Whereas in contrast, if x and z are very dissimilar, then Phi(x) and Phi(z) may be pointing in different directions, and so the inner product may be small. That intuition is not a rigorous one, but it's sort of a useful one to think about.

If you're faced with a new learning problem, if I give you some random thing to classify and you want to decide how to come up with a kernel, one way is to try to come up with a function K(x, z) that is large if you want the learning algorithm to think of x and z as similar, and small if you want it to think of x and z as dissimilar. Again, this isn't always true, but this is one of several useful intuitions. So if you're trying to classify some brand new thing, you're trying to classify [inaudible] or DNA sequences or something, some strange thing you want to classify, one thing you could do is try to come up with a kernel that's large when you want the algorithm to think these are similar things, and small when they're dissimilar.

And so suppose I have something I want to classify, and let's say I write down the function that I think is a good measure of how similar or dissimilar x and z are for my specific problem. Let's say I write down

    K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).

I write down this function, and I think this is a good measure of how similar x and z are.
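For concreteness, a minimal sketch of that function in code, assuming the Gaussian form that's named later in the lecture:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): near 1 when x and z
    are close together, decaying toward 0 as they move apart."""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x))                      # identical inputs -> 1.0
print(gaussian_kernel(x, np.array([5.0, -3.0])))  # dissimilar inputs -> ~0
```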
The question, then, is: is this really a valid kernel? In other words, to understand how we can construct kernels, if I write down a function like that, the question is whether there really exists some Phi such that K(x, z) is equal to the inner product <Phi(x), Phi(z)>. That's the question of whether K is a valid kernel.

It turns out that there is a result that characterizes necessary and sufficient conditions for when a function K that you might choose is a valid kernel. Let me go ahead and show part of that result now.

Suppose K is a valid kernel, and when I say K is a kernel, what I mean is that there does indeed exist some function Phi for which this holds true. Then let any set of points x^(1) up to x^(m) be given, and let me define a matrix K. I apologize for overloading notation; I'm going to use K to denote both the kernel function, which is a function of x and z, as well as a matrix. Unfortunately, there aren't enough alphabets. Well, that's not true. We define the kernel matrix to be the m-by-m matrix such that K subscript ij is equal to the kernel function applied to two of my examples, K_ij = K(x^(i), x^(j)).

Then it turns out that for any vector z that's m-dimensional, I want you to consider z^T K z. By definition of matrix multiplication, this is

    z^T K z = sum_i sum_j z_i K_ij z_j,

and K_ij is the kernel function applied to x^(i) and x^(j). I assumed that K is a valid kernel function, so there must exist such a Phi, and this must equal

    sum_i sum_j z_i <Phi(x^(i)), Phi(x^(j))> z_j.

This is the inner product between two feature vectors, so let me just make that inner product explicit. I'm going to sum over the elements of this vector, and I'm going to use Phi_k(x^(i)) to denote the k-th element of this vector, which gives

    sum_i sum_j z_i sum_k Phi_k(x^(i)) Phi_k(x^(j)) z_j.

Just rearrange the sums, and you get a sum over k.
This next step may look familiar to some of you; it is just

    sum_k (sum_i z_i Phi_k(x^(i)))^2.

Therefore, this is a sum of squares, and it must therefore be greater than or equal to zero.

Do you want to take a minute to look over all the steps and just make sure you buy them all? Oh, this step is the inner product between the vectors Phi(x^(i)) and Phi(x^(j)); the inner product between two vectors is the sum over the corresponding elements of the vectors multiplied together.

Student: [Inaudible].

Instructor (Andrew Ng): Oh, yes it is. This is just a^T b = sum_k a_k b_k, so that's just this: the sum over k of the k-th elements of the two vectors. Take a look at this and make sure it makes sense. Questions about this?

So just to summarize, what we showed was that for any vector z, z^T K z is greater than or equal to zero, and this is one of the standard definitions of the matrix K being positive semidefinite, written K >= 0. So just to summarize, what was shown is that if K is a valid kernel, in other words, if K is a function for which there exists some Phi such that K(x^(i), x^(j)) is the inner product between Phi(x^(i)) and Phi(x^(j)), then the kernel matrix must be positive semidefinite.

It turns out that the converse is also true, and so this gives you a test for whether a function K is a valid kernel. This is a theorem due to Mercer, and so kernels are also sometimes called Mercer kernels. It means the same thing; it just means it's a valid kernel. Let K(x, z) be given.
Then K is a valid kernel, in other words a Mercer kernel, i.e., there exists a Phi such that K(x, z) = Phi(x)^T Phi(z), if and only if for any set of m examples, and this really means for any set of m points, not necessarily a training set, just any set of m points you may choose, it holds true that the kernel matrix, the capital K that I defined just now, is symmetric positive semidefinite.

And so I proved only one direction of this result: I proved that if K is a valid kernel, then the kernel matrix is symmetric positive semidefinite, but the converse I didn't show. It turns out that this condition is both necessary and sufficient, and so this gives you a useful test for whether any function that you might want to choose is a kernel.

A concrete example of something that's clearly not a valid kernel would be if you find an input x such that K(x, x) is, say, minus one. That's an example of something that's clearly not a valid kernel, because minus one cannot possibly be equal to Phi(x)^T Phi(x), and so this would be one of many examples of functions that fail to meet the conditions of this theorem, because the inner product of a vector with itself is always greater than or equal to zero.
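As a sketch of this test in practice (NumPy assumed, points made up): sample some points, build the kernel matrix, and check symmetry and nonnegative eigenvalues up to numerical tolerance. Passing on one point set doesn't prove validity, but a failure disproves it:

```python
import numpy as np

def kernel_matrix(kernel, xs):
    """Build the m-by-m matrix K_ij = kernel(x_i, x_j)."""
    m = len(xs)
    return np.array([[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)])

def passes_mercer_check(kernel, xs, tol=1e-8):
    """Check the necessary condition: symmetric positive semidefinite."""
    K = kernel_matrix(kernel, xs)
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

pts = list(np.random.default_rng(0).normal(size=(10, 3)))
print(passes_mercer_check(lambda x, z: (x @ z + 1.0) ** 2, pts))  # True
print(passes_mercer_check(lambda x, z: -float(x @ z), pts))       # False
```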
So just to tie this back explicitly to the SVM: to use a support vector machine with a kernel, what you do is choose some function K(x, z). And it turns out that the function I wrote down just now is indeed a valid kernel; it is called the Gaussian kernel because of its similarity to Gaussians. So you choose some kernel function like this, or you may choose (x^T z + c)^d. You choose one of these functions, and the choice would depend on your problem; it depends on what is a good measure of whether two examples are similar or different for your problem. Then you go back to our formulation of the support vector machine, and you have to use the dual formulation, and you then replace everywhere you see <x^(i), x^(j)> with K(x^(i), x^(j)). You then run exactly the same support vector machine algorithm, only everywhere you see these inner products, you replace them with that, and what you've just done is you've taken a support vector machine and you've taken each of your feature vectors x and replaced it, implicitly, with a very high dimensional feature vector.

It turns out that the Gaussian kernel corresponds to a feature vector that's infinite dimensional. Nonetheless, you can run a support vector machine in a finite amount of time, even though you're working with infinite dimensional feature vectors, because all you ever need to do is compute these kernel values, and you don't ever need to represent these infinite dimensional feature vectors explicitly.

Why is this a good idea? It turns out, when I started off talking about support vector machines, I started by saying that we wanted to develop non-linear learning algorithms. So here's one useful picture to keep in mind. Let's say your original data, I didn't mean to draw that slanted, let's say you have one-dimensional input data: you just have one feature, x in R. What a kernel does is the following: it takes your original input data and maps it to a very high dimensional feature space. In the case of the Gaussian kernel, that's an infinite dimensional feature space, but for pedagogical reasons, I'll draw two dimensions here. So say the data gets mapped to some very high dimensional feature space, like so.
So it takes all your data in R1 and maps it to R infinity, and then you run a support vector machine in this infinite dimensional space, or in some exponentially high dimensional space, and you find the optimal margin classifier, in other words, the classifier that separates your data in this very high dimensional space with the largest possible geometric margin.

In this example that I just drew, whereas your data was not linearly separable in your original one dimensional space, when you map it to this much higher dimensional space, it becomes linearly separable, so you can use your linear classifier to separate data that was not separable in your original space. This is how support vector machines output nonlinear decision boundaries, and in the entire process, all you ever need to do is solve convex optimization problems.

Questions about any of this?

Student: [Inaudible] sigma?

Instructor (Andrew Ng): Yeah, so, how do you choose sigma? Let's see. Well, I was going to talk about that later. One way to choose sigma is to save aside a small amount of your data: train an SVM using, say, two thirds of your data, try different values of sigma, and see what works best on a separate hold-out cross validation set, a separate set that you test on. Some of the other learning algorithms we've talked about, like locally weighted linear regression with its bandwidth parameter, also have parameters like this; there are a number of parameters to some of these algorithms that you can choose by saving aside some data to test on. I'll talk more about model selection explicitly later. Are there other questions?

Student: So how do you know that moving it up to a high dimensional space is going to give you that kind of separation?

Instructor (Andrew Ng): Good question. Usually, you don't know in advance.
Sometimes you can know, but in most cases, you won't know in advance whether the data is actually going to be linearly separable in the new space, so the next topic will be the L1 norm soft margin SVM, which is the modification that makes SVMs work even when the data is not linearly separable.

Student: If you can't get it linearly separated by mapping to a higher dimension, couldn't you also just use [inaudible] higher dimension?

Instructor (Andrew Ng): Right, so this is a question about what to do if you can't separate it even in the higher dimensional space. Let me try to address that with the discussion of soft margin SVMs. Okay.

Student: What if you run an SVM algorithm that assumes the data are linearly separable on data that is not actually linearly separable?

Instructor (Andrew Ng): You guys are really giving me a hard time about whether the data's linearly separable. It turns out this algorithm won't work if the data is not linearly separable, but I'll change that in a second and make it work.

Before I move on to talk about that, let me just say one final word about kernels. I talked about kernels in the context of support vector machines, and the idea of kernels is what really made support vector machines a very powerful learning algorithm. Actually, towards the end of today's lecture, if I have time, I'll give a couple more examples of how to choose kernels as well.

It turns out that the idea of kernels is actually more general than support vector machines. In particular, we took this SVM algorithm and we derived a dual, and that was what let us write the entire algorithm in terms of inner products of the inputs.
It turns out that you can take many of the other algorithms that you've seen in this class, in fact, you can take most of the linear algorithms we talked about, such as linear regression, logistic regression [inaudible], and rewrite them entirely in terms of these inner products.

So if you have any algorithm that you can rewrite in terms of inner products, then that means you can replace those inner products with K(x^(i), x^(j)), and that means you can take any of these algorithms, implicitly map the feature vectors into very high dimensional feature spaces, and have the algorithm still work. The idea of kernels is perhaps most widely used with support vector machines, but it is actually more general than that. You can take many of the other algorithms that you've seen, and many of the algorithms that we'll see later this quarter as well, write them in terms of inner products, and thereby kernelize them and apply them to infinite dimensional feature spaces. You'll actually get to play with many of these ideas more in the next problem set.

Now let's talk about non-linear decision boundaries. This is the idea of what's called the L1 norm soft margin SVM. Machine learning people sometimes aren't great at coming up with good names, but here's the idea.

Let's say I have a data set, and it's a linearly separable data set, but what do I do if I have a couple of other examples there that make the data non-linearly separable? And in fact, sometimes even if the data is linearly separable, you might not want to separate it perfectly. For example, say this is a very nice data set, and it looks like there's a great decision boundary that separates the two classes. Well, what if I had just one outlier down here?
I could still linearly separate this data set with something like that, but then I'm letting one slightly suspicious example skew my entire decision boundary by a lot. And so what I'm going to talk about now is the L1 norm soft margin SVM, which is a slightly modified formulation of the SVM optimization problem. It will let us deal with both of these cases: one, where the data is just not linearly separable, and two, where you have some examples that you'd rather not have dominate the fit. Maybe with an outlier here, you'd actually prefer to choose the original decision boundary and not try so hard to classify that one training example correctly. Here's the formulation.

Our SVM primal problem was to minimize one-half the norm of w squared, subject to each training example having functional margin at least one. So that was our original problem, and I'm going to modify it by adding the following. In other words, I'm going to add these penalty terms, the slack variables xi_i, and I'm going to demand that each of my training examples is separated with functional margin greater than or equal to one minus xi_i.

And you remember, was it two lectures ago, I said that if the functional margin is greater than zero, that implies you classified the example correctly, and if it's less than zero, then you misclassified it. By allowing some of the xi_i to be larger than one, I can have some examples with negative functional margin, and therefore I'm allowing my algorithm to misclassify some of the examples in the training set. However, I'll encourage the algorithm not to do that by adding to the optimization objective this penalty term that penalizes setting the xi_i to be large.

This is an optimization problem where the parameters are w, b, and all of the xi_i, and this is also a convex optimization problem.
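The board work isn't captured in the transcript; written out, the problem being described is presumably the standard L1 soft-margin primal:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y^{(i)}\bigl(w^T x^{(i)} + b\bigr) \ge 1 - \xi_i, \qquad i = 1,\dots,m \\
& \xi_i \ge 0, \qquad i = 1,\dots,m
\end{aligned}
```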
It turns out that, similar to how we worked out the dual of the original support vector machine problem, we can also work out the dual of this optimization problem. I won't actually do the whole derivation, but just to show you the steps: you construct the Lagrangian, and I'm going to use alpha and r to denote the Lagrange multipliers corresponding to the set of constraints that we had previously and to the new set of constraints that the xi_i be greater than or equal to zero. This gives us the Lagrangian

    L(w, b, xi, alpha, r) = (1/2)||w||^2 + C sum_i xi_i
        - sum_i alpha_i [y^(i)(w^T x^(i) + b) - 1 + xi_i] - sum_i r_i xi_i,

which is our optimization objective minus each multiplier times each of these constraints, which are greater than or equal to zero.

I won't re-derive the entire dual again; it's really the same. When you derive the dual of this optimization problem and simplify, you find that you get the following: you have to maximize W(alpha), which is actually the same as before,

    W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} y^(i) y^(j) alpha_i alpha_j <x^(i), x^(j)>.

So it turns out, when you derive the dual and simplify, the only way the dual changes compared to the previous one is that rather than the constraint that the alpha_i be greater than or equal to zero, we now have the constraint that the alpha_i be between zero and C. This derivation isn't very hard, and you're encouraged to go home and try to do it yourself. It's really essentially the same math, and when you simplify, it turns out you can simplify the r multipliers away, and you end up with just these constraints on the alphas:

    0 <= alpha_i <= C,   sum_i alpha_i y^(i) = 0.

Just as an aside, I won't derive these either, but remember, I wrote down the KKT conditions in the last lecture, the necessary conditions for something to be an optimal solution to a constrained optimization problem.
So if you use the KKT conditions, it turns out you can actually derive convergence conditions. We want to solve this optimization problem, so when do we know the alphas have converged to the global optimum? It turns out you can use the following:

    alpha_i = 0        implies  y^(i)(w^T x^(i) + b) >= 1
    alpha_i = C        implies  y^(i)(w^T x^(i) + b) <= 1
    0 < alpha_i < C    implies  y^(i)(w^T x^(i) + b) = 1.

I don't want to say a lot about these; it turns out that from the KKT conditions you can derive these as the convergence conditions for an algorithm that you might choose to use to solve this optimization problem in terms of the alphas.

That's the L1 norm soft margin SVM, and this is the change to the algorithm that lets us handle data sets that are not linearly separable, as well as data sets that may still be linearly separable except for a single outlier that you may choose not to separate correctly.

Questions about this? Raise your hand if this stuff makes sense at all. Great.

So the last thing I want to do is talk about an algorithm for actually solving this optimization problem. We wrote down this dual optimization problem with convergence criteria, so let's come up with an efficient algorithm to actually solve it.

I want to do this partly to give me an excuse to talk about an algorithm called coordinate ascent, which is a useful algorithm to know about. It will turn out that coordinate ascent in its simplest form won't apply to this problem, but we'll then be able to modify it slightly, and that will give us a very efficient algorithm for solving the dual optimization problem. That was the other reason I had to derive the dual: not only so that we could use kernels, but also so that we could apply an algorithm like the SMO algorithm. First, let's talk about coordinate ascent, which is another unconstrained optimization algorithm.
To describe coordinate ascent, I just want you to consider the problem of maximizing some function W, which is a function of alpha_1 through alpha_m, with no constraints. So for now, forget about the constraint that the alpha_i must be between zero and C, and forget about the constraint that sum_i y^(i) alpha_i must be equal to zero.

Then this is the coordinate ascent algorithm: repeat until convergence, and inside the loop, for i equals one to m, hold all the parameters except alpha_i fixed and maximize the function with respect to just that one parameter. Let me write that as

    alpha_i := arg max over alpha_i-hat of W(alpha_1, ..., alpha_{i-1}, alpha_i-hat, alpha_{i+1}, ..., alpha_m).

This is really just a fancy way of saying: hold everything except alpha_i fixed, and optimize the objective W with respect to only alpha_i. That's coordinate ascent.

One picture that's kind of useful for coordinate ascent is this: imagine you're trying to optimize a quadratic function that really looks like that. These are the contours of the quadratic function, and the optimum is here. I'll call this axis alpha_1 and this one alpha_2. Let's say I start down here. Then I'm going to begin by optimizing this with respect to alpha_1, so I go there. Then, at my new point, I'll optimize with respect to alpha_2, and so I might go to someplace like that. Then I'll optimize with respect to alpha_1 again, then go back to alpha_2, and so on. You're always taking these axis-aligned steps to get to the optimum.
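Here is a minimal sketch of coordinate ascent on a made-up two-dimensional concave quadratic (not the SVM objective), where each one-variable maximization has a closed form:

```python
import numpy as np

# Maximize W(a) = -0.5 a^T P a + q^T a by coordinate ascent:
# each inner step solves dW/da_i = 0 for a_i with the rest held fixed.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # positive definite, so W is concave
q = np.array([1.0, 1.0])

a = np.zeros(2)
for _ in range(50):              # "repeat until convergence"
    for i in range(len(a)):
        others = P[i] @ a - P[i, i] * a[i]   # contribution of the fixed a_j
        a[i] = (q[i] - others) / P[i, i]     # exact one-variable maximizer

print(a)                         # approaches the true maximizer P^{-1} q
print(np.linalg.solve(P, q))     # [0.2857..., 0.8571...]
```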
It turns out that there are variations of this as well. The way I described the algorithm, we're always doing the updates in alternating order: we optimize with respect to $\alpha_1$, then $\alpha_2$, then $\alpha_1$, then $\alpha_2$, and so on. What I'm about to say applies only in higher dimensions, but it turns out that if you have a lot of parameters, $\alpha_1$ through $\alpha_m$, you may not choose to always visit them in a fixed order. You may choose which alpha to update next depending on which one you think will allow you to make the most progress. If you have only two parameters, it makes sense to alternate between them; if you have higher-dimensional parameters, sometimes you may choose to update them in a different order if you think doing so would let you make faster progress towards the maximum.

It turns out that compared to some of the algorithms we saw previously, say Newton's method, coordinate ascent will usually take a lot more steps. But the chief advantage of coordinate ascent, when it works well, is that it is sometimes very inexpensive to optimize $W$ with respect to any one of your parameters. So even though coordinate ascent may take many more iterations than, say, Newton's method in order to converge, each iteration can be very cheap. It turns out that there are many optimization problems for which it's particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter, and if that's true, then the inner loop of coordinate ascent, optimizing with respect to $\alpha_i$, can be done very quickly, and coordinate ascent can be very efficient. It turns out that this will be true when we modify this algorithm to solve the SVM optimization problem.

Questions about this?

Okay. Let's go ahead and apply this to our support vector machine dual optimization problem. It turns out that coordinate ascent in its basic form does not work here, for the following reason: we have constraints on the $\alpha_i$'s.
Namely, recall from what we worked out previously that we have the constraint that $\sum_i \alpha_i y_i$ must be equal to zero, and so if you fix all the alphas except for one, then you can't change that one alpha without violating the constraint. If I just try to change $\alpha_1$, then $\alpha_1$ is actually exactly determined as a function of the other alphas, because the terms must sum to zero.

The SMO algorithm, by the way, is due to John Platt, a colleague at Microsoft. In the SMO algorithm, therefore, instead of trying to change one alpha at a time, we will try to change two alphas at a time. This is called the SMO algorithm, for sequential minimal optimization, and the term minimal refers to the fact that we're choosing the smallest number of alphas to change at a time; in this case, we need to change at least two at a time.

So let me go ahead and outline the algorithm. Repeat until convergence: select two alphas to change next, some $\alpha_i$ and $\alpha_j$, via some heuristic, which just means a rule of thumb; then hold all the $\alpha_k$'s fixed except $\alpha_i$ and $\alpha_j$, and optimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, subject to all the constraints.

It turns out the key step, which I'm going to work out, is this last one: how do you optimize $W(\alpha)$ with respect to the two parameters that you just chose to update, subject to the constraints? I'll do that in a second. You keep running this algorithm until you have satisfied the convergence criteria up to $\epsilon$.
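As a sketch of this outline in code, with hypothetical stand-ins for the pieces just described: `choose_pair` for the selection heuristic, `optimize_pair` for the two-variable update derived next, and `kkt_satisfied` for the convergence check up to a tolerance.

```python
def smo_outline(alpha, choose_pair, optimize_pair, kkt_satisfied, eps=1e-3):
    """Skeleton of SMO: update two alphas at a time until the KKT
    convergence conditions hold up to the tolerance eps."""
    while not kkt_satisfied(alpha, eps):
        i, j = choose_pair(alpha)  # heuristic (rule of thumb) choice of a pair
        # Hold all alpha_k (k != i, j) fixed and maximize W with respect to
        # alpha_i, alpha_j, subject to the box and linear constraints.
        optimize_pair(alpha, i, j)
    return alpha
```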
What I want to do now is describe how to do this key step: how to optimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$. It's because you can do this extremely efficiently that the SMO algorithm works well: the inner loop of SMO can be done extremely efficiently, so it may take a large number of iterations to converge, but each iteration is very cheap. Let's talk about that.

In order to derive the step where we update with respect to $\alpha_i$ and $\alpha_j$, I'm actually going to use $\alpha_1$ and $\alpha_2$ as my example. In general, this could be any $\alpha_i$ and $\alpha_j$, but just to make my notation on the board easier, I'm going to do the derivation for $\alpha_1$ and $\alpha_2$; the general case is completely analogous.

On every step of the algorithm, we must respect the constraint that $\sum_i \alpha_i y_i = 0$. This is one of the constraints we had previously for our dual optimization problem. This means that $\alpha_1 y_1 + \alpha_2 y_2$ must be equal to $-\sum_{i=3}^{m} \alpha_i y_i$, which I'm going to denote by $\zeta$.

We also have the constraint that the $\alpha_i$'s must be between zero and $C$; we had two constraints on our dual, and this was the other one. In pictures, the constraint that the $\alpha_i$'s are between zero and $C$ is often called the box constraint, and so if I draw the $\alpha_1$ and $\alpha_2$ axes, I have the constraint that the values of $\alpha_1$ and $\alpha_2$ must lie within the box that ranges from zero to $C$. And so the picture of the algorithm is as follows.
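In other words, the feasible set for the pair $(\alpha_1, \alpha_2)$ is the portion of a line that lies inside the box:

$$\alpha_1 y_1 + \alpha_2 y_2 = \zeta, \qquad 0 \le \alpha_1 \le C, \qquad 0 \le \alpha_2 \le C.$$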
I have the constraint that $\alpha_1 y_1 + \alpha_2 y_2$ must equal $\zeta$, and this implies that

$$\alpha_1 = \frac{\zeta - \alpha_2 y_2}{y_1},$$

and what I want to do is optimize the objective subject to this. What I can do is plug in my definition of $\alpha_1$ as a function of $\alpha_2$, so the objective becomes $W\left(\frac{\zeta - \alpha_2 y_2}{y_1}, \alpha_2, \alpha_3, \ldots, \alpha_m\right)$. It turns out that $W$ is a quadratic function; if you look back at our earlier definition of $W$, you'll find it's quadratic in all the alphas. So if you look at this expression for $W$ and view it as just a function of $\alpha_2$, holding $\alpha_3$, $\alpha_4$ and so on fixed, you find that it is a one-dimensional quadratic function of $\alpha_2$, and so it can be simplified to some expression of the form $a\alpha_2^2 + b\alpha_2 + c$. This is a standard quadratic function, and it's really easy to optimize. When did we learn this? In high school or undergrad or something. You just do that, and it gives you the optimal value for $\alpha_2$.

The last step deals with the box constraint. In pictures: you know your solution must lie on this line, and so there'll be some quadratic function over this line. If, when you optimize the quadratic function, you get a value that lies in the box, then you're done. Or maybe when you optimize your quadratic function, you end up with a value outside the box. If that happens, you clip your solution to map it back inside the box.
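A sketch of this update in code, assuming the one-dimensional objective has already been reduced to the quadratic $a\alpha_2^2 + b\alpha_2 + c$ with $a < 0$, and that $[L, H]$ is the feasible interval for $\alpha_2$ obtained by intersecting the line with the box (computing $a$, $b$, $L$, and $H$ from the data is omitted here):

```python
def update_pair(a, b, y1, y2, zeta, L, H):
    """Maximize the one-dimensional quadratic in alpha_2, clip the result
    to [L, H], then recover alpha_1 from the linear constraint."""
    alpha2 = -b / (2.0 * a)             # unconstrained maximizer of a*x^2 + b*x + c
    alpha2 = max(L, min(H, alpha2))     # clip back onto the feasible line segment
    alpha1 = (zeta - alpha2 * y2) / y1  # alpha_1 is determined by the constraint
    return alpha1, alpha2
```

Note that the endpoints $L$ and $H$ depend on whether $y_1 = y_2$, since that determines the slope of the line through the box.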
That'll give you the optimal solution of this quadratic optimization problem subject to your solution satisfying the box constraint and lying on this straight line; in other words, subject to the solution lying on the line segment within the box.

Having solved for $\alpha_2$ this way, you clip it if necessary to get it back within the box constraint, and then you get $\alpha_1$ as a function of $\alpha_2$. This allows you to optimize $W$ with respect to $\alpha_1$ and $\alpha_2$ quickly, subject to all the constraints, and the key step is really just a one-dimensional quadratic optimization, which we can do very quickly. That is what makes the inner loop of the SMO algorithm very efficient.

Student: You mentioned that in coordinate ascent we can change whichever alpha we want, but in the SMO algorithm we have to change two at a time. Can you help me understand that?

Instructor (Andrew Ng): Right. As I run the optimization algorithm, I have to respect the constraint that $\sum_i \alpha_i y_i$ must be equal to zero. This is a linear constraint that I didn't have when I was talking about coordinate ascent. Suppose I tried to change just $\alpha_1$. Then I know that $\alpha_1$ must be equal to $-\sum_{i=2}^{m} \alpha_i y_i$ divided by $y_1$, right? And so $\alpha_1$ can actually be written exactly as a function of $\alpha_2, \alpha_3, \ldots, \alpha_m$. And so if I hold $\alpha_2$, $\alpha_3$, $\alpha_4$ through $\alpha_m$ fixed, then I can't change $\alpha_1$, because $\alpha_1$ is completely determined by the other alphas.
Whereas in contrast, if I choose to change $\alpha_1$ and $\alpha_2$ at the same time, then I still have the constraint, and I know $\alpha_1$ and $\alpha_2$ must satisfy that linear constraint, but at least this way I can change $\alpha_1$ if I also change $\alpha_2$ accordingly to make sure the constraint stays satisfied.

Student: [Inaudible].

Instructor (Andrew Ng): So $\zeta$ was defined as follows. On each iteration, I have some setting of the parameters $\alpha_1, \alpha_2, \alpha_3$ and so on, and I want to change $\alpha_1$ and $\alpha_2$, say. Let's say that coming out of the previous iteration I had not violated the constraint, so it holds true, and I'm just defining $\zeta$ to be equal to this, because $\alpha_1 y_1 + \alpha_2 y_2$ must be equal to $-\sum_{i=3}^{m} \alpha_i y_i$, and so I'm just defining that quantity to be $\zeta$.

Student: [Inaudible].

Instructor (Andrew Ng): On every iteration, you may change a different pair of alphas. The way you choose which pair is something I don't want to go into in detail; I'll say a couple more words about it shortly. The basic outline of the algorithm is that on every iteration, you select some $\alpha_i$ and $\alpha_j$ to update via some heuristic, and then you use the procedure I just described to actually update $\alpha_i$ and $\alpha_j$. What I just talked about was the procedure to optimize $W$ with respect to $\alpha_i$ and $\alpha_j$; I didn't talk about the heuristic for choosing $\alpha_i$ and $\alpha_j$.

Student: What is the function $W$?

Instructor (Andrew Ng): $W$ is way up there; I'll just write it again. $W(\alpha)$ is the objective of the dual problem we wrote down previously:

$$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle.$$

All of this is about solving the optimization problem for the SVM, so this is the objective function we had; that's $W(\alpha)$.
Student: If you change one of the alphas, optimizing over it, you can make the other one that you have to change worse, right?

Instructor (Andrew Ng): What do you mean, worse?

Student: It will get farther from its optimal value.

Instructor (Andrew Ng): Let me put it differently. What we're trying to do is optimize the objective function $W(\alpha)$, so the metric of progress that we care about is whether $W(\alpha)$ is getting better on every iteration. And what is true for both coordinate ascent and SMO is that on every iteration, $W(\alpha)$ can only increase: it may stay the same or it may increase, but it can't get worse. It's true that the alphas will eventually converge to some values, and it's true that in intervening iterations one of the alphas may move further from and then closer to its final value, but what we really care about is that $W(\alpha)$ gets better every time, which is true.

Just a couple more words on SMO before I wrap up this topic. One is that John Platt's original paper describes a heuristic for choosing which pair $\alpha_i$, $\alpha_j$ to update next. It's one of those things that's not conceptually complicated but is very complicated to explain in words, so I won't talk about it here. If you want to learn about it, go ahead and look up John Platt's paper on the SMO algorithm; it's pretty easy to read. Later on, we'll also post a handout on the course homepage with a simplified version of this heuristic that you can use in the problem sets, and you can see some of these details there.

One other thing I didn't talk about is how to compute the parameter $b$. Everything here solves for the alphas, and the alphas are also what allow us to get $w$; but I didn't talk about how to compute the intercept $b$, and it turns out that, again, that is actually not very difficult.
I'll let you read about that yourself in the notes that we'll post along with the next problem set.

To wrap up today's lecture, what I want to do is tell you briefly about a couple of examples of applications of SVMs.

Let's consider the problem of handwritten digit recognition. In handwritten digit recognition, you're given a pixel array containing a scanned image of, say, a zip code. This is an array of pixels, and some of these pixels will be on and other pixels will be off; a particular combination of pixels being on may represent the character one. The question is, given an input feature vector like this, can you classify the digit? If you have, say, ten pixels by ten pixels, then you have a hundred-dimensional feature vector: 100 features, where maybe the x's are binary, zero or one, or maybe the x's are grayscale values corresponding to how dark each of the pixels is.

It turns out that for many years, a neural network was the champion algorithm for handwritten digit recognition. And it turns out that you can apply an SVM with the following kernel: either the polynomial kernel or the Gaussian kernel works fine for this problem, and just by writing down such a kernel and throwing an SVM at the problem, the SVM gave performance comparable to the very best neural networks. This is surprising, because the support vector machine doesn't take into account any knowledge about the pixels; in particular, it doesn't know that this pixel is next to that pixel, because it's just representing the pixel intensity values as a vector. And so this means the performance of the SVM would be the same even if you were to shuffle all the pixels around.
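For concreteness, here's a sketch of the two kernels mentioned, applied to flattened pixel vectors; `degree`, `c`, and `sigma` are illustrative hyperparameter choices, not values from the lecture.

```python
import numpy as np

def polynomial_kernel(x, z, degree=5, c=1.0):
    # K(x, z) = (x'z + c)^d: an implicit map to all monomials up to degree d
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=5.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): an infinite-dimensional feature map
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```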
So the SVM gave performance comparable to the very best neural networks, which had been under very careful development for many years.

I want to tell you about one other cool example, which is that SVMs are also used to classify other fairly esoteric objects. For example, let's say you want to classify protein sequences into different classes of proteins. Every time I do this, I suspect that the biologists in the room cringe, so I apologize for that. There are 20 amino acids, and the proteins in our bodies are made up of sequences of amino acids. Even though there are 20 amino acids and 26 letters in the alphabet, I'm going to denote the amino acids by the letters A through Z, with apologies to the biologists. So here's an amino acid sequence, represented by a series of letters.

Suppose I want to assign this protein to one of a few classes, depending on what type of protein it is. The question is, how do I construct my feature vector? This is challenging for many reasons, one of which is that protein sequences can be of different lengths: there are some very long protein sequences and some very short ones, and so you can't have a feature saying what the amino acid in the 100th position is, because maybe there is no 100th position in a given protein sequence.

Here's my feature representation. I'm going to write down all possible combinations of four letters: AAAA, AAAB, AAAC, down to AAAZ, and then AABA and so on; you get the idea. Write down all possible combinations of four letters, and my feature representation will be to scan through the sequence of amino acids and count how often each of these length-four subsequences occurs.
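Here is a sketch of this counting representation and the inner product it induces, storing only the length-four windows that actually occur; this simple version is just an illustration, not the dynamic programming algorithm mentioned below.

```python
from collections import Counter

def kmer_counts(sequence, k=4):
    """Count how often each length-k window occurs in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=4):
    """Inner product of the two implicit count vectors, computed without
    ever materializing the full high-dimensional representation."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * cb.get(kmer, 0) for kmer, count in ca.items())
```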
So for example, if BAJT occurs twice, I'll put a two in the corresponding position; for each subsequence that doesn't occur, I'll put a zero; maybe I have a one here and a zero there. This very long vector will be my feature representation for the protein, and this representation applies no matter how long the protein sequence is.

How large is this feature vector? Well, it turns out this is going to be a vector in $\mathbb{R}^{20^4}$, so you have a 160,000-dimensional feature vector, which is reasonably large even by modern computer standards. Clearly, we don't want to explicitly represent these high-dimensional feature vectors: imagine you have 1,000 examples and you store this as doubles; even on modern computers, this is big. It turns out, though, that there's an efficient dynamic programming algorithm that can compute inner products between these feature vectors without representing them explicitly, and so we can apply this feature representation, even though it's a ridiculously high-dimensional feature vector, to classify protein sequences. I won't talk about that algorithm here; if any of you have seen dynamic programming algorithms for finding subsequences, it's kind of reminiscent of those, and you can look it up if you're interested.

This is just another example of a cool kernel, and more generally, if you're faced with some new machine learning problem, sometimes you can choose a standard kernel like the Gaussian kernel, and sometimes there are research papers written on how to come up with a new kernel for a new problem.

Two last sentences. Where are we now? That wraps up SVMs, which many people would consider one of the most effective off-the-shelf learning algorithms, and so as of today, you've actually seen a lot of learning algorithms. I want to close this part of the class by saying: congratulations, you're now well qualified to actually go and apply learning algorithms to a lot of problems. We're still in week four of the quarter, so there's more to come.
In particular, what I want to do next is talk about how to really understand the learning algorithms: when they work well and when they work poorly, and how to take the tools you now have and use them really well. We'll start to do that in the next lecture. Thanks.