This presentation is delivered by the Stanford Center for Professional Development.

Okay, so welcome back. What I want to do today is talk about Newton's method, an algorithm for fitting models like logistic regression, and then we'll talk about exponential family distributions and generalized linear models. It's a very nice class of ideas that will tie together the logistic regression and the ordinary least squares models that we've seen. So hopefully I'll get to that today.

Throughout the previous lecture and this lecture, we're starting to use increasingly large amounts of material on probability. So if you'd like a refresher on the foundations of probability - if you're not sure whether you have the prerequisites for this class in terms of a background in probability and statistics - the discussion section taught this week by the TAs will review probability. In the same discussion section the TAs will also briefly go over Matlab and Octave notation, which you'll need for your problem sets. So if any of you want a review of the probability and statistics prerequisites, or a short tutorial on Matlab and Octave, please come to the next discussion section.

All right. So just to recap briefly, towards the end of the last lecture I talked about the logistic regression model, which was an algorithm for classification. We had that P of Y equals one given X, parameterized by theta, under this model is one over one plus e to the minus theta transpose X. And then you can write down the log-likelihood given the training set, which was that. And by taking the derivatives of this, you can derive a gradient ascent rule for finding the maximum likelihood estimate of the parameter theta for this logistic regression model.
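For reference, the formulas being recapped here, written out as they appear in the lecture notes, are the logistic regression hypothesis, its log-likelihood, and the gradient ascent update (the stochastic version applies the update using one training example i at a time):

```latex
h_\theta(x) = P(y=1 \mid x;\theta) = \frac{1}{1+e^{-\theta^{\top}x}}

\ell(\theta) = \sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big]

\theta_j := \theta_j + \alpha\,\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}
```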
And so last time I wrote down the learning rule for batch gradient ascent. The version of stochastic gradient ascent, where you look at just one training example at a time, would be like this, okay. So last time I wrote down batch gradient ascent; this is stochastic gradient ascent.

So if you want to fit a logistic regression model, meaning find the value of theta that maximizes this log likelihood, stochastic gradient ascent or batch gradient ascent is a perfectly fine algorithm to use. But what I want to do is talk about a different algorithm for fitting models like logistic regression, and this is an algorithm that will often run much faster than gradient ascent. This algorithm is called Newton's method.

To describe Newton's method, let me ask you to consider a different problem first, which is: let's say you have a function F of theta, and you want to find the value of theta so that F of theta is equal to zero. We'll start with that problem, and then we'll slowly change it until it becomes an algorithm for fitting maximum likelihood models like logistic regression. So, let's see - I guess that works. Okay, so let's say that's my function F. This is my horizontal axis of theta, this is a plot of F of theta, and we're trying to find the value of theta at which F of theta is equal to zero.

So here's what Newton's method does. I'm going to initialize theta to some value; we'll call it theta superscript zero. Then we're going to evaluate the function F at that value of theta, we'll compute the derivative of F, and we'll use the linear approximation to the function F at that value of theta.
So in particular, I'm going to take the tangent to my function - hope that makes sense - I'm going to take the tangent to my function at that point, theta zero, and I'm going to extend this tangent down until it intercepts the horizontal axis. I want to see what value that is, and I'm going to call it theta one, okay. So that's one iteration of Newton's method. And then I'll do the same thing at this new point: take the tangent down here, and that's two iterations of the algorithm. And then just keep going; that's theta three, and so on, okay.

So let's go ahead and write down what this algorithm actually does. To go from theta zero to theta one, let me call that length capital delta. So if you remember the definition of a derivative, the derivative of F evaluated at theta zero - in other words, the slope of this first line - is by definition equal to this vertical length divided by this horizontal length. The slope of this function is defined as the ratio between this vertical height and this width of the triangle. So that's just equal to F of theta zero divided by delta, which implies that delta is equal to F of theta zero divided by F prime of theta zero, okay.

And so theta one is therefore theta zero minus capital delta, which is therefore just theta zero minus F of theta zero over F prime of theta zero, all right. And more generally, one iteration of Newton's method proceeds as: theta T plus one equals theta T minus F of theta T divided by F prime of theta T.
So that's one iteration of Newton's method. Now, this is an algorithm for finding a value of theta for which F of theta equals zero, and we can apply the same idea to maximizing the log likelihood. So we have a function L of theta, and we want to maximize this function. Well, how do you maximize a function? You set the derivative to zero. So we want theta such that L prime of theta is equal to zero; to maximize this function we want to find the place where the derivative of the function is equal to zero, and so we just apply the same idea. So we get theta T plus one equals theta T minus L prime of theta T over L double prime of theta T, okay. Because to maximize this function, we just let F be equal to L prime - let F be the derivative of L - and then we want to find the value of theta for which the derivative of L is zero, which therefore must be a local optimum.

Does this make sense? Any questions about this?

[Student question, inaudible] The answer to that is fairly complicated. There are conditions on F that would guarantee that this will work, but they are fairly complicated, and that's more complex than I want to go into now. In practice, this works very well for logistic regression and for the generalized linear models I'll talk about later.

[Student question, inaudible] Yeah, it usually doesn't matter. When I implement this, I usually just initialize theta zero to zero - just initialize the parameters to all zeros - and usually this works fine. It's usually not a huge deal how you initialize theta.

[Student question, inaudible] ...or is it just different convergence? Let me say some things about that that will sort of answer it. All of these algorithms tend not to have convergence problems - all of these algorithms will generally converge, unless you choose too large a learning rate for gradient ascent or something. But the speeds of convergence of these algorithms are very different.
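As a concrete illustration of the one-dimensional update just described, here is a minimal sketch in Python; the function names and the toy quadratic example are my own, purely for illustration.

```python
# A minimal sketch of Newton's method for maximizing a one-variable function
# l(theta), using the update  theta := theta - l'(theta) / l''(theta).

def newton_maximize(l_prime, l_double_prime, theta0=0.0, n_iters=10):
    """Find a stationary point of l by running Newton's method on l'(theta) = 0."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - l_prime(theta) / l_double_prime(theta)
    return theta

# Toy example: l(theta) = -(theta - 2)^2 has its maximum at theta = 2,
# and because l is quadratic, Newton's method lands on it in a single step.
theta_star = newton_maximize(lambda t: -2.0 * (t - 2.0), lambda t: -2.0)
print(theta_star)  # 2.0
```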
So it turns out that Newton's method is an algorithm that enjoys extremely fast convergence. The technical term is that it enjoys a property called quadratic convergence. Don't worry if you don't know what that means; stated informally, it means that asymptotically, every iteration of Newton's method will double the number of significant digits to which your solution is accurate, up to constant factors. Suppose that on a certain iteration your solution is within 0.01 of the optimum, so you have 0.01 error. Then after one iteration, your error will be on the order of 0.0001, and after another iteration, your error will be on the order of 0.00000001. So this is called quadratic convergence because you essentially get to square the error on every iteration of Newton's method.

Now, this is an asymptotic result that holds only when you are pretty close to the optimum anyway, so this is the theoretical result that says it's true, but because of constant factors and so on, it may paint a slightly rosier picture than might be accurate. But the fact is, when I implement Newton's method for logistic regression, it usually converges in a dozen iterations or so for most reasonable-size problems of tens to hundreds of features.

So one thing I should talk about: what I wrote down over there was actually Newton's method for the case of theta being a single real number. The generalization of Newton's method to when theta is a vector, rather than just a real number, is the following.
The appropriate generalization is this: theta T plus one equals theta T minus H inverse times the gradient, where this is the usual gradient of your objective, and H is a matrix called the Hessian, which is just the matrix of second derivatives, where H I J equals the second partial derivative of the objective with respect to theta I and theta J, okay.

So where before you had the first derivative divided by the second derivative, now you have a vector of first derivatives times the inverse of the matrix of second derivatives. This is just the same thing generalized to multiple dimensions.

So for logistic regression, again, for a reasonable number of features and training examples, when I run this algorithm you usually see convergence anywhere from a handful to a dozen or so iterations. Compared to gradient ascent, this usually means far fewer iterations to converge. Compared to, let's say, batch gradient ascent, the disadvantage of Newton's method is that on every iteration you need to invert the Hessian. The Hessian will be an N-by-N matrix, or an N plus one by N plus one dimensional matrix, if N is the number of features. And so if you have a large number of features in your learning problem - if you have tens of thousands of features - then inverting H could be a somewhat computationally expensive step. But for smaller, more reasonable numbers of features, this is usually a very fast algorithm.

Question? [A student points out a sign on the board.] Let's see - I think you're right, that should be a minus. Yeah, thanks. Thank you.
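To make the vectorized update concrete, here is a minimal sketch in Python/NumPy of Newton's method for fitting logistic regression; the gradient and Hessian expressions are the standard ones for the log-likelihood above, and the function name and synthetic data are my own, for illustration only.

```python
# A minimal sketch of Newton's method for logistic regression, maximizing the
# log-likelihood l(theta) with the update  theta := theta - H^{-1} * grad l(theta).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iters=10):
    """X: (m, n) design matrix (include a ones column for the intercept); y: (m,) 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                    # predicted P(y = 1 | x)
        grad = X.T @ (y - h)                      # gradient of the log-likelihood
        H = -X.T @ (X * (h * (1 - h))[:, None])   # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad)  # Newton update
    return theta

# Tiny synthetic example (made up): two features plus an intercept.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = (rng.uniform(size=100) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)
print(newton_logistic(X, y))
```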
One other point: I wrote down this algorithm to find the maximum likelihood estimate of the parameters for logistic regression - I wrote it down for maximizing a function. So I'll leave you to think about this yourself: if I wanted to use Newton's method to minimize a function instead, how does the algorithm change? Actually, the answer is that it doesn't change. I'll leave you to work out yourself why, okay. All right.

Let's talk about generalized linear models. Let me just give a recap of both of the algorithms we've talked about so far. We've talked about two different algorithms for modeling P of Y given X, parameterized by theta. In one of them, Y was a real number, and when we assumed that Y given X has a Gaussian distribution, we got linear regression by ordinary least squares. In the other case, we saw that if it's a classification problem where Y takes on a value of either zero or one, then the most natural distribution over zeros and ones is the Bernoulli. The Bernoulli distribution models random variables with two values, and in that case we got logistic regression.

So along the way, one of the questions that came up was: for logistic regression, where on earth did I get the sigmoid function from? There are other functions I could have plugged in, but the sigmoid function turns out to be a natural default choice that led us to logistic regression. And what I want to do now is take both of these algorithms and show that they are special cases of a class of algorithms called generalized linear models, and as part of that, the sigmoid function will fall out very naturally as well. So, let's see - just looking for a longer piece of chalk.
I should warn you, the ideas in generalized linear models are somewhat complex, so what I'm going to do today is try to point out the key ideas and give you a gist of the entire story. Some of the details in the math and the derivations I'll leave you to work through by yourselves in the lecture notes, which are posted online.

So let's start with these two distributions, the Bernoulli and the Gaussian. Suppose we have data that is zero-one valued, and we want to model it with a Bernoulli random variable parameterized by phi. So the Bernoulli distribution has the probability of Y equals one just equal to phi, right; the parameter phi of the Bernoulli specifies the probability of Y being one. Now, as you vary the parameter phi, you get different Bernoulli distributions - as you vary the value of phi you get different probability distributions on Y that have different probabilities of being equal to one. And so I want you to think of this not as one fixed distribution, but as a set - a class of distributions - that you get as you vary phi. In the same way, if you consider the Gaussian distribution, as you vary the mean mu you get different Gaussian distributions. So think of this again as a class, or a set, of distributions.

And what I want to do now is show that both of these are special cases of a class of distributions called the exponential family distributions. In particular, we'll say that a class of distributions, like the class of Bernoulli distributions you get as you vary phi, is in the exponential family if it can be written in the following form: P of Y parameterized by eta is equal to B of Y, times the exponent of eta transpose T of Y, minus A of eta, okay. Let me just give some of these terms names, and then I'll say a bit more about what this means.
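Written out, the exponential family form just described is:

```latex
p(y;\eta) \;=\; b(y)\,\exp\!\big(\eta^{\top} T(y) - a(\eta)\big)
```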
So eta is called the natural parameter of the distribution, and T of Y is called the sufficient statistic. Usually, for many of the examples we'll see, including the Bernoulli and the Gaussian, T of Y is just equal to Y. So for most of this lecture you can mentally replace T of Y with Y - although this won't be true for the very final example we do today - but mentally, think of T of Y as equal to Y.

And so for a given choice of these functions A, B and T - we're going to fix the forms of the functions A, B and T - this formula again defines a set of distributions. It defines a class of distributions that is now parameterized by eta. So again, if we write down specific formulas for A, B and T - choose specific A, B and T - then as I vary eta I get different distributions. And I'm going to show that the Bernoulli and the Gaussian are special cases of exponential family distributions. By that I mean that I can choose specific functions A, B and T so that this becomes the formula for the distributions of either a Bernoulli or a Gaussian. And then, as I vary eta, I'll get Bernoulli distributions with different means, or as I vary eta, I'll get Gaussian distributions with different means, for my fixed values of A, B and T.

And for those of you that know what a sufficient statistic in statistics is, T of Y actually is a sufficient statistic in the formal sense of a sufficient statistic for a probability distribution; you may have seen that in a statistics class. If you don't know what a sufficient statistic is, don't worry about it; we don't really need that property today. Okay. So - oh, one last comment. Often T of Y is equal to Y, and in many of these cases eta is also just a real number.
So in many cases, the parameter of this distribution is just a real number, and eta transpose T of Y is just a product of real numbers. Again, that will be true for our first two examples, but not for the last example I'll do today.

So now we'll show that the Bernoulli and the Gaussian are examples of exponential family distributions. We'll start with the Bernoulli. So for the Bernoulli distribution with parameter phi - I guess I wrote this down already - P of Y equals one, parameterized by phi, is equal to phi. So the parameter phi specifies the probability that Y equals one. And my goal now is to choose A, B and T so that my formula for the exponential family becomes identical to my formula for the distribution of a Bernoulli.

So the probability of Y parameterized by phi is equal to that, all right. And you already saw a similar exponential notation when we talked about logistic regression. The probability of Y being one is phi and the probability of Y being zero is one minus phi, so we can write this compactly as phi to the Y times one minus phi to the one minus Y. I'm going to take the exponent of the log of this - exponentiation and taking the log cancel each other out - and this is equal to e to the, Y times log of phi over one minus phi, plus log of one minus phi. And so log of phi over one minus phi is going to be eta, Y is T of Y, and this last term, log of one minus phi, will be minus A of eta. And then B of Y is just one, so B of Y doesn't matter. Just take a second to look through this and make sure it makes sense. I'll clean another board while you do that.

So now let's write down a few more things. Just copying from the previous board, we had that eta is equal to log of phi over one minus phi. So if I take this formula and invert it - if you solve for phi as a function of eta - just invert this formula.
You find that phi is one over one plus e to the minus eta. And so somehow the logistic function magically falls out of this. We'll take this even further later.

Again, copying definitions from the previous board, A of eta, I said, is minus log of one minus phi. So again, phi and eta are functions of each other - eta depends on phi, and phi depends on eta. So if I plug this definition of phi into that, I find that A of eta is therefore equal to log of one plus e to the eta. And again, this is just algebra; it's not terribly interesting. And just to complete the rest of this, T of Y is equal to Y, and B of Y is equal to one, okay.

So just to recap what we've done: we've come up with a certain choice of functions A, T and B so that my formula for the exponential family distribution now becomes exactly the formula for the probability mass function of the Bernoulli distribution, and the natural parameter eta has a certain relationship to the original parameter phi of the Bernoulli.

Question? [Student question, inaudible] Let's see - the second to last one? Oh, this answer is fine. Okay, let's see. Yeah, so if you expand this term out, one minus Y times log of one minus phi, then one times log of one minus phi becomes this, and the other term is minus Y times log of one minus phi. And the minus of a log is the log of one over whatever is inside, so minus Y times log of one minus phi becomes Y times log of one over one minus phi. Does that make sense? Yeah. Yeah, cool. Anything else? Yes?
[Student question:] Eta is a scalar, isn't it? Up there it's written eta transposed, so it can be a vector?

Yes. Let's see - in this and the next example, eta will turn out to be a scalar. And so if eta is a scalar and T of Y is a scalar, then this is just a real number times a real number; it would be like a one-dimensional vector transposed times a one-dimensional vector, which is just a real number times a real number. Towards the end of today's lecture, we'll go through one example where both of these are vectors, but for the main distributions, these will turn out to be scalars.

[Student question, inaudible - about the formula assigning probability to values other than zero and one.] I see. So, yeah - for this, let's imagine that we're restricting the domain of the input of the function to be Y equals zero or one. Think of that as an implicit constraint: this is a probability mass function for Y equals zero or Y equals one. So - cool.

So this takes the Bernoulli distribution and writes it in the form of an exponential family distribution. Let's do that very quickly for the Gaussian. I won't do the algebra for the Gaussian; I'll basically just write out the answers. So take a normal distribution with mean mu and variance sigma squared. And you remember - was it two lectures ago? Excuse me, no, just the previous lecture - when we were deriving the maximum likelihood estimate for the parameters of ordinary least squares, we showed that the parameter sigma squared didn't matter. When we derived the probabilistic model for least squares, we said that no matter what sigma squared was, we end up with the same value of the parameters.
So for the purposes of today's lecture, and to not have to take sigma squared into account, I'm just going to set sigma squared to be equal to one, okay, so as to not worry about it. The lecture notes talk a little bit more about this, but just to make the math in class a bit easier and simpler today, let's just say that sigma squared equals one; sigma squared is essentially just a scaling factor on the variable Y.

So in that case, the Gaussian density is given by one over root two pi times e to the minus one-half, Y minus mu, squared. And by a couple of steps of algebra, which I'm not going to do here but which are written out in the lecture notes that you can download, this is one over root two pi, e to the minus one-half Y squared, times e to the, mu Y minus one-half mu squared, okay. So I'm just not doing the algebra.

And so that first factor is B of Y; we have eta equal to mu; T of Y equals Y; and A of eta is equal to one-half mu squared - I initially wrote a minus sign there, but it should be a plus; thanks for catching that. And because eta is equal to mu, this is just one-half eta squared, okay.

And so this would be a specific choice, again, of A, B and T that expresses the Gaussian density in the form of an exponential family distribution. And in this case, the relationship between eta and mu is that eta is just equal to mu, so the mean of the Gaussian is just equal to the natural parameter of the exponential family distribution.
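Collecting the two examples just worked through, written out:

```latex
% Bernoulli(\phi):
p(y;\phi) = \phi^{y}(1-\phi)^{1-y}
          = \exp\!\Big( y \log\tfrac{\phi}{1-\phi} + \log(1-\phi) \Big),
\quad \eta = \log\tfrac{\phi}{1-\phi},\ \ T(y)=y,\ \ a(\eta)=\log(1+e^{\eta}),\ \ b(y)=1 .

% Gaussian(\mu) with \sigma^2 = 1:
p(y;\mu) = \tfrac{1}{\sqrt{2\pi}}\,e^{-y^{2}/2}\,\exp\!\big(\mu y - \tfrac{1}{2}\mu^{2}\big),
\quad \eta = \mu,\ \ T(y)=y,\ \ a(\eta)=\tfrac{1}{2}\eta^{2},\ \ b(y)=\tfrac{1}{\sqrt{2\pi}}\,e^{-y^{2}/2} .
```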
All right. And so a result you may have seen in an undergrad statistics class is that most of the "textbook distributions" - not all, but most of them - can be written in the form of an exponential family distribution. So you saw the Gaussian, the normal distribution. It turns out the multivariate normal distribution, which is the generalization of Gaussian random variables to high-dimensional vectors, is also in the exponential family. You saw that the Bernoulli is an exponential family; it turns out the multinomial distribution is too. So the Bernoulli models outcomes over zero and one - coin tosses with two outcomes - and the multinomial models outcomes over K possible values; that's also an exponential family distribution.

You may have heard of the Poisson distribution. The Poisson distribution is often used for modeling counts - things like the number of radioactive decays in a sample, or the number of customers on your website, or the number of visitors arriving in a store. The Poisson distribution is also in the exponential family. So are the gamma and the exponential distributions, if you've heard of them. The gamma and the exponential distributions are distributions over the positive numbers, so they're often used to model intervals: if you're standing at the bus stop and you want to ask, "When is the next bus likely to arrive? How long do I have to wait for my bus?", you'd often model that with a gamma distribution or an exponential distribution. Those are also in the exponential family. Even more esoteric distributions, like the beta and the Dirichlet distributions - these are probability distributions over fractions, or probability distributions over probability distributions - and also things like the Wishart distribution, which is a distribution over covariance matrices. All of these, it turns out, can be written in the form of exponential family distributions. And in the problem set, we'll ask you to take one of these distributions, write it in the form of an exponential family distribution, and derive a generalized linear model for it, okay.
Which brings me to the next topic: having chosen an exponential family distribution, how do you use it to derive a generalized linear model? Generalized linear models are often abbreviated GLMs. And I'm going to write down the three assumptions; you can think of them as assumptions, or you can think of them as design choices, that will then allow me to turn a crank and come up with a generalized linear model.

So the first one is: I'm going to assume that, given my input X and my parameters theta, the variable Y - the output Y, or the response variable Y I'm trying to predict - is distributed exponential family with some natural parameter eta. And so this means that there is some specific choice of those functions A, B and T so that the conditional distribution of Y given X, parameterized by theta, is exponential family with parameter eta, where eta may depend on X in some way. So for example, if you want to predict how many customers will arrive at your website, you may choose to model the number of people - the number of hits on your website - with a Poisson distribution, since the Poisson distribution is natural for modeling count data. And so you may choose the exponential family distribution here to be the Poisson distribution.

The second assumption is that, given X, our goal is to output the expected value of Y given X. So given the features in the website example - say I've given you a set of features about whether there were any promotions, whether there were sales, how many people linked to your website, or whatever - I'm going to assume that our goal in our learning problem is to estimate the expected number of people that will arrive at your website on a given day.
So in other words, you're saying that I want H of X to be equal to - oh, excuse me, I actually meant to write T of Y here - my goal is to get my learning algorithm's hypothesis to output the expected value of T of Y given X. But again, for most of the examples T of Y is just equal to Y, and so for most of the examples our goal is to get our learning algorithm to output the expected value of Y given X, because T of Y is usually equal to Y. Yes? [Student question, inaudible] Yes, same thing, right - T of Y is the sufficient statistic, the same T of Y.

And lastly, this last one I wrote down - those were assumptions; this last one you may want to think of as a design choice. We've assumed that the distribution of Y given X is distributed exponential family with some parameter eta - so the number of visitors on the website on any given day will be Poisson with some parameter - and the last decision I need to make is what the relationship is between my input features and this parameter eta parameterizing my Poisson distribution or whatever. In this last step, I'm going to make the assumption, or really the design choice, that the relationship between eta and my inputs X is linear; in particular, that eta is equal to theta transpose X. And the reason I make this design choice is that it will allow me to turn the crank of the generalized linear model machinery and come up with very nice algorithms for fitting, say, Poisson regression models, or for performing regression with gamma distribution outputs or exponential distribution outputs, and so on.

So let's work through an example. Eta equals theta transpose X works for the case where eta is a real number.
For the more general case, you would have eta I equals theta I transpose X, if eta is a vector rather than a real number. But again, in most of the examples eta will just be a real number. All right.

So let's work through the Bernoulli example. We'll say that Y given X, parameterized by theta, is distributed exponential family with natural parameter eta, and for the Bernoulli distribution I'm going to choose A, B and T to be the specific forms that cause that exponential family to become the Bernoulli distribution. This is the first example we worked through just now.

And we also have that, for any fixed value of X and theta, my hypothesis - my learning algorithm - will make a prediction, will output H of X, which by my second assumption is the expected value of Y given X, parameterized by theta. And when Y can take on only the values zero and one, the expected value of Y is just equal to the probability that Y is equal to one; the expected value of a Bernoulli variable is just the probability that it's equal to one. And the probability that Y equals one is just equal to phi, because that's the parameter of my Bernoulli distribution - phi is, by definition, the probability of my Bernoulli random variable taking the value one. And, as we worked out previously, phi is one over one plus e to the negative eta. We worked this out on the previous board: when we wrote down the Bernoulli distribution in the form of an exponential family, we worked out what the relationship was between phi and eta, and it was this. So the relationship between the expected value of Y and eta is this relationship.
And lastly, because we made the design choice, or the assumption, that eta and theta are linearly related, this is therefore equal to one over one plus e to the minus theta transpose X. And so that's how I come up with the logistic regression algorithm: when you have a variable Y - a Bernoulli random variable, or response variable, Y - that takes on two values, and you then choose to model it with a Bernoulli distribution. Does this make sense? Raise your hand if this makes sense. Yeah, okay, cool.

So I hope you see the ease of use of this, or sort of the power of this. The only decision I made was really this: let's say I have a new machine-learning problem and I'm trying to predict the value of a variable Y that happens to take on two values. Then the only decision I need to make is to choose the Bernoulli distribution - I say that, given X and theta, I'm going to assume Y is distributed Bernoulli. That's the only decision I made, and everything else follows automatically from having made the decision to model Y given X, parameterized by theta, as Bernoulli. In the same way you can choose a different distribution - you can choose Y as Poisson or Y as gamma or Y as whatever - and follow a similar process and come up with a different model and a different learning algorithm: a different generalized linear model for whatever learning problem you're faced with.

One tiny little bit of notation: the function g that relates the natural parameter eta to the expected value of Y - which in this case is one over one plus e to the minus eta - is called the canonical response function, and g inverse is called the canonical link function. These aren't a huge deal; I won't use this terminology a lot.
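To make the canonical response function concrete, here is a minimal sketch of the Bernoulli case in Python. It is only an illustration, not something written in lecture, and the function names are mine:

    import numpy as np

    def g(eta):
        # Canonical response function for the Bernoulli family:
        # it maps the natural parameter eta to E[y | x] = phi.
        return 1.0 / (1.0 + np.exp(-eta))

    def h(theta, x):
        # With the GLM design choice eta = theta' x, the hypothesis is
        # h(x) = g(theta' x) = 1 / (1 + exp(-theta' x)),
        # which is exactly the logistic regression hypothesis above.
        return g(theta @ x)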
I'm just mentioning those two terms in case you hear people talk about generalized linear models, and they talk about canonical response functions or canonical link functions, so you know what they refer to. Actually, many texts use the names the other way around - they would call this g inverse and this g - but this notation turns out to be more consistent with other notation used in machine learning, so I'm going to use it. I probably won't use the terms canonical response function and canonical link function a lot in lecture; I'm not big on memorizing lots of names of things. I'm just tossing those out there in case you see them elsewhere. Okay.

You know what, I think in the interest of time I'm going to skip over the Gaussian example. But again, just like I said: if Y is Bernoulli, the variant I get is logistic regression; you can do the same thing with the Gaussian distribution and end up with the ordinary least squares model. The problem with the Gaussian is that it's almost so simple that, when you see it for the first time, it's sometimes more confusing than the [inaudible] model, because it looks so simple it seems like it has to be more complicated. So let me skip that and leave you to read about the Gaussian example in the lecture notes. What I want to do instead is go through a more complex example.

Question? [Student question, inaudible.] Okay, right - so how do you choose what theta will be? We'll get to that at the end. What you have there is the logistic regression model, which is a probabilistic model that assumes the probability of Y given X is given by a certain form. And so what you do is write down the log likelihood of your training set and find the value of theta that maximizes the log likelihood of the parameters. Does that make sense? I'll say that again towards the end of today's lecture.
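As a concrete illustration of that answer - write down the log likelihood of the training set and climb it in theta - here is a small sketch. It assumes a design matrix X of shape m by n, labels y in {0, 1}, and an arbitrary learning rate alpha, none of which come from the lecture itself:

    import numpy as np

    def log_likelihood(theta, X, y):
        # l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

    def gradient_ascent_step(theta, X, y, alpha=0.1):
        # One batch gradient ascent update on l(theta); the gradient is X' (y - h).
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        return theta + alpha * (X.T @ (y - h))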
But for logistic regression, the way you choose theta is exactly maximum likelihood, as we worked out in the previous lecture, using Newton's Method or gradient ascent or whatever. I'll try to do that again for one more example towards the end of today's lecture.

So what I want to do is use the remaining, I don't know, 19 minutes or so of this class to go through what's probably the most complex example of a generalized linear model that I've used. I want to go through this one because it's a little bit trickier than many of the other textbook examples of generalized linear models. Again, I'm going to go through the derivation reasonably quickly and give you the gist of it, and if there are steps I skip or details omitted, I'll leave you to read about them more carefully in the lecture notes.

What I want to talk about is the multinomial. The multinomial is the distribution over K possible outcomes. Imagine you're now in a machine-learning problem where the value of Y that you're trying to predict can take on K possible outcomes rather than only two. For example, you might want a learning algorithm that magically sorts your emails into the right email folder, and you may have a dozen email folders you want your algorithm to classify emails into. Or take predicting whether a patient has a disease or does not have a disease, which would be a binary classification problem; if instead you think the patient may have one of K diseases, you want the learning algorithm to figure out which one of the K diseases your patient has. So there are lots of multi-class classification problems where you have more than two classes, and you model those with the multinomial.

So for logistic regression, I had pictures like these where you have a training set and you find a decision boundary that separates the two classes.
For the multinomial, we're going to entertain the value we're predicting taking on multiple values, so you now have, say, three classes, and the learning algorithm will learn some way to separate out three classes or more, rather than just two.

So let's write the multinomial in the form of an exponential family distribution. The parameters of a multinomial are phi one, phi two, up to phi K - I'll actually change this in a second - where the probability that Y equals i is phi i, because there are K possible outcomes. But if I choose this as my parameterization of the multinomial, then my parameters are actually redundant, because these are probabilities and so they have to sum to one. Therefore, for example, I can derive the last parameter, phi K, as one minus phi one minus ... minus phi K minus one. So this would be a redundant parameterization; the multinomial would be over-parameterized. For the purposes of this derivation, I'm therefore going to treat the parameters of my multinomial as phi one, phi two, up to phi K minus one, and I won't think of phi K as a parameter - I just have K minus one parameters parameterizing my multinomial. I'll sometimes still write phi K in my derivations, but you should think of phi K as just a shorthand for one minus the rest of the parameters, okay.

It turns out the multinomial is one of the few examples where T of Y is not equal to Y. In this case, Y is one of K possible values, and T of Y is defined as follows: T of one is going to be a vector with a one in the first position and zeros everywhere else; T of two is going to be zero, one, zero and so on - except that these are going to be K minus one-dimensional vectors.
And so T of K minus one is going to be zero, zero, ..., zero, one, and T of K is going to be the vector of all zeros. This is just how I'm choosing to define T of Y in order to write down the multinomial in the form of an exponential family distribution. Again, these are K minus one-dimensional vectors.

This is a good point to introduce one more useful piece of notation, called indicator function notation. I'm going to write a one and then curly braces; if I write a true statement inside, then the indicator of that statement is one, and if I write a false statement inside, then the value of the indicator function is zero. For example, if I write the indicator of two equals three, that's a false statement, and so this is equal to zero; whereas for the indicator of one plus one equals two, I wrote down a true statement inside, and so the indicator of the statement is equal to one. The indicator function is just a very useful notation for expressing the truth or falsehood of the statement inside.

And so, to combine both of these - let me carve out a bit of space here - T of Y is a vector; Y is one of K values, and so T of Y is one of these K vectors. If I use T of Y subscript i to denote the i-th element of the vector T of Y, then the i-th element of the vector T of Y is just equal to the indicator of whether Y is equal to i. Let me clean a couple more boards; take a look at this for a second and make sure you understand all that notation and why this is true. All right - actually, raise your hand if this equation makes sense to you. Most of you, not all, okay.
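As a tiny illustration of that definition (the function name and the 1-through-K numbering of the classes are my own conventions, not the lecture's):

    import numpy as np

    def T(y, K):
        # T(y) is a (K-1)-dimensional vector whose i-th entry is the
        # indicator 1{y == i}; T(K) is the all-zeros vector.
        t = np.zeros(K - 1)
        if y < K:
            t[y - 1] = 1.0
        return t

For instance, with K = 4, T(2, 4) is [0, 1, 0] and T(4, 4) is [0, 0, 0].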
Just as one kind of sanity check: suppose Y is equal to one. Then T of Y is equal to this vector, and therefore the first element of this vector will be one and the rest of the elements will be equal to zero. And so - let me try that again, sorry. Say I want to look at the i-th element of the vector T of Y, and I want to know whether it's one or zero. Well, the i-th element of the vector T of Y will be equal to one if, and only if, Y is equal to i. Because, for example, if Y is equal to one, then only the first element of this vector will be one; if Y is equal to two, then only the second element of the vector will be one, and so on. So the question of whether the i-th element of this vector T of Y is equal to one is answered by just asking whether Y is equal to i. Okay. If you're still not quite sure why that's true, go home and think about it a bit more, and take a look at the lecture notes as well - maybe that'll help. For now, just take my word for it.

So let's go ahead and write out the distribution for the multinomial in exponential family form. P of Y is equal to phi one raised to the indicator that Y equals one, times phi two raised to the indicator that Y equals two, and so on, up to phi K raised to the indicator that Y equals K. And again, phi K is not a parameter of the distribution; phi K is a shorthand for one minus phi one minus phi two minus the rest. And so, using the equation on the left as well, I can also write this as phi one to the T of Y sub one, times phi two to the T of Y sub two, dot dot dot, times phi K minus one to the T of Y sub K minus one, times phi K raised to the power one minus the sum from i equals one up to K minus one of T of Y sub i.
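Written out, the two equivalent forms just described are (a reconstruction in standard notation, not a verbatim copy of the board):

    \[
    p(y) \;=\; \phi_1^{\,1\{y=1\}}\,\phi_2^{\,1\{y=2\}}\cdots\phi_K^{\,1\{y=K\}}
    \;=\; \phi_1^{(T(y))_1}\,\phi_2^{(T(y))_2}\cdots\phi_{K-1}^{(T(y))_{K-1}}\,
    \phi_K^{\,1-\sum_{i=1}^{K-1}(T(y))_i}.
    \]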
And it turns out - it takes a few steps of algebra that I don't have time to show - you can simplify this into the exponential family form, b of y times exp of eta transpose T of Y minus a of eta, where eta is a K minus one-dimensional vector whose i-th element works out to be the log of phi i over phi K. Deriving this is a few steps of algebra that you can work out yourself, but I won't do it here. So, using my definition for T of Y, and by choosing eta, a and b this way, I can take my multinomial distribution and write it out in the form of an exponential family distribution.

It turns out also that - let's see - one of the things we did earlier was we had eta as a function of phi, and then we inverted that to write out phi as a function of eta. You can do that here as well. This defines eta as a function of the multinomial distribution's parameters phi, and you can take this relationship between eta and phi and invert it, writing out phi as a function of eta. It turns out you get that phi i is equal to e to the eta i, divided by one plus the sum over j from one to K minus one of e to the eta j. The way you do this is that this defines eta as a function of the phi's, so if you take it and solve for phi, you end up with this - again, there are a couple of steps of algebra that I'm just not showing.

And then lastly, using our assumption that the eta i's are a linear function of the inputs X, phi i is therefore equal to e to the theta i transpose X, divided by one plus the sum over j equals one to K minus one of e to the theta j transpose X. This is just using the fact that eta i equals theta i transpose X, which was our earlier design choice from generalized linear models. So we're just about done.
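Since that last formula is the heart of the model, here is a minimal sketch of it. Theta is assumed to be a (K-1) by n array whose rows are theta_1 through theta_{K-1}; that layout is my own convention for this illustration:

    import numpy as np

    def class_probabilities(Theta, x):
        # phi_i = exp(theta_i' x) / (1 + sum_j exp(theta_j' x)) for i = 1..K-1,
        # and phi_K = 1 / (1 + sum_j exp(theta_j' x)).
        scores = np.exp(Theta @ x)                       # length K-1
        denom = 1.0 + np.sum(scores)
        return np.append(scores / denom, 1.0 / denom)    # length K, sums to 1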
So, my learning algorithm: I'm going to think of it as outputting the expected value of T of Y given X, parameterized by theta. T of Y was this vector of indicator functions - its first element was the indicator that Y equals one, down to the indicator that Y equals K minus one - so I want my learning algorithm to output this expected value of this vector of indicator functions.

The expected value of the indicator that Y equals one is just the probability that Y equals one, which is given by phi one: I have a random variable that's one whenever Y is equal to one and zero otherwise, so the expected value of that indicator is just the probability that Y equals one, which is phi one. And therefore, by what we worked out earlier, this is e to the theta one transpose X over one plus the sum of e to the theta j transpose X, and similarly for the other elements. So my learning algorithm will output the probability that Y equals one, Y equals two, up to Y equals K minus one, and these probabilities are going to be parameterized by functions like these.

Just to give this algorithm a name, it is called softmax regression, and it is widely thought of as the generalization of logistic regression - which is regression for two classes - to the case of K classes rather than two.

So, to be very concrete about what you do - say you have a machine-learning problem and you want to apply softmax regression to it; I think the question earlier was about how to fit the parameters. Let's say you have a machine-learning problem where Y takes on one of K classes. What you do is sit down and say, "Okay, I want to model Y as being multinomial given X and theta." So you choose the multinomial as the exponential family, and then you sort of turn the crank.
And everything else I wrote down follows automatically from having made the choice of the multinomial distribution as your choice of exponential family. Then what you do is take your training set, (x one, y one) up to (x m, y m), where now the value of each y takes on one of K possible values, and you find the parameters of the model by maximum likelihood: you write down the likelihood of the parameters and you maximize it.

So what's the likelihood? Well, the likelihood, as usual, is the product over your training set of P of y i given x i, parameterized by theta - same as we had before. And that's the product over your training set of - let me write these down now - phi one raised to the indicator that y i equals one, times phi two raised to the indicator that y i equals two, dot dot dot, up to phi K raised to the indicator that y i equals K. Where, for example, phi one depends on theta through the formula we had just now: it is e to the theta one transpose x over one plus the sum over j of e to the theta j transpose x. So phi one here is really a shorthand for that formula, and similarly for phi two and so on, up to phi K, where phi K is one minus all of the others.

So this formula looks more complicated than it really is. What you really do is write this down, take logs, compute the derivative of the result with respect to theta, and apply, say, gradient ascent to maximize the likelihood.

[Student:] What are the rows of theta? Theta has just been a vector, right? And now it looks like it's two-dimensional. Yeah - in the notation I've been using, I have theta one through theta K minus one.
I've been thinking of each of these as an n plus one-dimensional vector: if x is n plus one-dimensional, then you have a set of parameters comprising K minus one such vectors. You could group all of these together into a matrix, but I just haven't been doing that, so you take derivatives with respect to K minus one parameter vectors.

[Student:] [Inaudible] - what do they correspond to?

We're sort of out of time, so let me take that offline. It's hard to answer in the same way as the question of what theta corresponds to in logistic regression - you can sort of answer that as - [Student:] Yeah, it's kind of like the [inaudible] feature. - Yeah, a sort of similar interpretation, yeah. That's good. I think I'm running a little bit late, so why don't we officially close for the day, but you can come up if you have more questions and we'll take them offline. Thanks.
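For reference, here is a small sketch of the fitting step described above: the log likelihood of the softmax model and a plain gradient ascent update. It is only an illustration under assumptions of my own (Theta is a (K-1) by n array of parameter vectors, X is m by n, and the labels y are integers 1 through K); it is not code from the course:

    import numpy as np

    def probs(Theta, X):
        # Row i holds phi_1 ... phi_K for example x_i.
        scores = np.exp(X @ Theta.T)                      # m x (K-1)
        denom = 1.0 + scores.sum(axis=1, keepdims=True)   # m x 1
        return np.hstack([scores / denom, 1.0 / denom])   # m x K

    def log_likelihood(Theta, X, y):
        # l(Theta) = sum_i log phi_{y_i}(x_i), i.e. the log of the product of
        # per-example probabilities written out above.
        p = probs(Theta, X)
        return np.sum(np.log(p[np.arange(len(y)), np.asarray(y) - 1]))

    def gradient_ascent_step(Theta, X, y, alpha=0.01):
        # The gradient with respect to theta_i is the sum over examples of
        # (1{y = i} - phi_i) x, which is what "take logs and differentiate" gives.
        p = probs(Theta, X)
        K = p.shape[1]
        T = np.eye(K)[np.asarray(y) - 1][:, :K - 1]       # m x (K-1) indicator vectors
        grad = (T - p[:, :K - 1]).T @ X                   # (K-1) x n
        return Theta + alpha * grad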