This presentation is delivered by the Stanford Center for Professional Development.

OK. So let's get started with today's material. So welcome back to the second lecture. What I want to do today is talk about linear regression, gradient descent, and the normal equations. And I should also say, lecture notes have been posted online, and so if some of the math I go over today I go over rather quickly, if you want to see every equation written out and work through the details more slowly yourself, go to the course homepage and download detailed lecture notes that pretty much describe all the mathematical, technical content I'm going to go over today.

Today I'm also going to delve into a fair amount - some amount of linear algebra, and so if you would like to see a refresher on linear algebra, this week's discussion section will be taught by the TAs and will be a refresher on linear algebra. So if some of the linear algebra I talk about today sort of seems to be going by pretty quickly, or if you just want to see some of the things I'm claiming today without proof - if you want to just see some of those things written out in detail - you can come to this week's discussion section.

So I just want to start by showing you a fun video. Remember, at the last lecture, the initial lecture, I talked about supervised learning. And supervised learning was this machine-learning problem where I said we're going to tell the algorithm what the "right" answer is for a number of examples, and then we want the algorithm to replicate more of the same. So the example I had at the first lecture was the problem of predicting housing prices, where you may have a training set, and we tell the algorithm what the "right" housing price was for every house in the training set. And then you want the algorithm to learn the relationship between sizes of houses and the prices, and essentially produce more of the "right" answers.

So let me show you a video now. Load the big screen, please. So I'll show you a video now from Dean Pomerleau, on some work he did at Carnegie Mellon applying supervised learning to get a car to drive itself. This is work on a vehicle known as Alvin. It was done sort of about 15 years ago, and I think
it was a very elegant example of the sorts of things you can get supervised learning algorithms to do. On the video, you hear Dean Pomerleau's voice mention an algorithm called a neural network. I'll say a little bit about that later, but the essential learning algorithm for this is something called gradient descent, which I will talk about later in today's lecture. Let's watch ... Let's watch the video.

[Video plays: Alvin is a system of artificial neural networks that learns to steer by watching a person drive. Alvin is designed to control the [inaudible], a modified army [inaudible] equipped with sensors, computers, and actuators for autonomous navigation experiments. The initial step in configuring Alvin is training the networks to steer. During training, a person drives the vehicle while Alvin watches. Once every two seconds, Alvin digitizes a video image of the road ahead and records the person's steering direction. This training image is reduced in resolution to 30 by 32 pixels and provided as input to Alvin's three-layer network.]

*Instructor (Andrew Ng)*: So two comments, right. One is, this is supervised learning, because it's learning from a human driver, in which a human driver shows that, when we're on this segment of the road, I will steer at this angle; on this segment of the road, I'll steer at this angle. And so the human provides a number of "correct" steering directions to the car, and then it's the job of the car to try to learn to produce more of these "correct" steering directions that keep the car on the road.

On the monitor display up here, I just want to tell you a little bit about what this display means. So on the upper left, where the mouse pointer is moving, this horizontal line actually shows the human steering direction, and this white bar, or this white area right here, shows the steering direction chosen by the human driver by moving the steering wheel. The human is steering a little bit to the left here, indicated by the position of this white region.
This second line here, where the mouse is pointing - the second line here is the output of the learning algorithm, and where the learning algorithm currently thinks is the right steering direction. And right now what you're seeing is the learning algorithm just at the very beginning of training, and so it has just no idea of where to steer. And so its output is this little white smear over the entire range of steering directions. And as the algorithm collects more examples and learns over time, you see it start to more confidently choose a steering direction. So let's keep watching the video.

[Video plays: Using the backpropagation learning algorithm, Alvin is trained to output the same steering direction as the human driver for that image. Initially, the network's steering response is random. After about two minutes of training, the network learns to accurately imitate the steering reactions of the human driver. The same training procedure is repeated for other road types. After the networks have been trained, the operator pushes the run switch, and Alvin begins driving. Twelve times per second, Alvin digitizes an image and feeds it to the neural networks. Each network, running in parallel, produces a steering direction and a measure of its confidence in its response. The steering direction from the most confident network - in this case, the network trained for the one-lane road - is used to control the vehicle. Suddenly, an intersection appears ahead of the vehicle. As the vehicle approaches the intersection, the confidence of the one-lane network decreases. As it crosses the intersection and the two-lane road ahead comes into view, the confidence of the two-lane network rises. When its confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road.]
*Instructor (Andrew Ng)*: All right, so who thought driving could be that dramatic, right? Switch back to the chalkboard, please.

I should say, this work was done about 15 years ago, and autonomous driving has come a long way. So many of you will have heard of the DARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, led the winning team that drove a car across a desert by itself. So Alvin was, I think, absolutely amazing work for its time, but autonomous driving has obviously come a long way since then.

So what you just saw was an example, again, of supervised learning, and in particular it was an example of what we call the regression problem, because the vehicle is trying to predict a continuous-valued variable, a continuous-valued steering direction; we call that a regression problem. And what I want to do today is talk about our first supervised learning algorithm, and it will also be applied to a regression task.

So for the running example that I'm going to use throughout today's lecture, we're going to return to the example of trying to predict housing prices. So here's actually a dataset collected by a TA, Dan Ramage, on housing prices in Portland, Oregon. So here's a dataset of a number of houses of different sizes, and here are their asking prices in thousands of dollars, $200,000.

And so we can take this data and plot it, square feet versus price, and so you may get a dataset like that. And the question is, given a dataset like this, or given what we call a training set like this, how do you learn to predict the relationship between the size of the house and the price of the house?

So I'm actually going to come back and modify this task a little bit more later, but let me go ahead and introduce some notation, which I'll be using, actually, throughout the rest of this course.
The first piece of notation is, I'm going to let the lowercase alphabet m denote the number of training examples, and that just means the number of rows, or the number of examples, houses and prices, we have. And in this particular dataset we have, as it happens, 47 training examples, although I wrote down only five. Okay, so throughout this quarter I'm going to use the alphabet m to denote the number of training examples.

I'm going to use the lowercase alphabet x to denote the input variables, which I'll often also call the features. And so in this case, x would denote the size of the house we're looking at.

I'm going to use y to denote the "output" variable, which is sometimes also called the target ... target variable, and so one pair, (x, y), is what comprises one training example. In other words, one row of the table I drew just now would be what I call one training example, and the i-th training example, in other words the i-th row in that table, I'm going to write as (x^(i), y^(i)). Okay, and so in this notation I'm going to use, this superscript i is not exponentiation. So this is not x to the power of i, y to the power of i. In this notation, the superscript i in parentheses is just sort of an index into the i-th row of my list of training examples.

So in supervised learning, this is what we're going to do. We're given a training set, and we're going to feed our training set, comprising our m training examples, so 47 training examples, into a learning algorithm. Okay, and our learning algorithm then outputs a function that, by tradition and for historical reasons, is usually denoted lowercase alphabet h, and is called a hypothesis. Don't worry too much about whether the term hypothesis has a deep meaning.
It's more a term that's used for historical reasons. And the hypothesis's job is to take this input - there's some new house whose price you want to estimate. What the hypothesis does is it takes this input, the living area of a new house in square feet, say, and it outputs an estimate of the price of this house. So the hypothesis h maps from inputs x to outputs y.

So in order to design a learning algorithm, the first thing we have to decide is how we want to represent the hypothesis, right. And just for the purposes of this lecture, for the purposes of our first learning algorithm, I'm going to use a linear representation for the hypothesis. So I'm going to represent my hypothesis as h of x equals theta zero plus theta one x, where x here is an input feature, and so that's the size of the house we're considering.

And more generally - I'll come back to this - more generally, for many regression problems we may have more than one input feature. So for example, if instead of just knowing the size of the houses, we know also the number of bedrooms in these houses, let's say - so if our training set also has a second feature, the number of bedrooms in the house - then you may let, say, x_1 denote the size in square feet, and let x_2 denote the number of bedrooms, and then I would write the hypothesis, h of x, as theta zero plus theta_1 x_1 plus theta_2 x_2. Okay.

And sometimes when I want to take the hypothesis h and make its dependence on the thetas explicit, I'll sometimes write this as h subscript theta of x. And so this is the price that my hypothesis predicts a house with features x costs. So given a new house with features x, a certain size and a certain number of bedrooms, this is going to be the price that my hypothesis predicts this house is going to cost.
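For readers following along in the notes, here is the hypothesis just described, written out in symbols exactly as spoken (x_1 is the living area in square feet and x_2 is the number of bedrooms):

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
```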
251 00:18:08,200 --> 00:18:13,000 One final piece of notation, 252 00:18:13,700 --> 00:18:17,825 so for conciseness, 253 00:18:18,475 --> 00:18:22,075 254 00:18:23,225 --> 00:18:29,525 just to write this a bit more compactly I'm going to take the convention of defining x_0 to be equal to one, 255 00:18:30,275 --> 00:18:35,075 and so I can now write H of X to be equal to 256 00:18:35,900 --> 00:18:38,575 sum from i equals one to two 257 00:18:39,300 --> 00:18:43,775 of theta_i, oh sorry, zero to two, theta_i, x_i. 258 00:18:44,275 --> 00:18:48,875 And if you think of theta as an x, as vectors, then this is just 259 00:18:49,700 --> 00:18:50,500 "theta transpose x." 260 00:18:51,625 --> 00:18:53,075 And 261 00:18:54,000 --> 00:18:58,125 the very final piece of notation is I'm also going to let 262 00:18:58,725 --> 00:19:03,950 lower case n define ... 263 00:19:05,550 --> 00:19:10,875 lower case n be the number of features in my learning problem. And so this actually becomes a sum ... 264 00:19:12,625 --> 00:19:15,425 265 00:19:16,250 --> 00:19:21,825 a sum from I equals zero to N, where in this example if you have two features, "n" would be equal to two. 266 00:19:25,875 --> 00:19:27,025 All right, 267 00:19:28,075 --> 00:19:30,625 I realize that was a fair amount of notation, 268 00:19:31,475 --> 00:19:33,425 and as I proceed through 269 00:19:33,975 --> 00:19:36,850 the rest of the lecture today, or in future weeks as well, 270 00:19:37,350 --> 00:19:40,475 if some day you're looking at me write a symbol and you're wondering, 271 00:19:40,700 --> 00:19:45,075 gee, what was that [simple|symbol] lower case N again? Or what was that lower case X again, or whatever, 272 00:19:45,450 --> 00:19:47,150 please raise hand and I'll answer. This is 273 00:19:48,100 --> 00:19:51,225 a fair amount of notation. We'll probably all get used to it 274 00:19:51,725 --> 00:19:57,150 in a few days and we'll standardize notation and make a lot of our descriptions of learning algorithms a lot easier. 275 00:19:58,850 --> 00:20:05,525 But again, if you see me write some symbol and you don't quite remember what it means, chances are there are others in this class who've forgotten too. So please raise your hand and 276 00:20:07,050 --> 00:20:07,975 ask if you're ever wondering what some symbol means. 277 00:20:09,550 --> 00:20:11,525 Any questions you have about any of this? 278 00:20:19,025 --> 00:20:21,925 *Student:*The variable can be anything? [Inaudible]? 279 00:20:22,900 --> 00:20:27,025 *Instructor (Andrew Ng)*:Say that again. *Student:*[inaudible] zero theta one? 280 00:20:27,475 --> 00:20:30,350 *Instructor (Andrew Ng)*:Right, so, well let me 281 00:20:30,750 --> 00:20:34,275 282 00:20:34,600 --> 00:20:40,250 this was going to be next, but the theta or the theta i's are called the parameters. 283 00:20:41,000 --> 00:20:41,550 284 00:20:43,300 --> 00:20:45,175 285 00:20:45,775 --> 00:20:51,350 The thetas are called the parameters of our learning algorithm and theta zero, theta one, theta two are just real numbers. 286 00:20:52,225 --> 00:20:54,725 And then it is the job of the learning algorithm 287 00:20:55,525 --> 00:20:57,025 to use the training set 288 00:20:57,750 --> 00:21:00,650 to choose or to learn appropriate parameters theta. Okay, is there other questions? *Student:*What does [inaudible]? 289 00:21:06,550 --> 00:21:07,575 *Instructor (Andrew Ng)*:Oh, transpose. 290 00:21:08,250 --> 00:21:11,500 Oh yeah, sorry. 
When [inaudible] theta and theta transpose x, theta [inaudible].

*Student:* Is this like a [inaudible] hypothesis [inaudible], or would you have higher orders? Or would theta [inaudible]?

*Instructor (Andrew Ng)*: All great questions. So the question was, is this a typical hypothesis, or can theta be a function of other variables, and so on. And the answer is, sort of, yes. For now, just for this first learning algorithm, we'll talk about using a linear hypothesis class. A little bit later this quarter, actually, we'll talk about much more complicated hypothesis classes, and we'll actually talk about higher-order functions as well, a little bit later today.

Okay, so for the learning problem, then: how do we choose the parameters theta so that our hypothesis h will make accurate predictions about all the houses? All right, so one reasonable thing to do seems to be - well, we have a training set, and just on the training set our hypothesis will make some predictions, predictions of the housing prices, of the prices of the houses in the training set. So one thing we could do is just try to make the predictions of the learning algorithm accurate on the training set. So given some features x, and some correct prices y, we might want to make the squared difference between the prediction of the algorithm and the actual price small.

So to choose parameters theta, let's say we want to minimize, over the parameters theta, the squared error between the predicted price and the actual price. And so I'm going to fill this in. We have m training examples. So the sum from i equals one through m, over my m training examples, of the price predicted on the i-th house in my training set, minus the actual target variable, the actual price, on the i-th training example. And by convention, instead of minimizing this sum of the squared differences, I'm just going to put a one-half there, which will simplify some of the math we do later.

Okay, and so let me go ahead and define J of theta to be equal to just the same: one-half, sum from i equals one through m, my number of training examples, of the value predicted by my hypothesis minus the actual value.
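Written out in symbols, the cost function just defined on the board is the following (this is simply a restatement of the spoken definition):

```latex
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```

The factor of one-half is the convention mentioned above; it cancels the 2 that comes down when the square is differentiated later in the lecture.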
And so what we'll do, let's say, is minimize, as a function of the parameters theta, this quantity J of theta.

I should say, to those of you who have taken sort of linear algebra classes, or maybe basic statistics classes, some of you may have seen things like these before and seen least squares regression or least squares. Many of you will not have seen this before. I think some of you may have seen it before, but either way, regardless of whether you've seen it before, let's keep going. Just for those of you that have seen it before, I should say eventually we'll actually show that this algorithm is a special case of a much broader class of algorithms. But let's keep going. We'll get there eventually.

So I'm going to talk about a couple of different algorithms for performing that minimization over theta of J of theta. The first algorithm I'm going to talk about is a search algorithm, where the basic idea is we'll start with some value of my parameter vector theta. Maybe initialize my parameter vector theta to be the vector of all zeros - and, excuse me, I have to correct that: I sort of write zero with an arrow on top to denote the vector of all zeros. And then I'm going to keep changing my parameter vector theta to reduce J of theta a little bit, until we hopefully end up at the minimum with respect to theta of J of theta.

So switch to the laptops, please, and lower the big screen. So let me go ahead and show you an animation of this first algorithm for minimizing J of theta, which is an algorithm called gradient descent.

So here's the idea. You see on the display a plot, and the axes, the horizontal axes, are theta zero and theta one. We want to minimize J of theta, which is represented by the height of this plot. So the surface represents the function J of theta, and the axes of this function, or the inputs of this function, are the parameters theta zero and theta one, written down here below.

So here's the gradient descent algorithm. I'm going to choose some initial point. It could be the vector of all zeros or some randomly chosen point.
Let's say we start from that point denoted by the star, by the cross, and now I want you to imagine that this display actually shows a 3D landscape. Imagine you're all in a hilly park or something, and this is the 3D shape of, like, a hill in some park. So imagine you're actually standing physically at the position of that star, of that cross, and imagine you can stand on that hill, right, and look all 360 degrees around you and ask, if I were to take a small step, what would allow me to go downhill the most? Okay, just imagine that this is physically a hill and you're standing there, and you would look around and ask, "If I take a small step, what is the direction of steepest descent, that would take me downhill as quickly as possible?"

So the gradient descent algorithm does exactly that. I'm going to take a small step in this direction of steepest descent, or the direction that the gradient turns out to be. And then you take a small step and you end up at a new point, shown there, and you keep going. You're now at a new point on this hill, and again you're going to look around you, look all 360 degrees around you, and ask, "What is the direction that would take me downhill as quickly as possible?" And we want to go downhill as quickly as possible, because we want to find the minimum of J of theta. So you do that again. You can take another step, okay, and you sort of keep going until you end up at a local minimum of this function J of theta.

One property of gradient descent is where you end up - in this case, we ended up at this point on the lower left hand corner of this plot. But let's try running gradient descent again from a different position. So that was where I started gradient descent just now. Let's rerun gradient descent, but using a slightly different initial starting point, so a point slightly further up and further to the right.

So it turns out, if you run gradient descent from that point, then if you take a steepest descent direction again, that's your first step. And if you keep going, it turns out that with a slightly different initial starting point, you can actually end up at a completely different local optimum. Okay, so this is a property of gradient descent, and we'll come back to it in a second.
So be aware that gradient descent can sometimes depend on where you initialize your parameters, theta zero and theta one. Switch back to the chalkboard, please. Let's go ahead and work out the math of the gradient descent algorithm, and then we'll come back and revisit this issue of local optima.

So here's the gradient descent algorithm. We're going to repeatedly take a step in the direction of steepest descent, and it turns out that you can write that as follows: we're going to update the parameter theta_i as theta_i minus alpha times the partial derivative with respect to theta_i of J of theta. Okay, so this is how we're going to update the i-th parameter, theta_i, on each iteration of gradient descent.

Just a point of notation: I use this colon-equals notation to denote setting the variable on the left hand side equal to the value on the right hand side. All right, so if I write A colon-equals B, then what I'm saying is, this is part of a computer program, or this is part of an algorithm, where we take the value of B, the value on the right hand side, and use that to overwrite the value on the left hand side. In contrast, if I write A equals B, then this is an assertion of truth. I'm claiming that the value of A is equal to the value of B, whereas the colon-equals is a computer operation where we overwrite the value of A. If I write A equals B, then I'm asserting that the values of A and B are equal.

So let's see, this algorithm sort of makes sense - well, actually, let's just move on. Let's just go ahead and take this algorithm and apply it to our problem. And to work out gradient descent, let's take gradient descent and just apply it to our problem, and this being the first somewhat mathematical lecture, I'm going to step through derivations much more slowly and carefully than I will later in this quarter. We'll work through the steps of these in much more detail than I will later in this quarter. Let's actually work out what this gradient descent rule is.

So - and I'll do this just for the case where we have only one training example. Okay, so in this case we need to work out what the partial derivative, with respect to the parameter theta_i, of J of theta is.
440 00:32:30,750 --> 00:32:35,775 If we have only one training example 441 00:32:36,500 --> 00:32:40,775 then J of theta is going to be one-half of subscript theta, 442 00:32:41,275 --> 00:32:42,950 of X minus Y, script. 443 00:32:43,500 --> 00:32:47,200 So if you have only one training example comprising one pair, 444 00:32:47,775 --> 00:32:50,375 X, Y, then this is what J of theta is going to be. 445 00:32:51,475 --> 00:32:54,375 And so taking derivatives, 446 00:32:55,050 --> 00:32:56,275 you have one-half 447 00:32:56,725 --> 00:32:58,900 something squared. So the two comes down. 448 00:32:59,325 --> 00:33:04,075 So you have two times one-half times theta of X minus Y, 449 00:33:05,025 --> 00:33:06,425 and then by the 450 00:33:07,275 --> 00:33:10,175 [inaudible] derivatives, we also 451 00:33:10,900 --> 00:33:14,700 must apply this by the derivative of what's inside the square. 452 00:33:15,425 --> 00:33:18,700 453 00:33:19,475 --> 00:33:20,725 Right, 454 00:33:21,525 --> 00:33:26,325 the two and the one-half cancel. 455 00:33:28,500 --> 00:33:34,150 So this leaves [inaudible] times that, theta zero, X zero plus 456 00:33:35,475 --> 00:33:39,350 [inaudible]. 457 00:33:40,625 --> 00:33:44,450 Okay, and if you look inside this sum, 458 00:33:45,575 --> 00:33:49,025 we're taking the partial derivative of this sum 459 00:33:49,500 --> 00:33:51,175 with respect to the parameter theta I. 460 00:33:52,775 --> 00:33:59,025 But all the terms in the sum, except for one, do not depend on theta I. In this sum, 461 00:33:59,575 --> 00:34:05,175 the only term that depends on theta I will be some term here of theta_i, x_i. 462 00:34:06,175 --> 00:34:09,150 And so we take the partial derivative with respect to theta_i, x_i 463 00:34:10,050 --> 00:34:13,600 take the partial derivative with respect to theta_i of 464 00:34:14,425 --> 00:34:19,500 this term theta_i, x_i, and so you get that 465 00:34:20,175 --> 00:34:22,275 times x_i. 466 00:34:25,150 --> 00:34:31,125 Okay, and so this gives us our learning rule, right, 467 00:34:31,650 --> 00:34:34,800 of theta_i gets updated as theta_i 468 00:34:35,600 --> 00:34:39,175 minus alpha times that. 469 00:34:39,850 --> 00:34:43,450 470 00:34:46,125 --> 00:34:51,700 Okay, and this ... 471 00:34:53,500 --> 00:34:58,125 Greek alphabet alpha here is a parameter of the algorithm called the learning rate, 472 00:34:58,975 --> 00:35:02,150 and this parameter alpha controls 473 00:35:02,725 --> 00:35:07,325 how large a step you take. So you're standing on the hill. You decided 474 00:35:07,850 --> 00:35:10,800 what direction to take a step in, and so this parameter alpha 475 00:35:11,500 --> 00:35:16,625 controls how aggressive how large a step you take in this direction of steepest descent. 476 00:35:18,400 --> 00:35:19,700 And so if you 477 00:35:20,275 --> 00:35:23,550 and this is a parameter of the algorithm that's often set by hand. 478 00:35:24,075 --> 00:35:30,225 If you choose alpha to be too small [than|then] your steepest descent algorithm will take very tiny steps and take a long time to converge. 479 00:35:30,825 --> 00:35:32,550 If alpha is too large then 480 00:35:32,925 --> 00:35:36,275 the steepest descent may actually end up overshooting the minimum, 481 00:35:37,125 --> 00:35:37,725 if you're taking too aggressive a step. 482 00:35:40,300 --> 00:35:42,775 Yeah? *Student:*[Inaudible]. 483 00:35:46,700 --> 00:35:48,100 *Instructor (Andrew Ng)*:Say that again? 484 00:35:48,925 --> 00:35:52,100 *Student:*Isn't there a one over two missing somewhere? 
*Instructor (Andrew Ng)*: Is there a one-half missing? *Student:* I was [inaudible]. *Instructor (Andrew Ng)*: Thanks. I do make lots of errors in that. Any questions about this?

All right, so let me just wrap this up into an algorithm. So over there I derived the algorithm where you have just one training example; more generally, for m training examples, gradient descent becomes the following. We're going to repeat until convergence the following step: theta_i gets updated as theta_i - and I'm just writing out the appropriate equation for m examples rather than one example - theta_i gets updated as theta_i minus alpha times the sum from j equals one to m of h theta of x^(j) minus y^(j), times x_i^(j).

Okay, and I won't bother to show it, but you can go home and sort of verify for yourself that this summation here is indeed the partial derivative with respect to theta_i of J of theta, if you use the original definition of J of theta for when you have m training examples.

Okay, so I'm just going to - switch back to the laptop display, please - show you what this looks like when you run the algorithm. So it turns out that for the specific problem of linear regression, or ordinary least squares, which is what we're doing today, the function J of theta actually does not look like the nasty one with multiple local optima that I showed you just now. In particular, it turns out that for ordinary least squares, the function J of theta is just a quadratic function, and so it will always have a nice bowl shape, like what you see up here, and only have one global minimum with no other local optima.

So when you run gradient descent, here are actually the contours of the function J. So the contours of a bowl-shaped function like that are going to be ellipses, and if you run gradient descent on this algorithm, here's what you might get. Let's see, so I initialize the parameters.
So let's say randomly at the position of that cross over there, right, that cross on the upper right. And so after one iteration of gradient descent, as you move through the space of parameters - so if that's the result of one step of gradient descent - two steps, three steps, four steps, five steps, and so on, and it, you know, converges rapidly to the global minimum of this function J of theta. Okay, and this is a property of ordinary least squares regression with a linear hypothesis class: the function J of theta has no local optima. Yes, question?

*Student:* Is the alpha changing every time? Because the step is not [inaudible].

*Instructor (Andrew Ng)*: So it turns out that - yes, so it turns out this was done with a fixed value of alpha, and one of the properties of gradient descent is that as you approach the local minimum, it actually takes smaller and smaller steps, so it'll converge. And the reason is, the update is - you update theta by subtracting alpha times the gradient. And so as you approach the local minimum, the gradient also goes to zero. At the local minimum the gradient is zero, and as you approach the local minimum, the gradient gets smaller and smaller. And so gradient descent will automatically take smaller and smaller steps as you approach the local minimum. Make sense?

And here's the same plot - here's actually a plot of the housing prices data. So here, let's say you initialize the parameters to the vector of all zeros, and so this blue line at the bottom shows the hypothesis with the parameters at initialization. So initially theta zero and theta one are both zero, and so your hypothesis predicts that all prices are equal to zero. After one iteration of gradient descent, that's the blue line you get. After two iterations, three, four, five, and after a few more iterations, excuse me, it converges, and you've now found the least squares fit to the data.

Okay, let's switch back to the chalkboard. Are there questions about this? Yeah?

*Student:* [Inaudible] iteration, do we mean that we run each sample - all the sample cases - [inaudible] the new values?

*Instructor (Andrew Ng)*: Yes, right.

*Student:* And converged means that the value will be the same, [inaudible] roughly the same?
*Instructor (Andrew Ng)*: Yeah, so this is sort of a question of how do you test for convergence, and there are different ways of testing for convergence. One is you can look at two different iterations and see if theta has changed a lot, and if theta hasn't changed much within two iterations, you may say it's sort of more or less converged. Something that's done maybe slightly more often is to look at the value of J of theta, and if J of theta - if the quantity you're trying to minimize - is not changing much anymore, then you might be inclined to believe it's converged. So these are sort of standard heuristics, or standard rules of thumb, that are often used to decide if gradient descent has converged. Yeah?

*Student:* I may have missed something, but especially in [inaudible] descent. So one feature [inaudible] curve and can either go this way or that way. But the math at incline [inaudible] where that comes in. When do you choose whether you go left, whether you're going this way or that way?

*Instructor (Andrew Ng)*: I see. It just turns out that - so the question is, how is gradient descent looking 360 degrees around you and choosing the direction of steepest descent? So it actually turns out - I'm not sure I'll answer the second part - but it turns out that if you stand on the hill and if you ... it turns out that when you compute the gradient of the function, when you compute the derivative of the function, then it just turns out that that is indeed the direction of steepest descent. By the way, I just want to point out, you would never want to go in the opposite direction, because the opposite direction would actually be the direction of steepest ascent, right.

So as it turns out - maybe the TAs can talk a bit more about this at the section if there's interest - it turns out, when you take the derivative of a function, the derivative of the function sort of turns out to just give you the direction of steepest descent. And so you don't explicitly look all 360 degrees around you. You sort of just compute the derivative, and that turns out to be the direction of steepest descent. Yeah, maybe the TAs can talk a bit more about this on Friday.

Okay, let's see, so let me go ahead and give this algorithm a specific name.
So this algorithm here is actually called batch gradient descent, and the term batch isn't a great term, but the term batch refers to the fact that on every step of gradient descent you're going to look at your entire training set - you're going to perform a sum over your m training examples.

So it turns out that batch gradient descent often works very well. I use it very often, and it turns out that sometimes, if you have a really, really large training set - imagine that instead of having 47 houses from Portland, Oregon in our training set, you had, say, the U.S. Census database or something; with U.S. census-sized databases you often have hundreds of thousands or millions of training examples. So if m is a few million, then if you're running batch gradient descent, this means that to perform every step of gradient descent you need to perform a sum from j equals one to a million. That's sort of a lot of training examples for your computer program to have to look at before you can even take one step downhill on the function J of theta.

So it turns out that when you have very large training sets, you should write down an alternative algorithm that is called stochastic gradient descent. Sometimes I'll also call it incremental gradient descent, but the algorithm is as follows. Again, it will repeat until convergence, and will iterate for j equals one to m, and will perform one of these sorts of gradient descent updates using just the j-th training example.

Oh, and as usual, this is really - you update all the parameters theta. You perform this update for all values of i; for i indexing into the parameter vector, you just perform this update to all of your parameters simultaneously. And the advantage of this algorithm is that, in order to start learning, in order to start modifying the parameters, you only need to look at your first training example.
You look at your first training example and perform an update using the derivative of the error with respect to just your first training example, and then you look at your second training example and perform another update, and you sort of keep adapting your parameters much more quickly, without needing to scan over your entire U.S. Census database before you can even start adapting the parameters.

So let's see - for large data sets, stochastic gradient descent is often much faster, and what happens with stochastic gradient descent is that it won't actually converge to the global minimum exactly, but if these are the contours of your function, then after you run stochastic gradient descent you sort of tend to wander around. And you may actually end up going uphill occasionally, but your parameters will sort of tend to wander to the region closest to the global minimum, and then sort of keep wandering around a little bit near the region of the global minimum. And often that's just fine, to have a parameter that wanders around a little bit near the global minimum. And in practice, this often works much faster than batch gradient descent, especially if you have a large training set.

Okay, I'm going to clean a couple of boards. While I do that, why don't you take a look at the equations, and after I'm done cleaning the boards, I'll ask what questions you have.

Okay, so what questions do you have about all of this?

*Student:* [Inaudible] is it true - are you just sort of rearranging the order that you do the computation? So do you just use the first training example and update all of the theta_i's, and then step, and then update with the second training example, and update all the theta_i's, and then step? And is that why you get sort of this really?

*Instructor (Andrew Ng)*: Let's see, right. So I'm going to look at my first training example and then I'm going to take a step, and then I'm going to perform the second gradient descent update using my new parameter vector, which has already been modified using my first training example. And then I keep going. Make sense? Yeah?

*Student:* So in each update of all the theta_i's, you're only using ...
678 00:50:30,400 --> 00:50:34,875 *Instructor (Andrew Ng)*:One training example. *Student:*One training example. 679 00:50:36,875 --> 00:50:39,025 *Student:*[Inaudible]? 680 00:50:39,700 --> 00:50:43,150 *Instructor (Andrew Ng)*:Let's see, it's definitely empirical. 681 00:50:43,675 --> 00:50:47,100 I believe there's theory that sort of supports that as well. 682 00:50:48,500 --> 00:50:53,350 Yeah, the theory that supports that, the [inaudible] theorem, I don't remember. Okay, 683 00:50:59,700 --> 00:51:03,150 cool. So 684 00:51:05,050 --> 00:51:06,750 in what I've done so far, 685 00:51:07,975 --> 00:51:10,700 I've talked about an iterative algorithm 686 00:51:11,925 --> 00:51:15,750 for performing this minimization of J of theta. 687 00:51:16,925 --> 00:51:23,075 And it turns out that there's another way for this specific problem of least squares regression, of ordinary least squares. 688 00:51:23,600 --> 00:51:28,425 It turns out there's another way to perform this minimization of J of theta that allows you to 689 00:51:29,025 --> 00:51:35,300 solve for the parameters theta in closed form, without needing to run an iterative algorithm. 690 00:51:35,850 --> 00:51:40,575 And I know some of you may have seen some of what I'm about to do before, in 691 00:51:41,125 --> 00:51:44,500 like an undergraduate linear algebra course, and the way 692 00:51:45,500 --> 00:51:47,900 it's typically done requires 693 00:51:48,375 --> 00:51:52,800 [inaudible] projections, or taking lots of derivatives and writing lots of algebra. 694 00:51:53,075 --> 00:51:55,100 What I'd like to do is 695 00:51:55,650 --> 00:51:58,425 show you a way to derive the closed form 696 00:51:58,825 --> 00:52:02,075 solution for theta in just a few lines of algebra. 697 00:52:02,550 --> 00:52:06,900 But to do that, I'll need to introduce a new notation for matrix derivatives, 698 00:52:07,575 --> 00:52:10,900 and it turns out that, sort of, the notation I'm about to 699 00:52:12,950 --> 00:52:13,650 define here, 700 00:52:14,175 --> 00:52:19,500 just in my own personal work, has turned out to be one of the most useful things that I actually use all the time: 701 00:52:19,950 --> 00:52:23,925 to have a notation for how to take derivatives with respect to matrices, so that you can 702 00:52:24,400 --> 00:52:30,025 solve for the minimum of J of theta with, like, a few lines of algebra rather than writing out pages and pages of matrices and derivatives. 703 00:52:31,350 --> 00:52:37,400 So we're going to define this new notation first, and then we'll go ahead and work out the minimization. 704 00:52:38,225 --> 00:52:40,075 Given a function J, 705 00:52:41,000 --> 00:52:44,400 since J is a function of a vector of parameters theta, right, 706 00:52:45,075 --> 00:52:45,950 I'm going to define 707 00:52:46,450 --> 00:52:49,650 the derivative, the gradient of J with respect to theta, 708 00:52:50,425 --> 00:52:52,350 as itself a vector. 709 00:52:53,125 --> 00:52:56,975 710 00:52:57,550 --> 00:52:59,375 711 00:53:02,750 --> 00:53:07,950 Okay, and so this is going to be an (n+1) dimensional vector. 712 00:53:08,600 --> 00:53:12,200 Theta is an (n+1) dimensional vector with indices ranging from zero to n. 713 00:53:12,875 --> 00:53:16,000 And so I'm going to define this derivative to be equal to that. 714 00:53:17,525 --> 00:53:22,875 Okay, and so we can actually rewrite 715 00:53:23,400 --> 00:53:27,900 the gradient descent algorithm as follows.
This is batch gradient descent, 716 00:53:28,325 --> 00:53:34,350 and we write gradient descent as updating the parameter vector theta, and notice there's no subscript i now, 717 00:53:35,125 --> 00:53:42,675 updating the parameter vector theta as the previous parameter vector minus alpha times the gradient. 718 00:53:44,775 --> 00:53:51,275 Okay, and so in this equation all of these quantities, theta 719 00:53:52,050 --> 00:53:58,975 and this gradient vector, all of these are (n+1) dimensional vectors. 720 00:54:00,475 --> 00:54:03,575 I was using the boards out of order, wasn't I? 721 00:54:04,375 --> 00:54:08,975 722 00:54:09,725 --> 00:54:13,875 723 00:54:14,425 --> 00:54:15,100 724 00:54:15,700 --> 00:54:17,900 725 00:54:20,800 --> 00:54:23,350 So more generally, 726 00:54:24,675 --> 00:54:27,300 if you have a function f 727 00:54:27,650 --> 00:54:31,125 that maps from the space of matrices A, 728 00:54:32,600 --> 00:54:33,925 729 00:54:36,650 --> 00:54:37,850 that maps from, 730 00:54:38,350 --> 00:54:42,475 say, the space of n by n matrices to the space of real numbers. 731 00:54:42,900 --> 00:54:46,525 So if you have a function f of A, 732 00:54:47,000 --> 00:54:49,525 where A ... 733 00:54:50,025 --> 00:54:51,400 is an n by n matrix. 734 00:54:52,100 --> 00:54:56,600 So this function maps from matrices to real numbers; it's a function that takes as input a matrix. 735 00:54:57,675 --> 00:55:01,550 Let me define the derivative of f 736 00:55:02,150 --> 00:55:03,425 with respect to the matrix A. 737 00:55:04,375 --> 00:55:08,100 Now, I'm just taking the gradient of f with respect to its input, which is a matrix. 738 00:55:08,950 --> 00:55:12,925 I'm going to define this itself to be a matrix. 739 00:55:13,350 --> 00:55:15,575 740 00:55:16,600 --> 00:55:21,550 741 00:55:22,600 --> 00:55:25,300 742 00:55:26,075 --> 00:55:31,200 743 00:55:32,250 --> 00:55:33,825 744 00:55:36,475 --> 00:55:42,500 Okay, so the derivative of f with respect to A is itself a matrix, and that matrix contains 745 00:55:43,475 --> 00:55:44,950 all the partial derivatives of f 746 00:55:45,950 --> 00:55:47,075 with respect to the elements of A. 747 00:55:58,575 --> 00:56:00,825 One more definition: 748 00:56:01,275 --> 00:56:05,875 if A is a square matrix, so if A 749 00:56:06,525 --> 00:56:10,750 is an n by n matrix, number of rows equals number of columns, 750 00:56:11,325 --> 00:56:13,900 let me define the trace of A 751 00:56:14,450 --> 00:56:17,325 to be equal to the sum of A's diagonal elements. 752 00:56:18,475 --> 00:56:24,850 So this is just the sum over i of A_i,i. 753 00:56:26,300 --> 00:56:31,900 For those of you that haven't seen this sort of operator notation before, you can think of trace of A as 754 00:56:32,175 --> 00:56:35,175 the trace operator applied to the square matrix A, 755 00:56:35,600 --> 00:56:39,350 but it's more commonly written without the parentheses. So I usually write trace of A like this, 756 00:56:40,025 --> 00:56:41,650 and this just means the sum of the diagonal elements. 757 00:56:44,175 --> 00:56:46,750 So 758 00:56:47,575 --> 00:56:48,400 759 00:56:49,225 --> 00:56:51,750 here are some facts about the trace operator and about derivatives, 760 00:56:52,125 --> 00:56:56,775 and I'm just going to write these without proof. You can also have the TAs prove some of them in the discussion section, 761 00:56:57,300 --> 00:57:00,575 or you can actually go home and 762 00:57:01,125 --> 00:57:02,550 verify the proofs of all of these.
763 00:57:03,550 --> 00:57:07,675 It turns out that given two matrices, 764 00:57:08,350 --> 00:57:14,350 A and B, the trace of the matrix A times B is equal to 765 00:57:14,875 --> 00:57:20,275 the trace of B A. Okay, I'm not going to prove this, but you should be able to go home and prove this yourself 766 00:57:20,950 --> 00:57:21,925 without too much difficulty. 767 00:57:23,425 --> 00:57:24,725 And similarly, 768 00:57:25,100 --> 00:57:26,125 769 00:57:26,875 --> 00:57:31,375 for the trace of a product of three matrices, you can take the matrix at the end and 770 00:57:31,725 --> 00:57:35,925 cyclically permute it to the front. So the trace of A times B times C 771 00:57:36,275 --> 00:57:38,125 is equal to the trace of C A B. 772 00:57:38,775 --> 00:57:40,700 So take the matrix C at the back and move it to the front, 773 00:57:41,200 --> 00:57:44,550 and this is also equal to the trace of 774 00:57:45,050 --> 00:57:47,375 B C A. Take the matrix B and move it to the front. Okay, 775 00:57:54,850 --> 00:57:57,900 also, 776 00:57:58,750 --> 00:58:05,075 suppose you have a function f of A which is defined as the trace of A B. 777 00:58:06,425 --> 00:58:07,150 Okay, so this is, 778 00:58:07,900 --> 00:58:14,600 right, the trace is a real number. So the trace of A B is a function that takes as input a matrix A and outputs a real number. 779 00:58:16,250 --> 00:58:22,175 So then the derivative with respect to the matrix A of this function, the trace of A B, 780 00:58:25,650 --> 00:58:26,550 781 00:58:27,050 --> 00:58:31,975 is going to be B transpose. And this is just another fact that you can 782 00:58:32,875 --> 00:58:36,525 prove by yourself by going back and referring to the definitions of traces and matrix derivatives. 783 00:58:37,000 --> 00:58:39,100 I'm not going to prove it. You should work it out. 784 00:58:40,875 --> 00:58:43,175 Okay, and lastly a couple of easy ones. 785 00:58:43,550 --> 00:58:48,925 The trace of A is equal to the trace of A transpose because the trace is just the sum of the diagonal elements, 786 00:58:49,800 --> 00:58:54,675 and if you transpose the matrix, the diagonal doesn't change. 787 00:58:55,450 --> 00:58:59,125 And if lower case a is a real number, then 788 00:58:59,825 --> 00:59:05,150 the trace of a real number is just itself. So think of a real number as a one by one matrix. 789 00:59:05,750 --> 00:59:09,475 So the trace of a one by one matrix is just whatever that real number is. 790 00:59:12,375 --> 00:59:13,975 791 00:59:15,575 --> 00:59:20,825 And lastly, this is a somewhat tricky one. 792 00:59:21,850 --> 00:59:23,275 793 00:59:24,850 --> 00:59:27,750 794 00:59:28,825 --> 00:59:33,525 The derivative with respect to the matrix A of the trace of A B A transpose C 795 00:59:34,150 --> 00:59:36,125 is equal to C A B 796 00:59:36,625 --> 00:59:40,925 plus C transpose A 797 00:59:41,625 --> 00:59:47,075 B transpose. And I won't prove that either. This is sort of just algebra. Work it out yourself. 798 00:59:48,500 --> 00:59:49,325 799 00:59:50,800 --> 00:59:53,325 800 00:59:55,325 --> 00:59:59,200 Okay, and so, I guess the key equations ... 801 01:00:01,325 --> 01:00:03,050 the key facts I'm going to use again 802 01:00:06,550 --> 01:00:08,550 about traces and matrix derivatives are these five. [Ten minutes.]
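[Note: a quick numerical sanity check of these facts, again a sketch rather than lecture material; it verifies the trace identities directly and checks the two matrix-derivative facts against finite differences on random 4 by 4 matrices.]

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))

# Cyclic permutation: tr(AB) = tr(BA) and tr(ABC) = tr(CAB) = tr(BCA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))

def numeric_grad(f, A, eps=1e-6):
    # Finite-difference estimate of the matrix of partial derivatives df/dA_ij.
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

# grad_A tr(AB) = B^T
assert np.allclose(numeric_grad(lambda M: np.trace(M @ B), A), B.T)

# grad_A tr(A B A^T C) = C A B + C^T A B^T
assert np.allclose(numeric_grad(lambda M: np.trace(M @ B @ M.T @ C), A),
                   C @ A @ B + C.T @ A @ B.T)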
803 01:00:14,050 --> 01:00:14,975 Okay, 804 01:00:16,625 --> 01:00:17,150 so 805 01:00:18,225 --> 01:00:22,650 armed with these facts, 806 01:00:23,175 --> 01:00:24,675 let's try to come up with a quick derivation 807 01:00:25,450 --> 01:00:28,100 for how to minimize J of theta as a function of 808 01:00:28,550 --> 01:00:31,850 theta in closed form, without needing to use an iterative algorithm. 809 01:00:35,850 --> 01:00:37,525 So let's work this out. Let me define 810 01:00:37,875 --> 01:00:41,450 the matrix X, which is called the design matrix, 811 01:00:41,975 --> 01:00:45,600 to be the matrix containing all the inputs from my training set. 812 01:00:47,125 --> 01:00:53,075 So x_1 was the vector of inputs, the vector of features, for my first training example. 813 01:00:53,700 --> 01:00:54,900 So I'm going to 814 01:00:56,675 --> 01:01:00,100 set x_1 transpose to be the first row of this matrix X, 815 01:01:00,900 --> 01:01:02,725 816 01:01:03,300 --> 01:01:04,300 817 01:01:04,850 --> 01:01:09,125 put my second training example in the second row, and so on. 818 01:01:09,350 --> 01:01:12,275 And I have m training examples, and so 819 01:01:13,600 --> 01:01:15,875 that's going to be my 820 01:01:16,925 --> 01:01:22,350 design matrix X. Okay, that defines the matrix capital X, and so 821 01:01:24,500 --> 01:01:28,875 now, let me take this matrix X and multiply it by my parameter vector theta. 822 01:01:29,375 --> 01:01:33,250 This derivation will just take two or three steps. So X times theta, 823 01:01:34,475 --> 01:01:40,950 remember how matrix-vector multiplication goes: you take this vector and you multiply it by each of the rows of the matrix. 824 01:01:42,100 --> 01:01:44,725 So X times theta is just going to be 825 01:01:45,150 --> 01:01:48,550 x_1 transpose theta, 826 01:01:48,925 --> 01:01:51,225 dot, dot, dot, down to x_m 827 01:01:53,975 --> 01:01:54,825 transpose theta. 828 01:01:55,725 --> 01:01:57,400 And this is, of course, just 829 01:01:58,125 --> 01:02:00,350 830 01:02:00,975 --> 01:02:06,800 831 01:02:07,650 --> 01:02:12,325 the predictions of your hypothesis on each of your m training examples. 832 01:02:14,900 --> 01:02:15,650 833 01:02:16,100 --> 01:02:18,650 Let me also define the y vector 834 01:02:19,925 --> 01:02:24,500 to be the vector of all the target values y_1 through y_m 835 01:02:25,175 --> 01:02:29,500 in my training set. Okay, so the y vector is an m dimensional vector. 836 01:02:31,250 --> 01:02:35,325 837 01:02:36,100 --> 01:02:38,000 838 01:02:39,425 --> 01:02:42,325 839 01:02:43,825 --> 01:02:45,325 840 01:02:48,525 --> 01:02:51,525 So 841 01:02:53,075 --> 01:02:55,100 X theta minus y, 842 01:02:55,675 --> 01:02:58,575 continuing the math from the previous board, this is going to be, 843 01:02:59,075 --> 01:03:02,750 844 01:03:03,475 --> 01:03:04,025 845 01:03:04,925 --> 01:03:10,425 846 01:03:14,875 --> 01:03:21,875 right, and now, X theta minus y, this is a vector. This is an m dimensional vector if there are m training examples, 847 01:03:22,925 --> 01:03:27,025 and so I'm actually going to take this vector and take the inner product of it with itself. 848 01:03:27,425 --> 01:03:32,200 Okay, so recall that if z is a vector 849 01:03:32,875 --> 01:03:36,450 then z transpose z is just the sum over i of 850 01:03:36,675 --> 01:03:40,325 z_i squared. Right, that's how you take the inner product of a vector with itself.
851 01:03:41,325 --> 01:03:44,600 So you want to take this vector, X theta minus y, 852 01:03:44,950 --> 01:03:49,200 and take the inner product 853 01:03:49,650 --> 01:03:50,975 of this vector with itself, 854 01:03:51,925 --> 01:03:52,900 and so that gives me 855 01:03:54,625 --> 01:03:55,600 856 01:03:56,250 --> 01:03:57,425 857 01:03:57,850 --> 01:04:04,975 the sum from i equals one to m of (h_theta(x_i) - y) squared, 858 01:04:06,075 --> 01:04:11,000 okay, since it's just the sum of the squares of the elements of this vector. 859 01:04:12,425 --> 01:04:14,175 And 860 01:04:14,900 --> 01:04:18,700 we put a one-half there in front. 861 01:04:19,875 --> 01:04:23,625 This is our previous definition of J of theta. Okay, yeah? 862 01:04:30,025 --> 01:04:32,850 *Student:*[Inaudible]? 863 01:04:33,350 --> 01:04:36,750 *Instructor (Andrew Ng)*:Yeah, I threw a lot of notation at you today. 864 01:04:37,175 --> 01:04:39,700 So m is the number of training examples, and 865 01:04:40,250 --> 01:04:42,875 the training examples are indexed from one through m, 866 01:04:44,000 --> 01:04:47,225 and then it's the feature vector that is indexed from zero through n. Does that make sense? 867 01:04:49,575 --> 01:04:52,775 So this is the sum from one through m. 868 01:04:53,450 --> 01:04:55,175 It's sort of 869 01:04:57,050 --> 01:05:02,875 theta transpose x that's equal to the sum from j equals zero through n of 870 01:05:03,250 --> 01:05:05,100 theta_j x_j. Does that make sense? 871 01:05:08,425 --> 01:05:10,775 It's the features that are 872 01:05:11,200 --> 01:05:14,425 indexed from zero through n, where x_0 is equal to one, 873 01:05:15,100 --> 01:05:16,950 whereas the training examples are indexed from one through m. 874 01:05:21,875 --> 01:05:26,625 So let me clean a few more boards, and then take another look at this and make sure it all makes sense. 875 01:05:27,700 --> 01:05:31,575 876 01:05:33,425 --> 01:05:37,525 877 01:05:39,100 --> 01:05:41,075 878 01:05:43,800 --> 01:05:46,250 879 01:05:46,650 --> 01:05:49,625 880 01:05:50,550 --> 01:05:55,875 881 01:05:58,425 --> 01:05:59,450 882 01:06:02,175 --> 01:06:03,550 Okay, yeah? 883 01:06:08,600 --> 01:06:11,950 *Student:*[Inaudible] the y inside the parentheses, shouldn't that be [inaudible]? *Instructor (Andrew Ng)*:Oh, yes, thank you. Oh is that what you meant? 884 01:06:13,800 --> 01:06:17,875 Yes, thank you. Great, i-th training example. Anything else? Cool. 885 01:06:26,850 --> 01:06:33,425 So we're actually nearly done with this derivation. We would like to minimize J of theta 886 01:06:34,300 --> 01:06:35,750 with respect to theta, 887 01:06:36,500 --> 01:06:41,050 and we've written J of theta fairly compactly using this matrix-vector notation. 888 01:06:42,750 --> 01:06:45,900 So in order to minimize J of theta with respect to theta, 889 01:06:47,425 --> 01:06:51,225 what we're going to do is take the derivative with respect to theta of J of theta, 890 01:06:52,375 --> 01:06:57,725 set this to zero, and solve for theta. Okay, 891 01:07:02,150 --> 01:07:05,475 so we have 892 01:07:06,625 --> 01:07:11,500 the derivative with respect to theta of 893 01:07:12,425 --> 01:07:18,675 that, which is equal to ... 894 01:07:21,075 --> 01:07:21,775 I should mention, 895 01:07:22,025 --> 01:07:28,600 there will be some steps here that I'm just going to do fairly quickly without proof.
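[Note: a small check with made-up numbers, not from the lecture, confirming that the matrix form one-half times (X theta - y) transpose (X theta - y) agrees with the earlier sum-of-squares definition of J of theta.]

import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
X = np.column_stack([np.ones(m), rng.standard_normal((m, n))])  # first column is x_0 = 1
y = rng.standard_normal(m)
theta = rng.standard_normal(n + 1)

# Sum-of-squares form: J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
J_sum = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(m))

# Matrix form: J(theta) = 1/2 * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_mat = 0.5 * (r @ r)

assert np.isclose(J_sum, J_mat)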
So is it really true that the derivative of one-half of that is one-half of the derivative? 896 01:07:29,000 --> 01:07:29,825 I've already exchanged 897 01:07:30,425 --> 01:07:33,475 the derivative and the one-half. In terms of the answer, yes, but 898 01:07:34,000 --> 01:07:40,200 later on you should go home and look through the lecture notes and make sure you understand and believe why every step is correct. 899 01:07:40,625 --> 01:07:42,575 I'm going to do things relatively quickly here, 900 01:07:43,025 --> 01:07:46,450 and you can work through every step yourself more slowly by referring to the lecture notes. 901 01:07:48,650 --> 01:07:53,325 Okay, so that's equal to ... I'm now going to expand this quadratic function. So 902 01:07:53,725 --> 01:07:55,225 this is going to be, 903 01:07:56,125 --> 01:07:57,750 904 01:07:58,700 --> 01:08:01,950 905 01:08:02,875 --> 01:08:06,500 906 01:08:07,050 --> 01:08:10,050 okay, 907 01:08:11,150 --> 01:08:15,975 and this is just sort of taking a quadratic function and expanding it out by multiplying out the terms. And again, 908 01:08:16,600 --> 01:08:18,000 work through the steps later yourself 909 01:08:19,400 --> 01:08:22,650 if you're not quite sure how I did that. So 910 01:08:24,175 --> 01:08:27,200 this thing, this vector-vector product, right, 911 01:08:27,925 --> 01:08:32,475 this quantity here, this is just J of theta and so it's just a real number, 912 01:08:33,500 --> 01:08:35,100 and the trace of a real number is just itself. *Student:*[Inaudible]. 913 01:08:38,225 --> 01:08:39,925 *Instructor (Andrew Ng)*:Oh, thanks, Dan. 914 01:08:41,275 --> 01:08:46,525 Cool, great. So this quantity in parentheses, this is J of theta and it's just a real number. 915 01:08:47,425 --> 01:08:53,175 And so the trace of a real number is just the same real number. And so you can sort of apply the trace operator without changing anything. 916 01:08:57,625 --> 01:09:03,600 And this is equal to one-half the derivative with respect to theta of the trace of, 917 01:09:05,700 --> 01:09:09,650 by the cyclic permutation property of the trace, you can take this theta at the end 918 01:09:10,325 --> 01:09:14,050 and move it to the front. So this is going to be the trace of theta 919 01:09:15,000 --> 01:09:19,025 times theta transpose X transpose X, 920 01:09:22,250 --> 01:09:24,425 minus 921 01:09:26,125 --> 01:09:31,600 the derivative with respect to theta of the trace of, again, I'm going to 922 01:09:32,325 --> 01:09:33,825 take that and bring it to the ... 923 01:09:35,475 --> 01:09:37,800 oh, sorry. 924 01:09:39,575 --> 01:09:45,875 Actually, this thing here is also a real number, and the transpose of a real number is just itself. Right, 925 01:09:46,775 --> 01:09:49,600 so you can take the transpose of a real number without changing anything. 926 01:09:50,450 --> 01:09:52,550 So let me go ahead and just take the transpose of this. 927 01:09:53,025 --> 01:09:56,025 A real number transposed is just the same real number. So 928 01:09:57,200 --> 01:10:02,650 this is minus the trace of, taking the transpose of that, y transpose X theta, then 929 01:10:04,400 --> 01:10:09,175 minus the trace of y transpose X theta. Okay, 930 01:10:11,150 --> 01:10:15,100 and this last quantity, y transpose y, doesn't actually depend on theta. 931 01:10:15,350 --> 01:10:18,550 So when I take the derivative of this last term with respect to theta, it's just zero. So just drop that term.
932 01:10:21,550 --> 01:10:26,125 And lastly, 933 01:10:29,550 --> 01:10:35,575 well, the derivative with respect to theta of the trace of theta theta transpose 934 01:10:37,125 --> 01:10:38,900 X transpose X. 935 01:10:39,450 --> 01:10:45,250 I'm going to use one of the facts I wrote down earlier without proof, and I'm going 936 01:10:46,400 --> 01:10:52,325 to let this theta be A. There's an identity matrix in there, so this is A, B, A transpose, 937 01:10:53,100 --> 01:10:58,350 C, and using the rule that I wrote down previously, which you'll find in the lecture notes, because 938 01:10:59,025 --> 01:11:01,675 it's still on one of the boards from earlier, 939 01:11:02,825 --> 01:11:05,050 this is just equal to 940 01:11:06,125 --> 01:11:09,550 X transpose X theta. 941 01:11:10,925 --> 01:11:14,050 942 01:11:15,600 --> 01:11:20,150 So this is C, 943 01:11:21,150 --> 01:11:22,850 A, 944 01:11:24,075 --> 01:11:29,700 B, where B is just the identity matrix, which you can ignore, plus X transpose X 945 01:11:31,600 --> 01:11:36,575 theta, where this is now C transpose A, 946 01:11:37,425 --> 01:11:41,425 again times the identity, which we're going to ignore, times B transpose. 947 01:11:42,400 --> 01:11:45,375 And the matrix X transpose X is symmetric, so C transpose is equal to C. 948 01:11:49,350 --> 01:11:55,375 Similarly, the derivative with respect to theta of the trace of 949 01:11:56,475 --> 01:11:57,525 950 01:12:00,075 --> 01:12:02,200 y transpose X theta, 951 01:12:03,325 --> 01:12:06,750 this is the ... 952 01:12:07,900 --> 01:12:08,600 953 01:12:09,400 --> 01:12:13,925 derivative with respect to a matrix A of the trace of B A, 954 01:12:14,600 --> 01:12:19,425 and this is just X transpose y. This is just 955 01:12:20,175 --> 01:12:23,675 B transpose, again, by one of the rules that I wrote down earlier. 956 01:12:26,150 --> 01:12:26,675 And so 957 01:12:27,550 --> 01:12:33,900 958 01:12:35,425 --> 01:12:39,825 959 01:12:41,500 --> 01:12:42,575 960 01:12:45,575 --> 01:12:48,100 if you plug this back in, we find, therefore, that 961 01:12:48,625 --> 01:12:50,750 the derivative ... wow, this board's really bad. 962 01:12:52,175 --> 01:12:57,125 963 01:12:58,450 --> 01:13:03,075 964 01:13:03,875 --> 01:13:05,250 965 01:13:06,250 --> 01:13:12,225 So if you plug this back into our formula for the derivative of J, 966 01:13:12,750 --> 01:13:17,475 you find that the derivative with respect to theta of J of theta is equal to 967 01:13:18,825 --> 01:13:23,100 one-half times X transpose X theta 968 01:13:23,775 --> 01:13:26,450 plus X transpose X theta, 969 01:13:27,500 --> 01:13:31,900 minus X transpose y, minus X transpose y, 970 01:13:32,825 --> 01:13:38,075 which is just X transpose X theta minus X transpose y. 971 01:13:41,325 --> 01:13:43,250 972 01:13:44,175 --> 01:13:50,200 Okay, so we set this to zero, and we get 973 01:13:51,400 --> 01:13:52,650 that X transpose X theta equals X transpose y, 974 01:13:53,575 --> 01:13:56,375 975 01:13:57,150 --> 01:14:01,550 976 01:14:02,425 --> 01:14:04,600 which is called the normal equation, 977 01:14:05,700 --> 01:14:07,250 and 978 01:14:08,100 --> 01:14:09,275 we can now solve this 979 01:14:10,050 --> 01:14:12,200 equation for theta 980 01:14:12,975 --> 01:14:18,075 981 01:14:19,575 --> 01:14:24,325 in closed form. That's theta equals X transpose X, inverse, times X transpose y.
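[Note: a minimal sketch of how the normal equation might be used in code; the housing-style numbers are made up for illustration, and solving the linear system X transpose X theta = X transpose y is generally preferable to forming the inverse explicitly.]

import numpy as np

def normal_equation(X, y):
    # Closed-form least squares: solve X^T X theta = X^T y for theta.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative made-up data: fit price = theta_0 + theta_1 * size
sizes = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
prices = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
X = np.column_stack([np.ones_like(sizes), sizes])   # design matrix with x_0 = 1
theta = normal_equation(X, prices)
print(theta)   # [theta_0, theta_1]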
982 01:14:24,925 --> 01:14:26,200 And so this gives us a way 983 01:14:29,175 --> 01:14:32,125 of solving for the least squares fit to the parameters in closed form, 984 01:14:32,450 --> 01:14:34,375 without needing to use an iterative algorithm like gradient descent. 985 01:14:36,550 --> 01:14:41,100 Okay, and using this matrix-vector notation, I think, 986 01:14:42,125 --> 01:14:48,075 I don't know, I think we did this whole thing in about ten minutes, which we couldn't have done if I was writing out reams of algebra. 987 01:14:52,300 --> 01:14:58,100 Okay, some of you look a little bit dazed, but this is our first learning algorithm. Aren't you excited? 988 01:14:59,550 --> 01:15:03,000 Any quick questions about this before we close for today? 989 01:15:06,900 --> 01:15:07,875 *Student:*[Inaudible]. 990 01:15:08,500 --> 01:15:11,350 *Instructor (Andrew Ng)*:Say that again. *Student:*What you derived, wasn't that just [inaudible] of X? 991 01:15:11,575 --> 01:15:15,550 *Instructor (Andrew Ng)*:What inverse? *Student:*Pseudo inverse. *Instructor (Andrew Ng)*:Pseudo inverse? *Student:*Pseudo inverse. 992 01:15:17,250 --> 01:15:17,825 *Instructor (Andrew Ng)*:Yeah, 993 01:15:18,200 --> 01:15:24,700 it turns out that in cases where X transpose X is not invertible, then you use the pseudo inverse to solve this. 994 01:15:25,950 --> 01:15:31,150 But if it turns out X transpose X is not invertible, that usually means your features were linearly dependent. 995 01:15:31,750 --> 01:15:36,525 It usually means you did something like repeat the same feature twice in your training set. 996 01:15:37,175 --> 01:15:41,075 So if this is not invertible, it turns out the minimum is obtained by using the pseudo inverse instead of the inverse. 997 01:15:41,950 --> 01:15:45,425 If you don't know what I just said, don't worry about it. It usually won't be a problem. 998 01:15:46,525 --> 01:15:47,050 Anything else? 999 01:15:48,175 --> 01:15:49,125 *Student:*On the second board [inaudible]? 1000 01:15:49,700 --> 01:15:50,275 1001 01:15:50,575 --> 01:15:59,950 *Instructor (Andrew Ng)*:Let me take that off. We're running over. Let's close for today, and if there are other questions, I'll take them after.
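[Note: relating to the pseudo-inverse exchange above, a short sketch, not from the lecture, with a deliberately repeated feature so that X transpose X is singular; np.linalg.pinv and np.linalg.lstsq both still return a least-squares minimizer in that case.]

import numpy as np

rng = np.random.default_rng(2)
m = 6
x1 = rng.standard_normal(m)
X = np.column_stack([np.ones(m), x1, x1])   # the same feature twice, so X^T X is singular
y = rng.standard_normal(m)

theta_pinv = np.linalg.pinv(X) @ y                     # pseudo-inverse solution
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # SVD-based least-squares solver

# Both minimize J(theta); the fitted values agree even though theta itself is not unique.
assert np.allclose(X @ theta_pinv, X @ theta_lstsq)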