This presentation is delivered by the Stanford Center for Professional Development.

OK. So let's get started with today's material. So welcome back to the second lecture. What I want to do today is talk about linear regression, gradient descent, and the normal equations. And I should also say, lecture notes have been posted online, and so if some of the math I go over today I go over rather quickly, if you want to see every equation written out and work through the details more slowly yourself, go to the course homepage and download detailed lecture notes that pretty much describe all the mathematical, technical content I'm going to go over today.

Today I'm also going to delve into a fair amount - some amount of linear algebra, and so if you would like to see a refresher on linear algebra, this week's discussion section will be taught by the TAs and will be a refresher on linear algebra. So if some of the linear algebra I talk about today sort of seems to be going by pretty quickly, or if you just want to see some of the things I'm claiming today without proof - if you want to just see some of those things written out in detail - you can come to this week's discussion section.

So I just want to start by showing you a fun video. Remember, at the last lecture, the initial lecture, I talked about supervised learning. And supervised learning was this machine-learning problem where I said we're going to tell the algorithm what the "right" answer is for a number of examples, and then we want the algorithm to replicate more of the same. So the example I had at the first lecture was the problem of predicting housing prices, where you may have a training set, and we tell the algorithm what the "right" housing price was for every house in the training set. And then you want the algorithm to learn the relationship between sizes of houses and the prices, and essentially produce more of the "right" answers.

So let me show you a video now. Load the big screen, please. So I'll show you a video now from Dean Pomerleau, on some work he did at Carnegie Mellon applying supervised learning to get a car to drive itself. This is work on a vehicle known as Alvin. It was done sort of about 15 years ago, and I think
it was a very elegant example of the sorts of things you can get supervised learning algorithms to do. On the video, you hear Dean Pomerleau's voice mention an algorithm called a neural network. I'll say a little bit about that later, but the essential learning algorithm for this is something called gradient descent, which I will talk about later in today's lecture. Let's watch ... Let's watch the video.

[Video plays: Alvin is a system of artificial neural networks that learns to steer by watching a person drive. Alvin is designed to control the [inaudible], a modified army [inaudible] equipped with sensors, computers, and actuators for autonomous navigation experiments. The initial step in configuring Alvin is training the networks to steer. During training, a person drives the vehicle while Alvin watches. Once every two seconds, Alvin digitizes a video image of the road ahead and records the person's steering direction. This training image is reduced in resolution to 30 by 32 pixels and provided as input to Alvin's three-layer network.]

*Instructor (Andrew Ng)*: So two comments, right. One is, this is supervised learning, because it's learning from a human driver, in which a human driver shows that, when we're on this segment of the road, I will steer at this angle; on this segment of the road, I'll steer at this angle. And so the human provides a number of "correct" steering directions to the car, and then it's the job of the car to try to learn to produce more of these "correct" steering directions that keep the car on the road.

On the monitor display up here, I just want to tell you a little bit about what this display means. So on the upper left, where the mouse pointer is moving, this horizontal line actually shows the human steering direction, and this white bar, or this white area right here, shows the steering direction chosen by the human driver by moving the steering wheel. The human is steering a little bit to the left here, indicated by the position of this white region.
This second line here, where the mouse is pointing - the second line here is the output of the learning algorithm, and where the learning algorithm currently thinks is the right steering direction. And right now what you're seeing is the learning algorithm just at the very beginning of training, and so it has just no idea of where to steer. And so its output is this little white smear over the entire range of steering directions. And as the algorithm collects more examples and learns over time, you see it start to more confidently choose a steering direction. So let's keep watching the video.

[Video plays: Using the backpropagation learning algorithm, Alvin is trained to output the same steering direction as the human driver for that image. Initially, the network's steering response is random. After about two minutes of training, the network learns to accurately imitate the steering reactions of the human driver. The same training procedure is repeated for other road types. After the networks have been trained, the operator pushes the run switch, and Alvin begins driving. Twelve times per second, Alvin digitizes an image and feeds it to the neural networks. Each network, running in parallel, produces a steering direction and a measure of its confidence in its response. The steering direction from the most confident network - in this case, the network trained for the one-lane road - is used to control the vehicle. Suddenly, an intersection appears ahead of the vehicle. As the vehicle approaches the intersection, the confidence of the one-lane network decreases. As it crosses the intersection and the two-lane road ahead comes into view, the confidence of the two-lane network rises. When its confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road.]
*Instructor (Andrew Ng)*: All right, so who thought driving could be that dramatic, right? Switch back to the chalkboard, please.

I should say, this work was done about 15 years ago, and autonomous driving has come a long way. So many of you will have heard of the DARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, led the winning team that drove a car across a desert by itself. So Alvin was, I think, absolutely amazing work for its time, but autonomous driving has obviously come a long way since then.

So what you just saw was an example, again, of supervised learning, and in particular it was an example of what we call the regression problem, because the vehicle is trying to predict a continuous-valued variable, a continuous-valued steering direction; we call that a regression problem. And what I want to do today is talk about our first supervised learning algorithm, and it will also be applied to a regression task.

So for the running example that I'm going to use throughout today's lecture, we're going to return to the example of trying to predict housing prices. So here's actually a dataset collected by a TA, Dan Ramage, on housing prices in Portland, Oregon. So here's a dataset of a number of houses of different sizes, and here are their asking prices in thousands of dollars, $200,000.

And so we can take this data and plot it, square feet versus price, and so you may get a dataset like that. And the question is, given a dataset like this, or given what we call a training set like this, how do you learn to predict the relationship between the size of the house and the price of the house?

So I'm actually going to come back and modify this task a little bit more later, but let me go ahead and introduce some notation, which I'll be using, actually, throughout the rest of this course.
The first piece of notation is, I'm going to let the lowercase alphabet m denote the number of training examples, and that just means the number of rows, or the number of examples, houses and prices, we have. And in this particular dataset we have, as it happens, 47 training examples, although I wrote down only five. Okay, so throughout this quarter I'm going to use the alphabet m to denote the number of training examples.

I'm going to use the lowercase alphabet x to denote the input variables, which I'll often also call the features. And so in this case, x would denote the size of the house we're looking at.

I'm going to use y to denote the "output" variable, which is sometimes also called the target ... target variable, and so one pair, (x, y), is what comprises one training example. In other words, one row of the table I drew just now would be what I call one training example, and the i-th training example, in other words the i-th row in that table, I'm going to write as (x^(i), y^(i)). Okay, and so in this notation I'm going to use, this superscript i is not exponentiation. So this is not x to the power of i, y to the power of i. In this notation, the superscript i in parentheses is just sort of an index into the i-th row of my list of training examples.

So in supervised learning, this is what we're going to do. We're given a training set, and we're going to feed our training set, comprising our m training examples, so 47 training examples, into a learning algorithm. Okay, and our learning algorithm then outputs a function that, by tradition and for historical reasons, is usually denoted lowercase alphabet h, and is called a hypothesis. Don't worry too much about whether the term hypothesis has a deep meaning.
It's more a term that's used for historical reasons. And the hypothesis's job is to take this input - there's some new house whose price you want to estimate. What the hypothesis does is it takes this input, the living area of a new house in square feet, say, and it outputs an estimate of the price of this house. So the hypothesis h maps from inputs x to outputs y.

So in order to design a learning algorithm, the first thing we have to decide is how we want to represent the hypothesis, right. And just for the purposes of this lecture, for the purposes of our first learning algorithm, I'm going to use a linear representation for the hypothesis. So I'm going to represent my hypothesis as h of x equals theta zero plus theta one x, where x here is an input feature, and so that's the size of the house we're considering.

And more generally - I'll come back to this - more generally, for many regression problems we may have more than one input feature. So for example, if instead of just knowing the size of the houses, we know also the number of bedrooms in these houses, let's say - so if our training set also has a second feature, the number of bedrooms in the house - then you may let, say, x_1 denote the size in square feet, and let x_2 denote the number of bedrooms, and then I would write the hypothesis, h of x, as theta zero plus theta_1 x_1 plus theta_2 x_2. Okay.

And sometimes when I want to take the hypothesis h and make its dependence on the thetas explicit, I'll sometimes write this as h subscript theta of x. And so this is the price that my hypothesis predicts a house with features x costs. So given a new house with features x, a certain size and a certain number of bedrooms, this is going to be the price that my hypothesis predicts this house is going to cost.
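For readers following along in the notes, here is the hypothesis just described, written out in symbols exactly as spoken (x_1 is the living area in square feet and x_2 is the number of bedrooms):

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
```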
251 00:18:08,200 --> 00:18:13,000 One final piece of notation, 252 00:18:13,700 --> 00:18:17,825 so for conciseness, 253 00:18:18,475 --> 00:18:22,075 254 00:18:23,225 --> 00:18:29,525 just to write this a bit more compactly I'm going to take the convention of defining x_0 to be equal to one, 255 00:18:30,275 --> 00:18:35,075 and so I can now write H of X to be equal to 256 00:18:35,900 --> 00:18:38,575 sum from i equals one to two 257 00:18:39,300 --> 00:18:43,775 of theta_i, oh sorry, zero to two, theta_i, x_i. 258 00:18:44,275 --> 00:18:48,875 And if you think of theta as an x, as vectors, then this is just 259 00:18:49,700 --> 00:18:50,500 "theta transpose x." 260 00:18:51,625 --> 00:18:53,075 And 261 00:18:54,000 --> 00:18:58,125 the very final piece of notation is I'm also going to let 262 00:18:58,725 --> 00:19:03,950 lower case n define ... 263 00:19:05,550 --> 00:19:10,875 lower case n be the number of features in my learning problem. And so this actually becomes a sum ... 264 00:19:12,625 --> 00:19:15,425 265 00:19:16,250 --> 00:19:21,825 a sum from I equals zero to N, where in this example if you have two features, "n" would be equal to two. 266 00:19:25,875 --> 00:19:27,025 All right, 267 00:19:28,075 --> 00:19:30,625 I realize that was a fair amount of notation, 268 00:19:31,475 --> 00:19:33,425 and as I proceed through 269 00:19:33,975 --> 00:19:36,850 the rest of the lecture today, or in future weeks as well, 270 00:19:37,350 --> 00:19:40,475 if some day you're looking at me write a symbol and you're wondering, 271 00:19:40,700 --> 00:19:45,075 gee, what was that [simple|symbol] lower case N again? Or what was that lower case X again, or whatever, 272 00:19:45,450 --> 00:19:47,150 please raise hand and I'll answer. This is 273 00:19:48,100 --> 00:19:51,225 a fair amount of notation. We'll probably all get used to it 274 00:19:51,725 --> 00:19:57,150 in a few days and we'll standardize notation and make a lot of our descriptions of learning algorithms a lot easier. 275 00:19:58,850 --> 00:20:05,525 But again, if you see me write some symbol and you don't quite remember what it means, chances are there are others in this class who've forgotten too. So please raise your hand and 276 00:20:07,050 --> 00:20:07,975 ask if you're ever wondering what some symbol means. 277 00:20:09,550 --> 00:20:11,525 Any questions you have about any of this? 278 00:20:19,025 --> 00:20:21,925 *Student:*The variable can be anything? [Inaudible]? 279 00:20:22,900 --> 00:20:27,025 *Instructor (Andrew Ng)*:Say that again. *Student:*[inaudible] zero theta one? 280 00:20:27,475 --> 00:20:30,350 *Instructor (Andrew Ng)*:Right, so, well let me 281 00:20:30,750 --> 00:20:34,275 282 00:20:34,600 --> 00:20:40,250 this was going to be next, but the theta or the theta i's are called the parameters. 283 00:20:41,000 --> 00:20:41,550 284 00:20:43,300 --> 00:20:45,175 285 00:20:45,775 --> 00:20:51,350 The thetas are called the parameters of our learning algorithm and theta zero, theta one, theta two are just real numbers. 286 00:20:52,225 --> 00:20:54,725 And then it is the job of the learning algorithm 287 00:20:55,525 --> 00:20:57,025 to use the training set 288 00:20:57,750 --> 00:21:00,650 to choose or to learn appropriate parameters theta. Okay, is there other questions? *Student:*What does [inaudible]? 289 00:21:06,550 --> 00:21:07,575 *Instructor (Andrew Ng)*:Oh, transpose. 290 00:21:08,250 --> 00:21:11,500 Oh yeah, sorry. 
When [inaudible] theta and theta transpose x, theta [inaudible].

*Student:* Is this like a [inaudible] hypothesis [inaudible], or would you have higher orders? Or would theta [inaudible]?

*Instructor (Andrew Ng)*: All great questions. So the question was, is this a typical hypothesis, or can theta be a function of other variables, and so on. And the answer is, sort of, yes. For now, just for this first learning algorithm, we'll talk about using a linear hypothesis class. A little bit later this quarter, actually, we'll talk about much more complicated hypothesis classes, and we'll actually talk about higher-order functions as well, a little bit later today.

Okay, so for the learning problem, then: how do we choose the parameters theta so that our hypothesis h will make accurate predictions about all the houses? All right, so one reasonable thing to do seems to be - well, we have a training set, and just on the training set our hypothesis will make some predictions, predictions of the housing prices, of the prices of the houses in the training set. So one thing we could do is just try to make the predictions of the learning algorithm accurate on the training set. So given some features x, and some correct prices y, we might want to make the squared difference between the prediction of the algorithm and the actual price small.

So to choose parameters theta, let's say we want to minimize, over the parameters theta, the squared error between the predicted price and the actual price. And so I'm going to fill this in. We have m training examples. So the sum from i equals one through m, over my m training examples, of the price predicted on the i-th house in my training set, minus the actual target variable, the actual price, on the i-th training example. And by convention, instead of minimizing this sum of the squared differences, I'm just going to put a one-half there, which will simplify some of the math we do later.

Okay, and so let me go ahead and define J of theta to be equal to just the same: one-half, sum from i equals one through m, my number of training examples, of the value predicted by my hypothesis minus the actual value.
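Written out in symbols, the cost function just defined on the board is the following (this is simply a restatement of the spoken definition):

```latex
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```

The factor of one-half is the convention mentioned above; it cancels the 2 that comes down when the square is differentiated later in the lecture.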
And so what we'll do, let's say, is minimize, as a function of the parameters theta, this quantity J of theta.

I should say, to those of you who have taken sort of linear algebra classes, or maybe basic statistics classes, some of you may have seen things like these before and seen least squares regression or least squares. Many of you will not have seen this before. I think some of you may have seen it before, but either way, regardless of whether you've seen it before, let's keep going. Just for those of you that have seen it before, I should say eventually we'll actually show that this algorithm is a special case of a much broader class of algorithms. But let's keep going. We'll get there eventually.

So I'm going to talk about a couple of different algorithms for performing that minimization over theta of J of theta. The first algorithm I'm going to talk about is a search algorithm, where the basic idea is we'll start with some value of my parameter vector theta. Maybe initialize my parameter vector theta to be the vector of all zeros - and, excuse me, I have to correct that: I sort of write zero with an arrow on top to denote the vector of all zeros. And then I'm going to keep changing my parameter vector theta to reduce J of theta a little bit, until we hopefully end up at the minimum with respect to theta of J of theta.

So switch to the laptops, please, and lower the big screen. So let me go ahead and show you an animation of this first algorithm for minimizing J of theta, which is an algorithm called gradient descent.

So here's the idea. You see on the display a plot, and the axes, the horizontal axes, are theta zero and theta one. We want to minimize J of theta, which is represented by the height of this plot. So the surface represents the function J of theta, and the axes of this function, or the inputs of this function, are the parameters theta zero and theta one, written down here below.

So here's the gradient descent algorithm. I'm going to choose some initial point. It could be the vector of all zeros or some randomly chosen point.
Let's say we start from that point denoted by the star, by the cross, and now I want you to imagine that this display actually shows a 3D landscape. Imagine you're all in a hilly park or something, and this is the 3D shape of, like, a hill in some park. So imagine you're actually standing physically at the position of that star, of that cross, and imagine you can stand on that hill, right, and look all 360 degrees around you and ask, if I were to take a small step, what would allow me to go downhill the most? Okay, just imagine that this is physically a hill and you're standing there, and you would look around and ask, "If I take a small step, what is the direction of steepest descent, that would take me downhill as quickly as possible?"

So the gradient descent algorithm does exactly that. I'm going to take a small step in this direction of steepest descent, or the direction that the gradient turns out to be. And then you take a small step and you end up at a new point, shown there, and you keep going. You're now at a new point on this hill, and again you're going to look around you, look all 360 degrees around you, and ask, "What is the direction that would take me downhill as quickly as possible?" And we want to go downhill as quickly as possible, because we want to find the minimum of J of theta. So you do that again. You can take another step, okay, and you sort of keep going until you end up at a local minimum of this function J of theta.

One property of gradient descent is where you end up - in this case, we ended up at this point on the lower left hand corner of this plot. But let's try running gradient descent again from a different position. So that was where I started gradient descent just now. Let's rerun gradient descent, but using a slightly different initial starting point, so a point slightly further up and further to the right.

So it turns out, if you run gradient descent from that point, then if you take a steepest descent direction again, that's your first step. And if you keep going, it turns out that with a slightly different initial starting point, you can actually end up at a completely different local optimum. Okay, so this is a property of gradient descent, and we'll come back to it in a second.
So be aware that gradient descent can sometimes depend on where you initialize your parameters, theta zero and theta one. Switch back to the chalkboard, please. Let's go ahead and work out the math of the gradient descent algorithm, and then we'll come back and revisit this issue of local optima.

So here's the gradient descent algorithm. We're going to repeatedly take a step in the direction of steepest descent, and it turns out that you can write that as follows: we're going to update the parameter theta_i as theta_i minus alpha times the partial derivative with respect to theta_i of J of theta. Okay, so this is how we're going to update the i-th parameter, theta_i, on each iteration of gradient descent.

Just a point of notation: I use this colon-equals notation to denote setting the variable on the left hand side equal to the value on the right hand side. All right, so if I write A colon-equals B, then what I'm saying is, this is part of a computer program, or this is part of an algorithm, where we take the value of B, the value on the right hand side, and use that to overwrite the value on the left hand side. In contrast, if I write A equals B, then this is an assertion of truth. I'm claiming that the value of A is equal to the value of B, whereas the colon-equals is a computer operation where we overwrite the value of A. If I write A equals B, then I'm asserting that the values of A and B are equal.

So let's see, this algorithm sort of makes sense - well, actually, let's just move on. Let's just go ahead and take this algorithm and apply it to our problem. And to work out gradient descent, let's take gradient descent and just apply it to our problem, and this being the first somewhat mathematical lecture, I'm going to step through derivations much more slowly and carefully than I will later in this quarter. We'll work through the steps of these in much more detail than I will later in this quarter. Let's actually work out what this gradient descent rule is.

So - and I'll do this just for the case where we have only one training example. Okay, so in this case we need to work out what the partial derivative, with respect to the parameter theta_i, of J of theta is.
440 00:32:30,750 --> 00:32:35,775 If we have only one training example 441 00:32:36,500 --> 00:32:40,775 then J of theta is going to be one-half of subscript theta, 442 00:32:41,275 --> 00:32:42,950 of X minus Y, script. 443 00:32:43,500 --> 00:32:47,200 So if you have only one training example comprising one pair, 444 00:32:47,775 --> 00:32:50,375 X, Y, then this is what J of theta is going to be. 445 00:32:51,475 --> 00:32:54,375 And so taking derivatives, 446 00:32:55,050 --> 00:32:56,275 you have one-half 447 00:32:56,725 --> 00:32:58,900 something squared. So the two comes down. 448 00:32:59,325 --> 00:33:04,075 So you have two times one-half times theta of X minus Y, 449 00:33:05,025 --> 00:33:06,425 and then by the 450 00:33:07,275 --> 00:33:10,175 [inaudible] derivatives, we also 451 00:33:10,900 --> 00:33:14,700 must apply this by the derivative of what's inside the square. 452 00:33:15,425 --> 00:33:18,700 453 00:33:19,475 --> 00:33:20,725 Right, 454 00:33:21,525 --> 00:33:26,325 the two and the one-half cancel. 455 00:33:28,500 --> 00:33:34,150 So this leaves [inaudible] times that, theta zero, X zero plus 456 00:33:35,475 --> 00:33:39,350 [inaudible]. 457 00:33:40,625 --> 00:33:44,450 Okay, and if you look inside this sum, 458 00:33:45,575 --> 00:33:49,025 we're taking the partial derivative of this sum 459 00:33:49,500 --> 00:33:51,175 with respect to the parameter theta I. 460 00:33:52,775 --> 00:33:59,025 But all the terms in the sum, except for one, do not depend on theta I. In this sum, 461 00:33:59,575 --> 00:34:05,175 the only term that depends on theta I will be some term here of theta_i, x_i. 462 00:34:06,175 --> 00:34:09,150 And so we take the partial derivative with respect to theta_i, x_i 463 00:34:10,050 --> 00:34:13,600 take the partial derivative with respect to theta_i of 464 00:34:14,425 --> 00:34:19,500 this term theta_i, x_i, and so you get that 465 00:34:20,175 --> 00:34:22,275 times x_i. 466 00:34:25,150 --> 00:34:31,125 Okay, and so this gives us our learning rule, right, 467 00:34:31,650 --> 00:34:34,800 of theta_i gets updated as theta_i 468 00:34:35,600 --> 00:34:39,175 minus alpha times that. 469 00:34:39,850 --> 00:34:43,450 470 00:34:46,125 --> 00:34:51,700 Okay, and this ... 471 00:34:53,500 --> 00:34:58,125 Greek alphabet alpha here is a parameter of the algorithm called the learning rate, 472 00:34:58,975 --> 00:35:02,150 and this parameter alpha controls 473 00:35:02,725 --> 00:35:07,325 how large a step you take. So you're standing on the hill. You decided 474 00:35:07,850 --> 00:35:10,800 what direction to take a step in, and so this parameter alpha 475 00:35:11,500 --> 00:35:16,625 controls how aggressive how large a step you take in this direction of steepest descent. 476 00:35:18,400 --> 00:35:19,700 And so if you 477 00:35:20,275 --> 00:35:23,550 and this is a parameter of the algorithm that's often set by hand. 478 00:35:24,075 --> 00:35:30,225 If you choose alpha to be too small [than|then] your steepest descent algorithm will take very tiny steps and take a long time to converge. 479 00:35:30,825 --> 00:35:32,550 If alpha is too large then 480 00:35:32,925 --> 00:35:36,275 the steepest descent may actually end up overshooting the minimum, 481 00:35:37,125 --> 00:35:37,725 if you're taking too aggressive a step. 482 00:35:40,300 --> 00:35:42,775 Yeah? *Student:*[Inaudible]. 483 00:35:46,700 --> 00:35:48,100 *Instructor (Andrew Ng)*:Say that again? 484 00:35:48,925 --> 00:35:52,100 *Student:*Isn't there a one over two missing somewhere? 
*Instructor (Andrew Ng)*: Is there a one-half missing? *Student:* I was [inaudible]. *Instructor (Andrew Ng)*: Thanks. I do make lots of errors in that. Any questions about this?

All right, so let me just wrap this up into an algorithm. So over there I derived the algorithm where you have just one training example; more generally, for m training examples, gradient descent becomes the following. We're going to repeat until convergence the following step: theta_i gets updated as theta_i - and I'm just writing out the appropriate equation for m examples rather than one example - theta_i gets updated as theta_i minus alpha times the sum from j equals one to m of h theta of x^(j) minus y^(j), times x_i^(j).

Okay, and I won't bother to show it, but you can go home and sort of verify for yourself that this summation here is indeed the partial derivative with respect to theta_i of J of theta, if you use the original definition of J of theta for when you have m training examples.

Okay, so I'm just going to - switch back to the laptop display, please - show you what this looks like when you run the algorithm. So it turns out that for the specific problem of linear regression, or ordinary least squares, which is what we're doing today, the function J of theta actually does not look like the nasty one with multiple local optima that I showed you just now. In particular, it turns out that for ordinary least squares, the function J of theta is just a quadratic function, and so it will always have a nice bowl shape, like what you see up here, and only have one global minimum with no other local optima.

So when you run gradient descent, here are actually the contours of the function J. So the contours of a bowl-shaped function like that are going to be ellipses, and if you run gradient descent on this algorithm, here's what you might get. Let's see, so I initialize the parameters.
So let's say randomly at the position of that cross over there, right, that cross on the upper right. And so after one iteration of gradient descent, as you move through the space of parameters - so if that's the result of one step of gradient descent - two steps, three steps, four steps, five steps, and so on, and it, you know, converges rapidly to the global minimum of this function J of theta. Okay, and this is a property of ordinary least squares regression with a linear hypothesis class: the function J of theta has no local optima. Yes, question?

*Student:* Is the alpha changing every time? Because the step is not [inaudible].

*Instructor (Andrew Ng)*: So it turns out that - yes, so it turns out this was done with a fixed value of alpha, and one of the properties of gradient descent is that as you approach the local minimum, it actually takes smaller and smaller steps, so it'll converge. And the reason is, the update is - you update theta by subtracting alpha times the gradient. And so as you approach the local minimum, the gradient also goes to zero. At the local minimum the gradient is zero, and as you approach the local minimum, the gradient gets smaller and smaller. And so gradient descent will automatically take smaller and smaller steps as you approach the local minimum. Make sense?

And here's the same plot - here's actually a plot of the housing prices data. So here, let's say you initialize the parameters to the vector of all zeros, and so this blue line at the bottom shows the hypothesis with the parameters at initialization. So initially theta zero and theta one are both zero, and so your hypothesis predicts that all prices are equal to zero. After one iteration of gradient descent, that's the blue line you get. After two iterations, three, four, five, and after a few more iterations, excuse me, it converges, and you've now found the least squares fit to the data.

Okay, let's switch back to the chalkboard. Are there questions about this? Yeah?

*Student:* [Inaudible] iteration, do we mean that we run each sample - all the sample cases - [inaudible] the new values?

*Instructor (Andrew Ng)*: Yes, right.

*Student:* And converged means that the value will be the same, [inaudible] roughly the same?
*Instructor (Andrew Ng)*: Yeah, so this is sort of a question of how do you test for convergence, and there are different ways of testing for convergence. One is you can look at two different iterations and see if theta has changed a lot, and if theta hasn't changed much within two iterations, you may say it's sort of more or less converged. Something that's done maybe slightly more often is to look at the value of J of theta, and if J of theta - if the quantity you're trying to minimize - is not changing much anymore, then you might be inclined to believe it's converged. So these are sort of standard heuristics, or standard rules of thumb, that are often used to decide if gradient descent has converged. Yeah?

*Student:* I may have missed something, but especially in [inaudible] descent. So one feature [inaudible] curve and can either go this way or that way. But the math at incline [inaudible] where that comes in. When do you choose whether you go left, whether you're going this way or that way?

*Instructor (Andrew Ng)*: I see. It just turns out that - so the question is, how is gradient descent looking 360 degrees around you and choosing the direction of steepest descent? So it actually turns out - I'm not sure I'll answer the second part - but it turns out that if you stand on the hill and if you ... it turns out that when you compute the gradient of the function, when you compute the derivative of the function, then it just turns out that that is indeed the direction of steepest descent. By the way, I just want to point out, you would never want to go in the opposite direction, because the opposite direction would actually be the direction of steepest ascent, right.

So as it turns out - maybe the TAs can talk a bit more about this at the section if there's interest - it turns out, when you take the derivative of a function, the derivative of the function sort of turns out to just give you the direction of steepest descent. And so you don't explicitly look all 360 degrees around you. You sort of just compute the derivative, and that turns out to be the direction of steepest descent. Yeah, maybe the TAs can talk a bit more about this on Friday.

Okay, let's see, so let me go ahead and give this algorithm a specific name.
So this algorithm here is actually called batch gradient descent, and the term batch isn't a great term, but the term batch refers to the fact that on every step of gradient descent you're going to look at your entire training set - you're going to perform a sum over your m training examples.

So it turns out that batch gradient descent often works very well. I use it very often, and it turns out that sometimes, if you have a really, really large training set - imagine that instead of having 47 houses from Portland, Oregon in our training set, you had, say, the U.S. Census database or something; with U.S. census-sized databases you often have hundreds of thousands or millions of training examples. So if m is a few million, then if you're running batch gradient descent, this means that to perform every step of gradient descent you need to perform a sum from j equals one to a million. That's sort of a lot of training examples for your computer program to have to look at before you can even take one step downhill on the function J of theta.

So it turns out that when you have very large training sets, you should write down an alternative algorithm that is called stochastic gradient descent. Sometimes I'll also call it incremental gradient descent, but the algorithm is as follows. Again, it will repeat until convergence, and will iterate for j equals one to m, and will perform one of these sorts of gradient descent updates using just the j-th training example.

Oh, and as usual, this is really - you update all the parameters theta. You perform this update for all values of i; for i indexing into the parameter vector, you just perform this update to all of your parameters simultaneously. And the advantage of this algorithm is that, in order to start learning, in order to start modifying the parameters, you only need to look at your first training example.
You look at your first training example and perform an update using the derivative of the error with respect to just your first training example, and then you look at your second training example and perform another update, and you sort of keep adapting your parameters much more quickly, without needing to scan over your entire U.S. Census database before you can even start adapting the parameters.

So let's see - for large data sets, stochastic gradient descent is often much faster, and what happens with stochastic gradient descent is that it won't actually converge to the global minimum exactly, but if these are the contours of your function, then after you run stochastic gradient descent you sort of tend to wander around. And you may actually end up going uphill occasionally, but your parameters will sort of tend to wander to the region closest to the global minimum, and then sort of keep wandering around a little bit near the region of the global minimum. And often that's just fine, to have a parameter that wanders around a little bit near the global minimum. And in practice, this often works much faster than batch gradient descent, especially if you have a large training set.

Okay, I'm going to clean a couple of boards. While I do that, why don't you take a look at the equations, and after I'm done cleaning the boards, I'll ask what questions you have.

Okay, so what questions do you have about all of this?

*Student:* [Inaudible] is it true - are you just sort of rearranging the order that you do the computation? So do you just use the first training example and update all of the theta_i's, and then step, and then update with the second training example, and update all the theta_i's, and then step? And is that why you get sort of this really?

*Instructor (Andrew Ng)*: Let's see, right. So I'm going to look at my first training example and then I'm going to take a step, and then I'm going to perform the second gradient descent update using my new parameter vector, which has already been modified using my first training example. And then I keep going. Make sense? Yeah?

*Student:* So in each update of all the theta_i's, you're only using ...
678 00:50:30,400 --> 00:50:34,875 *Instructor (Andrew Ng)*:One training example. *Student:*One training example. 679 00:50:36,875 --> 00:50:39,025 *Student:*[Inaudible]? 680 00:50:39,700 --> 00:50:43,150 *Instructor (Andrew Ng)*:Let's see, it's definitely empirical. 681 00:50:43,675 --> 00:50:47,100 I believe there's theory that sort of supports that as well. 682 00:50:48,500 --> 00:50:53,350 Yeah, the theory that supports that, the [inaudible] theorem, I don't remember. Okay, 683 00:50:59,700 --> 00:51:03,150 cool. So 684 00:51:05,050 --> 00:51:06,750 in what I've done so far, 685 00:51:07,975 --> 00:51:10,700 I've talked about an iterative algorithm 686 00:51:11,925 --> 00:51:15,750 for performing this minimization of J of theta. 687 00:51:16,925 --> 00:51:23,075 And it turns out that there's another way for this specific problem of least squares regression, of ordinary least squares. 688 00:51:23,600 --> 00:51:28,425 It turns out there's another way to perform this minimization of J of theta that allows you to 689 00:51:29,025 --> 00:51:35,300 solve for the parameters theta in closed form, without needing to run an iterative algorithm. 690 00:51:35,850 --> 00:51:40,575 And I know some of you may have seen some of what I'm about to do before, in 691 00:51:41,125 --> 00:51:44,500 like an undergraduate linear algebra course, and the way 692 00:51:45,500 --> 00:51:47,900 it's typically done requires 693 00:51:48,375 --> 00:51:52,800 [inaudible] projections, or taking lots of derivatives and writing lots of algebra. 694 00:51:53,075 --> 00:51:55,100 What I'd like to do is 695 00:51:55,650 --> 00:51:58,425 show you a way to derive the closed form 696 00:51:58,825 --> 00:52:02,075 solution for theta in just a few lines of algebra. 697 00:52:02,550 --> 00:52:06,900 But to do that, I'll need to introduce a new notation for matrix derivatives, 698 00:52:07,575 --> 00:52:10,900 and it turns out that, sort of, the notation I'm about to 699 00:52:12,950 --> 00:52:13,650 define here, 700 00:52:14,175 --> 00:52:19,500 just in my own personal work, has turned out to be one of the most useful things that I actually use all the time: 701 00:52:19,950 --> 00:52:23,925 to have a notation for how to take derivatives with respect to matrices, so that you can 702 00:52:24,400 --> 00:52:30,025 solve for the minimum of J of theta with, like, a few lines of algebra rather than writing out pages and pages of matrices and derivatives. 703 00:52:31,350 --> 00:52:37,400 So we're going to define this new notation first, and then we'll go ahead and work out the minimization. 704 00:52:38,225 --> 00:52:40,075 Given a function J, 705 00:52:41,000 --> 00:52:44,400 since J is a function of a vector of parameters theta, right, 706 00:52:45,075 --> 00:52:45,950 I'm going to define 707 00:52:46,450 --> 00:52:49,650 the derivative, the gradient of J with respect to theta, 708 00:52:50,425 --> 00:52:52,350 as itself a vector. 709 00:52:53,125 --> 00:52:56,975 710 00:52:57,550 --> 00:52:59,375 711 00:53:02,750 --> 00:53:07,950 Okay, and so this is going to be an (n+1) dimensional vector. 712 00:53:08,600 --> 00:53:12,200 Theta is an (n+1) dimensional vector with indices ranging from zero to n. 713 00:53:12,875 --> 00:53:16,000 And so I'm going to define this derivative to be equal to that. 714 00:53:17,525 --> 00:53:22,875 Okay, and so we can actually rewrite 715 00:53:23,400 --> 00:53:27,900 the gradient descent algorithm as follows.
This is batch gradient descent, 716 00:53:28,325 --> 00:53:34,350 and we write gradient descent as updating the parameter vector theta, and notice there's no subscript i now, 717 00:53:35,125 --> 00:53:42,675 updating the parameter vector theta as the previous parameter vector minus alpha times the gradient. 718 00:53:44,775 --> 00:53:51,275 Okay, and so in this equation all of these quantities, theta 719 00:53:52,050 --> 00:53:58,975 and this gradient vector, all of these are (n+1) dimensional vectors. 720 00:54:00,475 --> 00:54:03,575 I was using the boards out of order, wasn't I? 721 00:54:04,375 --> 00:54:08,975 722 00:54:09,725 --> 00:54:13,875 723 00:54:14,425 --> 00:54:15,100 724 00:54:15,700 --> 00:54:17,900 725 00:54:20,800 --> 00:54:23,350 So more generally, 726 00:54:24,675 --> 00:54:27,300 if you have a function f 727 00:54:27,650 --> 00:54:31,125 that maps from the space of matrices A, 728 00:54:32,600 --> 00:54:33,925 729 00:54:36,650 --> 00:54:37,850 that maps from, 730 00:54:38,350 --> 00:54:42,475 say, the space of n by n matrices to the space of real numbers. 731 00:54:42,900 --> 00:54:46,525 So if you have a function f of A, 732 00:54:47,000 --> 00:54:49,525 where A ... 733 00:54:50,025 --> 00:54:51,400 is an n by n matrix. 734 00:54:52,100 --> 00:54:56,600 So this function maps from matrices to real numbers; it's a function that takes as input a matrix. 735 00:54:57,675 --> 00:55:01,550 Let me define the derivative of f 736 00:55:02,150 --> 00:55:03,425 with respect to the matrix A. 737 00:55:04,375 --> 00:55:08,100 Now, I'm just taking the gradient of f with respect to its input, which is a matrix. 738 00:55:08,950 --> 00:55:12,925 I'm going to define this itself to be a matrix. 739 00:55:13,350 --> 00:55:15,575 740 00:55:16,600 --> 00:55:21,550 741 00:55:22,600 --> 00:55:25,300 742 00:55:26,075 --> 00:55:31,200 743 00:55:32,250 --> 00:55:33,825 744 00:55:36,475 --> 00:55:42,500 Okay, so the derivative of f with respect to A is itself a matrix, and that matrix contains 745 00:55:43,475 --> 00:55:44,950 all the partial derivatives of f 746 00:55:45,950 --> 00:55:47,075 with respect to the elements of A. 747 00:55:58,575 --> 00:56:00,825 One more definition: 748 00:56:01,275 --> 00:56:05,875 if A is a square matrix, so if A 749 00:56:06,525 --> 00:56:10,750 is an n by n matrix, number of rows equals number of columns, 750 00:56:11,325 --> 00:56:13,900 let me define the trace of A 751 00:56:14,450 --> 00:56:17,325 to be equal to the sum of A's diagonal elements. 752 00:56:18,475 --> 00:56:24,850 So this is just the sum over i of A_i,i. 753 00:56:26,300 --> 00:56:31,900 For those of you that haven't seen this sort of operator notation before, you can think of trace of A as 754 00:56:32,175 --> 00:56:35,175 the trace operator applied to the square matrix A, 755 00:56:35,600 --> 00:56:39,350 but it's more commonly written without the parentheses. So I usually write trace of A like this, 756 00:56:40,025 --> 00:56:41,650 and this just means the sum of the diagonal elements. 757 00:56:44,175 --> 00:56:46,750 So 758 00:56:47,575 --> 00:56:48,400 759 00:56:49,225 --> 00:56:51,750 here are some facts about the trace operator and about derivatives, 760 00:56:52,125 --> 00:56:56,775 and I'm just going to write these without proof. You can also have the TAs prove some of them in the discussion section, 761 00:56:57,300 --> 00:57:00,575 or you can actually go home and 762 00:57:01,125 --> 00:57:02,550 verify the proofs of all of these.
763 00:57:03,550 --> 00:57:07,675 It turns out that given two matrices, 764 00:57:08,350 --> 00:57:14,350 A and B, the trace of the matrix A times B is equal to 765 00:57:14,875 --> 00:57:20,275 the trace of B A. Okay, I'm not going to prove this, but you should be able to go home and prove this yourself 766 00:57:20,950 --> 00:57:21,925 without too much difficulty. 767 00:57:23,425 --> 00:57:24,725 And similarly, 768 00:57:25,100 --> 00:57:26,125 769 00:57:26,875 --> 00:57:31,375 for the trace of a product of three matrices, you can take the matrix at the end and 770 00:57:31,725 --> 00:57:35,925 cyclically permute it to the front. So the trace of A times B times C 771 00:57:36,275 --> 00:57:38,125 is equal to the trace of C A B. 772 00:57:38,775 --> 00:57:40,700 So take the matrix C at the back and move it to the front, 773 00:57:41,200 --> 00:57:44,550 and this is also equal to the trace of 774 00:57:45,050 --> 00:57:47,375 B C A. Take the matrix B and move it to the front. Okay, 775 00:57:54,850 --> 00:57:57,900 also, 776 00:57:58,750 --> 00:58:05,075 suppose you have a function f of A which is defined as the trace of A B. 777 00:58:06,425 --> 00:58:07,150 Okay, so this is, 778 00:58:07,900 --> 00:58:14,600 right, the trace is a real number. So the trace of A B is a function that takes as input a matrix A and outputs a real number. 779 00:58:16,250 --> 00:58:22,175 So then the derivative with respect to the matrix A of this function, the trace of A B, 780 00:58:25,650 --> 00:58:26,550 781 00:58:27,050 --> 00:58:31,975 is going to be B transpose. And this is just another fact that you can 782 00:58:32,875 --> 00:58:36,525 prove by yourself by going back and referring to the definitions of traces and matrix derivatives. 783 00:58:37,000 --> 00:58:39,100 I'm not going to prove it. You should work it out. 784 00:58:40,875 --> 00:58:43,175 Okay, and lastly a couple of easy ones. 785 00:58:43,550 --> 00:58:48,925 The trace of A is equal to the trace of A transpose because the trace is just the sum of the diagonal elements, 786 00:58:49,800 --> 00:58:54,675 and if you transpose the matrix, the diagonal doesn't change. 787 00:58:55,450 --> 00:58:59,125 And if lower case a is a real number, then 788 00:58:59,825 --> 00:59:05,150 the trace of a real number is just itself. So think of a real number as a one by one matrix. 789 00:59:05,750 --> 00:59:09,475 So the trace of a one by one matrix is just whatever that real number is. 790 00:59:12,375 --> 00:59:13,975 791 00:59:15,575 --> 00:59:20,825 And lastly, this is a somewhat tricky one. 792 00:59:21,850 --> 00:59:23,275 793 00:59:24,850 --> 00:59:27,750 794 00:59:28,825 --> 00:59:33,525 The derivative with respect to the matrix A of the trace of A B A transpose C 795 00:59:34,150 --> 00:59:36,125 is equal to C A B 796 00:59:36,625 --> 00:59:40,925 plus C transpose A 797 00:59:41,625 --> 00:59:47,075 B transpose. And I won't prove that either. This is sort of just algebra. Work it out yourself. 798 00:59:48,500 --> 00:59:49,325 799 00:59:50,800 --> 00:59:53,325 800 00:59:55,325 --> 00:59:59,200 Okay, and so, I guess the key equations ... 801 01:00:01,325 --> 01:00:03,050 the key facts I'm going to use again 802 01:00:06,550 --> 01:00:08,550 about traces and matrix derivatives are these five. [Ten minutes.]
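[Note: a quick numerical sanity check of these facts, again a sketch rather than lecture material; it verifies the trace identities directly and checks the two matrix-derivative facts against finite differences on random 4 by 4 matrices.]

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))

# Cyclic permutation: tr(AB) = tr(BA) and tr(ABC) = tr(CAB) = tr(BCA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))

def numeric_grad(f, A, eps=1e-6):
    # Finite-difference estimate of the matrix of partial derivatives df/dA_ij.
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

# grad_A tr(AB) = B^T
assert np.allclose(numeric_grad(lambda M: np.trace(M @ B), A), B.T)

# grad_A tr(A B A^T C) = C A B + C^T A B^T
assert np.allclose(numeric_grad(lambda M: np.trace(M @ B @ M.T @ C), A),
                   C @ A @ B + C.T @ A @ B.T)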
803 01:00:14,050 --> 01:00:14,975 Okay, 804 01:00:16,625 --> 01:00:17,150 so 805 01:00:18,225 --> 01:00:22,650 armed with these facts, 806 01:00:23,175 --> 01:00:24,675 let's try to come up with a quick derivation 807 01:00:25,450 --> 01:00:28,100 for how to minimize J of theta as a function of 808 01:00:28,550 --> 01:00:31,850 theta in closed form, without needing to use an iterative algorithm. 809 01:00:35,850 --> 01:00:37,525 So let's work this out. Let me define 810 01:00:37,875 --> 01:00:41,450 the matrix X, which is called the design matrix, 811 01:00:41,975 --> 01:00:45,600 to be the matrix containing all the inputs from my training set. 812 01:00:47,125 --> 01:00:53,075 So x_1 was the vector of inputs, the vector of features, for my first training example. 813 01:00:53,700 --> 01:00:54,900 So I'm going to 814 01:00:56,675 --> 01:01:00,100 set x_1 transpose to be the first row of this matrix X, 815 01:01:00,900 --> 01:01:02,725 816 01:01:03,300 --> 01:01:04,300 817 01:01:04,850 --> 01:01:09,125 put my second training example in the second row, and so on. 818 01:01:09,350 --> 01:01:12,275 And I have m training examples, and so 819 01:01:13,600 --> 01:01:15,875 that's going to be my 820 01:01:16,925 --> 01:01:22,350 design matrix X. Okay, that defines the matrix capital X, and so 821 01:01:24,500 --> 01:01:28,875 now, let me take this matrix X and multiply it by my parameter vector theta. 822 01:01:29,375 --> 01:01:33,250 This derivation will just take two or three steps. So X times theta, 823 01:01:34,475 --> 01:01:40,950 remember how matrix-vector multiplication goes: you take this vector and you multiply it by each of the rows of the matrix. 824 01:01:42,100 --> 01:01:44,725 So X times theta is just going to be 825 01:01:45,150 --> 01:01:48,550 x_1 transpose theta, 826 01:01:48,925 --> 01:01:51,225 dot, dot, dot, down to x_m 827 01:01:53,975 --> 01:01:54,825 transpose theta. 828 01:01:55,725 --> 01:01:57,400 And this is, of course, just 829 01:01:58,125 --> 01:02:00,350 830 01:02:00,975 --> 01:02:06,800 831 01:02:07,650 --> 01:02:12,325 the predictions of your hypothesis on each of your m training examples. 832 01:02:14,900 --> 01:02:15,650 833 01:02:16,100 --> 01:02:18,650 Let me also define the y vector 834 01:02:19,925 --> 01:02:24,500 to be the vector of all the target values y_1 through y_m 835 01:02:25,175 --> 01:02:29,500 in my training set. Okay, so the y vector is an m dimensional vector. 836 01:02:31,250 --> 01:02:35,325 837 01:02:36,100 --> 01:02:38,000 838 01:02:39,425 --> 01:02:42,325 839 01:02:43,825 --> 01:02:45,325 840 01:02:48,525 --> 01:02:51,525 So 841 01:02:53,075 --> 01:02:55,100 X theta minus y, 842 01:02:55,675 --> 01:02:58,575 continuing the math from the previous board, this is going to be, 843 01:02:59,075 --> 01:03:02,750 844 01:03:03,475 --> 01:03:04,025 845 01:03:04,925 --> 01:03:10,425 846 01:03:14,875 --> 01:03:21,875 right, and now, X theta minus y, this is a vector. This is an m dimensional vector if there are m training examples, 847 01:03:22,925 --> 01:03:27,025 and so I'm actually going to take this vector and take the inner product of it with itself. 848 01:03:27,425 --> 01:03:32,200 Okay, so recall that if z is a vector 849 01:03:32,875 --> 01:03:36,450 then z transpose z is just the sum over i of 850 01:03:36,675 --> 01:03:40,325 z_i squared. Right, that's how you take the inner product of a vector with itself.
851 01:03:41,325 --> 01:03:44,600 So you want to take this vector, X theta minus y, 852 01:03:44,950 --> 01:03:49,200 and take the inner product 853 01:03:49,650 --> 01:03:50,975 of this vector with itself, 854 01:03:51,925 --> 01:03:52,900 and so that gives me 855 01:03:54,625 --> 01:03:55,600 856 01:03:56,250 --> 01:03:57,425 857 01:03:57,850 --> 01:04:04,975 the sum from i equals one to m of (h_theta(x_i) - y) squared, 858 01:04:06,075 --> 01:04:11,000 okay, since it's just the sum of the squares of the elements of this vector. 859 01:04:12,425 --> 01:04:14,175 And 860 01:04:14,900 --> 01:04:18,700 we put a one-half there in front. 861 01:04:19,875 --> 01:04:23,625 This is our previous definition of J of theta. Okay, yeah? 862 01:04:30,025 --> 01:04:32,850 *Student:*[Inaudible]? 863 01:04:33,350 --> 01:04:36,750 *Instructor (Andrew Ng)*:Yeah, I threw a lot of notation at you today. 864 01:04:37,175 --> 01:04:39,700 So m is the number of training examples, and 865 01:04:40,250 --> 01:04:42,875 the training examples are indexed from one through m, 866 01:04:44,000 --> 01:04:47,225 and then it's the feature vector that is indexed from zero through n. Does that make sense? 867 01:04:49,575 --> 01:04:52,775 So this is the sum from one through m. 868 01:04:53,450 --> 01:04:55,175 It's sort of 869 01:04:57,050 --> 01:05:02,875 theta transpose x that's equal to the sum from j equals zero through n of 870 01:05:03,250 --> 01:05:05,100 theta_j x_j. Does that make sense? 871 01:05:08,425 --> 01:05:10,775 It's the features that are 872 01:05:11,200 --> 01:05:14,425 indexed from zero through n, where x_0 is equal to one, 873 01:05:15,100 --> 01:05:16,950 whereas the training examples are indexed from one through m. 874 01:05:21,875 --> 01:05:26,625 So let me clean a few more boards, and then take another look at this and make sure it all makes sense. 875 01:05:27,700 --> 01:05:31,575 876 01:05:33,425 --> 01:05:37,525 877 01:05:39,100 --> 01:05:41,075 878 01:05:43,800 --> 01:05:46,250 879 01:05:46,650 --> 01:05:49,625 880 01:05:50,550 --> 01:05:55,875 881 01:05:58,425 --> 01:05:59,450 882 01:06:02,175 --> 01:06:03,550 Okay, yeah? 883 01:06:08,600 --> 01:06:11,950 *Student:*[Inaudible] the y inside the parentheses, shouldn't that be [inaudible]? *Instructor (Andrew Ng)*:Oh, yes, thank you. Oh is that what you meant? 884 01:06:13,800 --> 01:06:17,875 Yes, thank you. Great, i-th training example. Anything else? Cool. 885 01:06:26,850 --> 01:06:33,425 So we're actually nearly done with this derivation. We would like to minimize J of theta 886 01:06:34,300 --> 01:06:35,750 with respect to theta, 887 01:06:36,500 --> 01:06:41,050 and we've written J of theta fairly compactly using this matrix-vector notation. 888 01:06:42,750 --> 01:06:45,900 So in order to minimize J of theta with respect to theta, 889 01:06:47,425 --> 01:06:51,225 what we're going to do is take the derivative with respect to theta of J of theta, 890 01:06:52,375 --> 01:06:57,725 set this to zero, and solve for theta. Okay, 891 01:07:02,150 --> 01:07:05,475 so we have 892 01:07:06,625 --> 01:07:11,500 the derivative with respect to theta of 893 01:07:12,425 --> 01:07:18,675 that, which is equal to ... 894 01:07:21,075 --> 01:07:21,775 I should mention, 895 01:07:22,025 --> 01:07:28,600 there will be some steps here that I'm just going to do fairly quickly without proof.
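[Note: a small check with made-up numbers, not from the lecture, confirming that the matrix form one-half times (X theta - y) transpose (X theta - y) agrees with the earlier sum-of-squares definition of J of theta.]

import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
X = np.column_stack([np.ones(m), rng.standard_normal((m, n))])  # first column is x_0 = 1
y = rng.standard_normal(m)
theta = rng.standard_normal(n + 1)

# Sum-of-squares form: J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
J_sum = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(m))

# Matrix form: J(theta) = 1/2 * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_mat = 0.5 * (r @ r)

assert np.isclose(J_sum, J_mat)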
So is it really true that the derivative of one-half of that is one-half of the derivative? 896 01:07:29,000 --> 01:07:29,825 I've already exchanged 897 01:07:30,425 --> 01:07:33,475 the derivative and the one-half. In terms of the answer, yes, but 898 01:07:34,000 --> 01:07:40,200 later on you should go home and look through the lecture notes and make sure you understand and believe why every step is correct. 899 01:07:40,625 --> 01:07:42,575 I'm going to do things relatively quickly here, 900 01:07:43,025 --> 01:07:46,450 and you can work through every step yourself more slowly by referring to the lecture notes. 901 01:07:48,650 --> 01:07:53,325 Okay, so that's equal to ... I'm now going to expand this quadratic function. So 902 01:07:53,725 --> 01:07:55,225 this is going to be, 903 01:07:56,125 --> 01:07:57,750 904 01:07:58,700 --> 01:08:01,950 905 01:08:02,875 --> 01:08:06,500 906 01:08:07,050 --> 01:08:10,050 okay, 907 01:08:11,150 --> 01:08:15,975 and this is just sort of taking a quadratic function and expanding it out by multiplying out the terms. And again, 908 01:08:16,600 --> 01:08:18,000 work through the steps later yourself 909 01:08:19,400 --> 01:08:22,650 if you're not quite sure how I did that. So 910 01:08:24,175 --> 01:08:27,200 this thing, this vector-vector product, right, 911 01:08:27,925 --> 01:08:32,475 this quantity here, this is just J of theta and so it's just a real number, 912 01:08:33,500 --> 01:08:35,100 and the trace of a real number is just itself. *Student:*[Inaudible]. 913 01:08:38,225 --> 01:08:39,925 *Instructor (Andrew Ng)*:Oh, thanks, Dan. 914 01:08:41,275 --> 01:08:46,525 Cool, great. So this quantity in parentheses, this is J of theta and it's just a real number. 915 01:08:47,425 --> 01:08:53,175 And so the trace of a real number is just the same real number. And so you can sort of apply the trace operator without changing anything. 916 01:08:57,625 --> 01:09:03,600 And this is equal to one-half the derivative with respect to theta of the trace of, 917 01:09:05,700 --> 01:09:09,650 by the cyclic permutation property of the trace, you can take this theta at the end 918 01:09:10,325 --> 01:09:14,050 and move it to the front. So this is going to be the trace of theta 919 01:09:15,000 --> 01:09:19,025 times theta transpose X transpose X, 920 01:09:22,250 --> 01:09:24,425 minus 921 01:09:26,125 --> 01:09:31,600 the derivative with respect to theta of the trace of, again, I'm going to 922 01:09:32,325 --> 01:09:33,825 take that and bring it to the ... 923 01:09:35,475 --> 01:09:37,800 oh, sorry. 924 01:09:39,575 --> 01:09:45,875 Actually, this thing here is also a real number, and the transpose of a real number is just itself. Right, 925 01:09:46,775 --> 01:09:49,600 so you can take the transpose of a real number without changing anything. 926 01:09:50,450 --> 01:09:52,550 So let me go ahead and just take the transpose of this. 927 01:09:53,025 --> 01:09:56,025 A real number transposed is just the same real number. So 928 01:09:57,200 --> 01:10:02,650 this is minus the trace of, taking the transpose of that, y transpose X theta, then 929 01:10:04,400 --> 01:10:09,175 minus the trace of y transpose X theta. Okay, 930 01:10:11,150 --> 01:10:15,100 and this last quantity, y transpose y, doesn't actually depend on theta. 931 01:10:15,350 --> 01:10:18,550 So when I take the derivative of this last term with respect to theta, it's just zero. So just drop that term.
932 01:10:21,550 --> 01:10:26,125 And lastly, 933 01:10:29,550 --> 01:10:35,575 well, the derivative with respect to theta of the trace of theta theta transpose 934 01:10:37,125 --> 01:10:38,900 X transpose X. 935 01:10:39,450 --> 01:10:45,250 I'm going to use one of the facts I wrote down earlier without proof, and I'm going 936 01:10:46,400 --> 01:10:52,325 to let this theta be A. There's an identity matrix in there, so this is A, B, A transpose, 937 01:10:53,100 --> 01:10:58,350 C, and using the rule that I wrote down previously, which you'll find in the lecture notes, because 938 01:10:59,025 --> 01:11:01,675 it's still on one of the boards from earlier, 939 01:11:02,825 --> 01:11:05,050 this is just equal to 940 01:11:06,125 --> 01:11:09,550 X transpose X theta. 941 01:11:10,925 --> 01:11:14,050 942 01:11:15,600 --> 01:11:20,150 So this is C, 943 01:11:21,150 --> 01:11:22,850 A, 944 01:11:24,075 --> 01:11:29,700 B, where B is just the identity matrix, which you can ignore, plus X transpose X 945 01:11:31,600 --> 01:11:36,575 theta, where this is now C transpose A, 946 01:11:37,425 --> 01:11:41,425 again times the identity, which we're going to ignore, times B transpose. 947 01:11:42,400 --> 01:11:45,375 And the matrix X transpose X is symmetric, so C transpose is equal to C. 948 01:11:49,350 --> 01:11:55,375 Similarly, the derivative with respect to theta of the trace of 949 01:11:56,475 --> 01:11:57,525 950 01:12:00,075 --> 01:12:02,200 y transpose X theta, 951 01:12:03,325 --> 01:12:06,750 this is the ... 952 01:12:07,900 --> 01:12:08,600 953 01:12:09,400 --> 01:12:13,925 derivative with respect to a matrix A of the trace of B A, 954 01:12:14,600 --> 01:12:19,425 and this is just X transpose y. This is just 955 01:12:20,175 --> 01:12:23,675 B transpose, again, by one of the rules that I wrote down earlier. 956 01:12:26,150 --> 01:12:26,675 And so 957 01:12:27,550 --> 01:12:33,900 958 01:12:35,425 --> 01:12:39,825 959 01:12:41,500 --> 01:12:42,575 960 01:12:45,575 --> 01:12:48,100 if you plug this back in, we find, therefore, that 961 01:12:48,625 --> 01:12:50,750 the derivative ... wow, this board's really bad. 962 01:12:52,175 --> 01:12:57,125 963 01:12:58,450 --> 01:13:03,075 964 01:13:03,875 --> 01:13:05,250 965 01:13:06,250 --> 01:13:12,225 So if you plug this back into our formula for the derivative of J, 966 01:13:12,750 --> 01:13:17,475 you find that the derivative with respect to theta of J of theta is equal to 967 01:13:18,825 --> 01:13:23,100 one-half times X transpose X theta 968 01:13:23,775 --> 01:13:26,450 plus X transpose X theta, 969 01:13:27,500 --> 01:13:31,900 minus X transpose y, minus X transpose y, 970 01:13:32,825 --> 01:13:38,075 which is just X transpose X theta minus X transpose y. 971 01:13:41,325 --> 01:13:43,250 972 01:13:44,175 --> 01:13:50,200 Okay, so we set this to zero, and we get 973 01:13:51,400 --> 01:13:52,650 that X transpose X theta equals X transpose y, 974 01:13:53,575 --> 01:13:56,375 975 01:13:57,150 --> 01:14:01,550 976 01:14:02,425 --> 01:14:04,600 which is called the normal equation, 977 01:14:05,700 --> 01:14:07,250 and 978 01:14:08,100 --> 01:14:09,275 we can now solve this 979 01:14:10,050 --> 01:14:12,200 equation for theta 980 01:14:12,975 --> 01:14:18,075 981 01:14:19,575 --> 01:14:24,325 in closed form. That's theta equals X transpose X, inverse, times X transpose y.
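[Note: a minimal sketch of how the normal equation might be used in code; the housing-style numbers are made up for illustration, and solving the linear system X transpose X theta = X transpose y is generally preferable to forming the inverse explicitly.]

import numpy as np

def normal_equation(X, y):
    # Closed-form least squares: solve X^T X theta = X^T y for theta.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative made-up data: fit price = theta_0 + theta_1 * size
sizes = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
prices = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
X = np.column_stack([np.ones_like(sizes), sizes])   # design matrix with x_0 = 1
theta = normal_equation(X, prices)
print(theta)   # [theta_0, theta_1]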
982 01:14:24,925 --> 01:14:26,200 And so this gives us a way 983 01:14:29,175 --> 01:14:32,125 of solving for the least squares fit to the parameters in closed form, 984 01:14:32,450 --> 01:14:34,375 without needing to use an iterative algorithm like gradient descent. 985 01:14:36,550 --> 01:14:41,100 Okay, and using this matrix-vector notation, I think, 986 01:14:42,125 --> 01:14:48,075 I don't know, I think we did this whole thing in about ten minutes, which we couldn't have done if I was writing out reams of algebra. 987 01:14:52,300 --> 01:14:58,100 Okay, some of you look a little bit dazed, but this is our first learning algorithm. Aren't you excited? 988 01:14:59,550 --> 01:15:03,000 Any quick questions about this before we close for today? 989 01:15:06,900 --> 01:15:07,875 *Student:*[Inaudible]. 990 01:15:08,500 --> 01:15:11,350 *Instructor (Andrew Ng)*:Say that again. *Student:*What you derived, wasn't that just [inaudible] of X? 991 01:15:11,575 --> 01:15:15,550 *Instructor (Andrew Ng)*:What inverse? *Student:*Pseudo inverse. *Instructor (Andrew Ng)*:Pseudo inverse? *Student:*Pseudo inverse. 992 01:15:17,250 --> 01:15:17,825 *Instructor (Andrew Ng)*:Yeah, 993 01:15:18,200 --> 01:15:24,700 it turns out that in cases where X transpose X is not invertible, then you use the pseudo inverse to solve this. 994 01:15:25,950 --> 01:15:31,150 But if it turns out X transpose X is not invertible, that usually means your features were linearly dependent. 995 01:15:31,750 --> 01:15:36,525 It usually means you did something like repeat the same feature twice in your training set. 996 01:15:37,175 --> 01:15:41,075 So if this is not invertible, it turns out the minimum is obtained by using the pseudo inverse instead of the inverse. 997 01:15:41,950 --> 01:15:45,425 If you don't know what I just said, don't worry about it. It usually won't be a problem. 998 01:15:46,525 --> 01:15:47,050 Anything else? 999 01:15:48,175 --> 01:15:49,125 *Student:*On the second board [inaudible]? 1000 01:15:49,700 --> 01:15:50,275 1001 01:15:50,575 --> 01:15:59,950 *Instructor (Andrew Ng)*:Let me take that off. We're running over. Let's close for today, and if there are other questions, I'll take them after.
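[Note: relating to the pseudo-inverse exchange above, a short sketch, not from the lecture, with a deliberately repeated feature so that X transpose X is singular; np.linalg.pinv and np.linalg.lstsq both still return a least-squares minimizer in that case.]

import numpy as np

rng = np.random.default_rng(2)
m = 6
x1 = rng.standard_normal(m)
X = np.column_stack([np.ones(m), x1, x1])   # the same feature twice, so X^T X is singular
y = rng.standard_normal(m)

theta_pinv = np.linalg.pinv(X) @ y                     # pseudo-inverse solution
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # SVD-based least-squares solver

# Both minimize J(theta); the fitted values agree even though theta itself is not unique.
assert np.allclose(X @ theta_pinv, X @ theta_lstsq)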