1
00:00:01,425 --> 00:00:09,725
2
00:00:11,600 --> 00:00:15,625
This presentation is delivered by the Stanford Center for Professional Development.
3
00:00:23,650 --> 00:00:27,075
OK. So let's get started with today's material.
4
00:00:31,075 --> 00:00:31,775
So
5
00:00:32,025 --> 00:00:37,775
welcome back to the second lecture. What I want to do today is talk about linear regression,
6
00:00:38,275 --> 00:00:41,025
gradient descent, and the normal equations.
7
00:00:42,600 --> 00:00:43,450
And I should also say,
8
00:00:43,725 --> 00:00:51,025
lecture notes have been posted online and so if some of the math I go over today, I go over rather quickly, if you want to
9
00:00:51,575 --> 00:00:59,350
see every equation written out and work through the details more slowly yourself, go to the course homepage and download detailed lecture notes
10
00:00:59,900 --> 00:01:04,575
that pretty much describe all the mathematical and technical content I'm going to go over today.
11
00:01:06,300 --> 00:01:10,075
Today, I'm also going to delve into a fair amount - some amount of linear algebra,
12
00:01:10,450 --> 00:01:14,550
and so if you would like to see a refresher on linear algebra,
13
00:01:15,550 --> 00:01:21,225
this week's discussion section will be taught by the TAs and will be a refresher on linear algebra. So
14
00:01:21,825 --> 00:01:22,875
if some of the linear algebra
15
00:01:23,225 --> 00:01:28,775
I talk about today sort of seems to be going by pretty quickly, or if you just want to see some of the things
16
00:01:29,500 --> 00:01:33,925
I'm claiming today without proof, if you want to just see some of those things written out in detail,
17
00:01:34,300 --> 00:01:36,225
you can come to this week's discussion section.
18
00:01:37,525 --> 00:01:38,650
19
00:01:40,825 --> 00:01:44,225
So I just want to start by showing you a fun video.
20
00:01:44,700 --> 00:01:49,300
Remember at the last lecture, the initial lecture, I talked about supervised learning.
21
00:01:50,150 --> 00:01:56,250
And supervised learning was this machine-learning problem where I said we're going to tell the algorithm
22
00:01:56,625 --> 00:01:59,000
what the "right" answer is for
23
00:01:59,400 --> 00:02:04,700
a number of examples, and then we want the algorithm to replicate more of the same. So
24
00:02:05,125 --> 00:02:06,125
the example I had
25
00:02:06,725 --> 00:02:11,625
at the first lecture was the problem of predicting housing prices, where you may have a training set,
26
00:02:12,250 --> 00:02:14,550
and we tell the algorithm what the "right"
27
00:02:14,825 --> 00:02:17,275
housing price was for every house in the training set.
28
00:02:18,025 --> 00:02:24,725
And then you want the algorithm to learn the relationship between sizes of houses and the prices, and essentially produce more of
29
00:02:24,950 --> 00:02:25,900
the "right" answer.
30
00:02:26,525 --> 00:02:30,075
So let me show you a video now. Load the big screen, please.
31
00:02:30,875 --> 00:02:32,550
So I'll show you a video now
32
00:02:33,375 --> 00:02:37,425
that was from Dean Pomerleau, on some work he did at Carnegie Mellon
33
00:02:37,725 --> 00:02:41,200
applying supervised learning to get a car to drive itself.
34
00:02:42,425 --> 00:02:48,850
This is work on a vehicle known as Alvin. It was done sort of about 15 years ago,
35
00:02:50,600 --> 00:02:52,075
and I think it was a ...
36
00:02:53,475 --> 00:02:56,600
it was a very elegant example of
37
00:02:57,575 --> 00:03:00,225
the sorts of things you can get supervised learning algorithms to do.
38
00:03:00,875 --> 00:03:01,475
39
00:03:01,900 --> 00:03:08,000
On the video, you'll hear Dean Pomerleau's voice mention an algorithm called a neural network. I'll say a little bit about that later,
40
00:03:09,250 --> 00:03:15,750
but the essential learning algorithm for this is something called gradient descent, which I will talk about later in today's lecture.
41
00:03:16,650 --> 00:03:17,550
Let's watch ...
42
00:03:18,725 --> 00:03:20,250
Let's watch the video.
43
00:03:24,000 --> 00:03:30,575
[Video plays, Alvin is a system of artificial neural networks that learn to steer by watching a person drive.]
44
00:03:31,650 --> 00:03:34,325
[Video plays, Alvin is designed to control the [inaudible] tools,]
45
00:03:35,000 --> 00:03:38,825
[Video plays, modified army [inaudible], equipped with sensors, computers,]
46
00:03:39,300 --> 00:03:40,050
[Video plays, and actuators]
47
00:03:40,575 --> 00:03:43,150
[Video plays, for autonomous navigation experiments.]
48
00:03:44,350 --> 00:03:49,275
[Video plays, The initial step in configuring Alvin is training the networks to steer.]
49
00:03:50,550 --> 00:03:54,875
[Video plays, During training, a person drives the vehicle while Alvin watches.]
50
00:04:00,150 --> 00:04:01,475
[Video plays, Once every two seconds,...]
51
00:04:02,375 --> 00:04:07,925
[Video plays, Alvin digitizes the video image of the road ahead and records the person's steering direction.]
52
00:04:14,450 --> 00:04:15,650
[Video plays, ]
53
00:04:16,100 --> 00:04:20,650
[Video plays, This training image is reduced in resolution to 30 by 32 pixels]
54
00:04:21,125 --> 00:04:24,200
[Video plays, and provided as input to Alvin's three-layer network.]
55
00:04:25,350 --> 00:04:26,025
*Instructor (Andrew Ng)*:So two comments,
56
00:04:26,250 --> 00:04:30,900
right. One is this is supervised learning because it's learning from a human driver,
57
00:04:31,300 --> 00:04:32,700
in which a human driver demonstrates:
58
00:04:33,025 --> 00:04:38,175
on this segment of the road, I will steer at this angle; on this segment of the road, I'll steer at this angle.
59
00:04:38,725 --> 00:04:40,225
And so the human provides a number of
60
00:04:40,550 --> 00:04:41,200
"correct"
61
00:04:41,800 --> 00:04:43,175
steering directions to the car,
62
00:04:43,675 --> 00:04:46,050
and then it's the job of the car to try to learn
63
00:04:46,475 --> 00:04:50,925
to produce more of these "correct" steering directions that keep the car on the road.
64
00:04:52,400 --> 00:04:55,925
On the monitor display up here, I just want to
65
00:04:56,400 --> 00:05:01,400
tell you a little bit about what this display means. So on the upper left where the mouse pointer is moving,
66
00:05:01,900 --> 00:05:05,625
this horizontal line actually shows the human steering direction,
67
00:05:06,200 --> 00:05:08,325
and this white bar, or
68
00:05:08,975 --> 00:05:11,175
this white area right here shows
69
00:05:11,600 --> 00:05:17,525
the steering direction chosen by the human driver, by moving the steering wheel. The human is steering
70
00:05:17,975 --> 00:05:21,100
a little bit to the left here indicated by the position of this white region.
71
00:05:21,625 --> 00:05:22,250
72
00:05:22,850 --> 00:05:25,175
This second line here, where the mouse is pointing,
73
00:05:25,875 --> 00:05:28,925
the second line here is the output of the learning algorithm, and
74
00:05:29,300 --> 00:05:32,275
showing where the learning algorithm currently thinks the right steering direction is.
75
00:05:32,900 --> 00:05:37,925
And right now what you're seeing is the learning algorithm just at the very beginning of training, and so it just has no idea
76
00:05:38,375 --> 00:05:39,650
of where to steer. And so
77
00:05:39,950 --> 00:05:44,000
its output is this little white smear spread over the entire range of steering directions.
78
00:05:44,600 --> 00:05:48,325
And as the algorithm collects more examples and learns over time, you see
79
00:05:48,700 --> 00:05:51,400
it start to more confidently choose a steering direction.
80
00:05:52,200 --> 00:05:53,175
So let's keep watching the video.
81
00:05:54,575 --> 00:05:57,000
[Video plays, Using the backpropagation learning algorithm,]
82
00:05:57,900 --> 00:06:00,800
[Video plays, Alvin is trained to output the same steering direction]
83
00:06:01,125 --> 00:06:03,375
[Video plays, as the human driver for that image.]
84
00:06:05,475 --> 00:06:09,375
[Video plays, Initially, the network's steering response is at random.]
85
00:06:16,275 --> 00:06:18,575
[Video plays, After about two minutes of training,]
86
00:06:19,325 --> 00:06:23,950
[Video plays, The network learns to accurately imitate the steering reactions of the human driver.]
87
00:06:36,250 --> 00:06:39,850
[Video plays, The same training procedure is repeated for other road types.]
88
00:06:42,300 --> 00:06:44,075
[Video plays, After the networks have been trained,]
89
00:06:44,700 --> 00:06:48,450
[Video plays, The operator pushes the run switch, and Alvin begins driving.]
90
00:06:53,225 --> 00:06:54,450
[Video plays, Twelve times per second, ]
91
00:06:54,950 --> 00:06:56,450
[Video plays, Alvin digitizes an image]
92
00:06:56,875 --> 00:07:01,175
[Video plays, and feeds it to the neural networks.]
93
00:07:06,225 --> 00:07:08,250
[Video plays, Each network running in parallel]
94
00:07:08,975 --> 00:07:13,575
[Video plays, produces a steering direction and a measure of confidence in its response.]
95
00:07:21,925 --> 00:07:24,650
[Video plays, The steering direction from the most confident network]
96
00:07:25,375 --> 00:07:28,000
[Video plays, in this case, the network trained on the one-lane road]
97
00:07:28,600 --> 00:07:31,200
[Video plays, is used to control the vehicle.]
98
00:07:40,675 --> 00:07:41,200
[Video plays, Suddenly, ]
99
00:07:41,675 --> 00:07:48,575
[Video plays, an intersection appears ahead of the vehicle.]
100
00:07:53,825 --> 00:07:54,350
[Video plays, ]
101
00:07:56,000 --> 00:08:01,400
[Video plays, As the vehicle approaches the intersection, the confidence of the one-lane network decreases.]
102
00:08:05,950 --> 00:08:07,500
[Video plays, ]
103
00:08:08,800 --> 00:08:15,000
[Video plays, As it crosses the intersection and the two-lane road ahead comes into view, ]
104
00:08:15,525 --> 00:08:18,175
[Video plays, the confidence of the two-lane network rises.]
105
00:08:23,325 --> 00:08:28,025
[Video plays, When its confidence rises, the two-lane network is selected to steer,]
106
00:08:28,875 --> 00:08:32,275
[Video plays, Safely guiding the vehicle into its lane on the two-lane road.]
107
00:08:35,950 --> 00:08:36,700
[Video plays, ]
108
00:08:39,725 --> 00:08:43,500
*Instructor (Andrew Ng)*:All right, so who thought driving could be that dramatic, right?
109
00:08:44,425 --> 00:08:45,950
Switch back to the chalkboard, please.
110
00:08:46,675 --> 00:08:47,200
111
00:08:47,700 --> 00:08:51,775
I should say, this work was done about 15 years ago and
112
00:08:52,250 --> 00:08:54,950
autonomous driving has come a long way. So many of you
113
00:08:55,300 --> 00:08:57,600
will have heard of the DARPA Grand Challenge, where
114
00:08:57,950 --> 00:08:59,625
one of my colleagues, Sebastian Thrun,
115
00:08:59,900 --> 00:09:03,700
led the winning team that got a car to drive across a desert by itself.
116
00:09:04,200 --> 00:09:07,375
So Alvin was, I think, absolutely amazing work for its time,
117
00:09:08,000 --> 00:09:11,875
but autonomous driving has obviously come a long way since then.
118
00:09:16,000 --> 00:09:19,725
So what you just saw was an example, again, of supervised learning,
119
00:09:20,250 --> 00:09:22,200
and in particular it was an example of
120
00:09:22,700 --> 00:09:24,450
what they call the regression problem,
121
00:09:24,975 --> 00:09:29,750
because the vehicle is trying to predict a continuous-valued variable, the continuous-valued steering direction,
122
00:09:31,175 --> 00:09:33,625
that's what we call the regression problem.
123
00:09:34,175 --> 00:09:35,025
124
00:09:35,675 --> 00:09:42,225
And what I want to do today is talk about our first supervised learning algorithm, and it will also be applied to a regression task.
125
00:09:43,475 --> 00:09:45,025
So
126
00:09:46,975 --> 00:09:49,075
for the running example that I'm going to use
127
00:09:49,550 --> 00:09:54,500
throughout today's lecture, we're going to return to the example of trying to predict housing prices.
128
00:09:55,125 --> 00:09:56,950
So
129
00:09:57,850 --> 00:10:02,775
here's actually a dataset
130
00:10:04,725 --> 00:10:11,400
collected by a TA, Dan Ramage, on housing prices in Portland, Oregon.
131
00:10:15,275 --> 00:10:17,650
So
132
00:10:18,350 --> 00:10:20,075
133
00:10:20,700 --> 00:10:23,900
134
00:10:24,525 --> 00:10:29,950
here's a dataset of a number of
135
00:10:30,475 --> 00:10:32,525
houses of different sizes,
136
00:10:33,375 --> 00:10:34,000
and
137
00:10:34,700 --> 00:10:38,100
138
00:10:38,775 --> 00:10:45,775
here are their asking prices in thousands of dollars, so 200 means $200,000.
139
00:10:47,000 --> 00:10:47,575
140
00:10:48,825 --> 00:10:49,900
141
00:10:50,650 --> 00:10:51,575
142
00:10:52,325 --> 00:10:52,850
143
00:10:53,600 --> 00:10:54,675
And so
144
00:10:55,675 --> 00:10:58,975
we can take this data and plot it,
145
00:10:59,400 --> 00:11:04,125
square feet versus price,
146
00:11:05,000 --> 00:11:05,950
and
147
00:11:06,625 --> 00:11:09,150
so you may get a dataset like that. And the question is,
148
00:11:09,825 --> 00:11:13,125
given a dataset like this, or given what we call a training set like this,
149
00:11:13,650 --> 00:11:16,200
how do you learn to predict the relationship between the size of the house
150
00:11:16,825 --> 00:11:17,600
and the price of the house?
151
00:11:18,800 --> 00:11:25,600
So I'm actually going to come back and modify this task a little bit more later, but let me go ahead and introduce
152
00:11:25,975 --> 00:11:30,575
some notation, which I'll be using, actually, throughout the rest of this course.
153
00:11:31,300 --> 00:11:35,550
The first piece of notation is I'm going to let
154
00:11:36,750 --> 00:11:38,350
the lowercase letter m
155
00:11:38,975 --> 00:11:40,625
denote the number of training examples,
156
00:11:41,025 --> 00:11:45,300
and that just means the number of rows, or the number of examples, houses, and prices we have.
157
00:11:45,775 --> 00:11:48,825
And in this particular dataset, we have,
158
00:11:49,750 --> 00:11:54,850
as it happens, we have 47 training examples, although I wrote down only five. Okay,
159
00:11:56,100 --> 00:11:59,875
so
160
00:12:00,200 --> 00:12:01,550
161
00:12:02,150 --> 00:12:08,100
throughout this quarter, I'm going to use the letter m
162
00:12:08,700 --> 00:12:13,950
to denote the number of training examples.
163
00:12:15,100 --> 00:12:21,525
I'm going to use the lowercase letter x
164
00:12:22,250 --> 00:12:22,800
to denote
165
00:12:23,275 --> 00:12:27,925
the input variables,
166
00:12:28,325 --> 00:12:32,475
167
00:12:33,000 --> 00:12:37,025
which I'll often also call the features. And so in this case,
168
00:12:37,600 --> 00:12:40,725
x would denote the size of the house we're looking at.
169
00:12:41,750 --> 00:12:48,075
I'm going to use Y to denote the "output"
170
00:12:48,425 --> 00:12:49,525
171
00:12:49,925 --> 00:12:50,650
172
00:12:51,500 --> 00:12:56,500
variable, which is sometimes also called a target ...
173
00:12:57,000 --> 00:12:58,950
174
00:13:00,150 --> 00:13:00,975
target variable,
175
00:13:01,900 --> 00:13:03,125
and so
176
00:13:04,225 --> 00:13:10,400
one pair, x, y, is what comprises one training example.
177
00:13:10,675 --> 00:13:11,575
178
00:13:12,075 --> 00:13:16,450
In other words, one row on the table I drew just now
179
00:13:16,975 --> 00:13:18,925
would be what I call one training example,
180
00:13:19,250 --> 00:13:21,150
181
00:13:22,675 --> 00:13:28,075
and the i-th training example,
182
00:13:29,025 --> 00:13:30,200
183
00:13:30,725 --> 00:13:34,500
in other words the i-th row in that table, I'm going to write as
184
00:13:34,975 --> 00:13:37,625
(x^(i), y^(i)).
185
00:13:38,125 --> 00:13:39,175
186
00:13:41,550 --> 00:13:43,750
Okay, and so
187
00:13:44,250 --> 00:13:51,100
in this notation I'm going to use, this superscript i is not exponentiation. So this is not x to the power of i, y to the power of i.
188
00:13:51,650 --> 00:13:52,700
In this notation,
189
00:13:54,025 --> 00:13:59,550
the superscript i in parentheses is just sort of an index into the ith row
190
00:14:00,125 --> 00:14:01,800
of my list of training examples.
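In symbols, the notation just defined reads:

```latex
% m = number of training examples; x = input variable / feature (living area);
% y = output / target variable (price); the parenthesized superscript is an
% index into the list of examples, not a power
\text{training set} \;=\; \bigl\{ \bigl(x^{(i)},\, y^{(i)}\bigr) \;;\; i = 1, \dots, m \bigr\}
```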
191
00:14:03,650 --> 00:14:05,750
So
192
00:14:06,950 --> 00:14:11,100
in supervised learning, this is
193
00:14:11,500 --> 00:14:14,500
what we're going to do. We're given a training set,
194
00:14:15,250 --> 00:14:16,675
195
00:14:17,650 --> 00:14:18,700
196
00:14:19,300 --> 00:14:22,950
and we're going to feed our training set, comprising our M
197
00:14:23,250 --> 00:14:27,300
training examples, so 47 training examples, into a learning algorithm.
198
00:14:28,150 --> 00:14:33,300
199
00:14:34,625 --> 00:14:36,100
Okay, and
200
00:14:36,975 --> 00:14:43,100
our learning algorithm then outputs a function that, by tradition and for historical reasons,
201
00:14:43,675 --> 00:14:45,325
is usually denoted
202
00:14:46,175 --> 00:14:50,025
by the lowercase letter h, and is called a hypothesis.
203
00:14:50,525 --> 00:14:52,725
204
00:14:53,425 --> 00:14:59,750
Don't worry too much about whether the term hypothesis has a deep meaning. It's more a term that's used for historical reasons.
205
00:15:00,675 --> 00:15:02,425
206
00:15:03,225 --> 00:15:07,775
And the hypothesis's job is to take this input.
207
00:15:08,200 --> 00:15:11,200
Say there's some new house whose price you want to estimate.
208
00:15:12,325 --> 00:15:18,625
What the hypothesis does is it takes this input, the living area of a new house in square feet, say,
209
00:15:19,050 --> 00:15:20,475
and outputs
210
00:15:21,200 --> 00:15:26,200
an estimate of the price of this house.
211
00:15:27,150 --> 00:15:32,925
So the hypothesis H maps from inputs X to outputs Y.
212
00:15:34,425 --> 00:15:35,050
213
00:15:37,650 --> 00:15:41,475
So in order to design a learning algorithm, the first thing we have to decide is
214
00:15:42,000 --> 00:15:43,975
how we want to represent the hypothesis, right.
215
00:15:45,100 --> 00:15:48,950
And just for the purposes of this lecture, for the purposes of our first learning algorithm,
216
00:15:49,475 --> 00:15:53,175
I'm going to use a linear representation for the hypothesis. So
217
00:15:53,600 --> 00:15:57,725
I'm going to represent my hypothesis as H of X equals
218
00:15:58,325 --> 00:15:59,225
theta zero,
219
00:16:00,075 --> 00:16:01,925
220
00:16:02,950 --> 00:16:04,850
plus theta_1 x, where
221
00:16:06,050 --> 00:16:08,725
X here is an input feature, and so that's
222
00:16:09,225 --> 00:16:12,200
the size of the house we're considering.
223
00:16:13,275 --> 00:16:16,800
And more generally, and I'll come back to this,
224
00:16:18,025 --> 00:16:19,650
more generally for many
225
00:16:20,075 --> 00:16:24,550
regression problems we may have more than one input feature. So for example,
226
00:16:25,300 --> 00:16:32,825
if instead of just knowing the size of the houses, we also know
227
00:16:33,675 --> 00:16:37,800
the number of bedrooms in these houses,
228
00:16:38,900 --> 00:16:45,000
let's say, then,
229
00:16:48,200 --> 00:16:53,400
230
00:16:54,075 --> 00:16:55,575
so if our ...
231
00:16:55,875 --> 00:16:59,300
training set also has a second feature, the number of bedrooms in the house,
232
00:17:00,000 --> 00:17:00,925
then
233
00:17:01,350 --> 00:17:04,525
you might let, say, x_1 denote
234
00:17:05,125 --> 00:17:07,425
the size in square feet.
235
00:17:07,925 --> 00:17:12,225
Let x subscript 2, x_2, denote the number of bedrooms,
236
00:17:12,775 --> 00:17:14,050
237
00:17:14,725 --> 00:17:21,625
and then I would write the hypothesis, H of X, as
238
00:17:22,875 --> 00:17:24,100
239
00:17:25,175 --> 00:17:30,600
theta_0 plus theta_1 x_1 plus
240
00:17:31,575 --> 00:17:33,025
theta_2 x_2. Okay,
241
00:17:34,025 --> 00:17:36,550
and sometimes when I want to
242
00:17:36,800 --> 00:17:41,350
take the hypothesis h and make its dependence on the thetas explicit,
243
00:17:41,750 --> 00:17:43,625
I'll sometimes write this as
244
00:17:43,900 --> 00:17:45,275
H subscript theta of X.
245
00:17:45,850 --> 00:17:46,600
And so this is
246
00:17:47,275 --> 00:17:48,425
the price
247
00:17:49,125 --> 00:17:50,525
that my hypothesis predicts
248
00:17:51,625 --> 00:17:53,025
a house with features X costs.
249
00:17:55,100 --> 00:17:59,725
So given the new house with features X, a certain size and a certain number of bedrooms,
250
00:18:00,325 --> 00:18:03,550
this is going to be the price that my hypothesis predicts this house is going to cost.
251
00:18:08,200 --> 00:18:13,000
One final piece of notation,
252
00:18:13,700 --> 00:18:17,825
so for conciseness,
253
00:18:18,475 --> 00:18:22,075
254
00:18:23,225 --> 00:18:29,525
just to write this a bit more compactly I'm going to take the convention of defining x_0 to be equal to one,
255
00:18:30,275 --> 00:18:35,075
and so I can now write H of X to be equal to
256
00:18:35,900 --> 00:18:38,575
sum from i equals one to two
257
00:18:39,300 --> 00:18:43,775
of theta_i, oh sorry, zero to two, theta_i, x_i.
258
00:18:44,275 --> 00:18:48,875
And if you think of theta and x as vectors, then this is just
259
00:18:49,700 --> 00:18:50,500
"theta transpose x."
260
00:18:51,625 --> 00:18:53,075
And
261
00:18:54,000 --> 00:18:58,125
the very final piece of notation is I'm also going to let
262
00:18:58,725 --> 00:19:03,950
lowercase n denote ...
263
00:19:05,550 --> 00:19:10,875
lowercase n be the number of features in my learning problem. And so this actually becomes a sum ...
264
00:19:12,625 --> 00:19:15,425
265
00:19:16,250 --> 00:19:21,825
a sum from i equals zero to n, where in this example, since we have two features, n would be equal to two.
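Written out, the compact form of the hypothesis just described, using the x_0 = 1 convention, is:

```latex
% n = number of features (n = 2 in the bedrooms example); x_0 = 1 by convention
h_\theta(x) \;=\; \sum_{i=0}^{n} \theta_i x_i \;=\; \theta^{T} x
```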
266
00:19:25,875 --> 00:19:27,025
All right,
267
00:19:28,075 --> 00:19:30,625
I realize that was a fair amount of notation,
268
00:19:31,475 --> 00:19:33,425
and as I proceed through
269
00:19:33,975 --> 00:19:36,850
the rest of the lecture today, or in future weeks as well,
270
00:19:37,350 --> 00:19:40,475
if someday you see me write a symbol and you're wondering,
271
00:19:40,700 --> 00:19:45,075
gee, what was that symbol lowercase n again? Or what was that lowercase x again, or whatever,
272
00:19:45,450 --> 00:19:47,150
please raise your hand and I'll answer. This is
273
00:19:48,100 --> 00:19:51,225
a fair amount of notation. We'll probably all get used to it
274
00:19:51,725 --> 00:19:57,150
in a few days and we'll standardize notation and make a lot of our descriptions of learning algorithms a lot easier.
275
00:19:58,850 --> 00:20:05,525
But again, if you see me write some symbol and you don't quite remember what it means, chances are there are others in this class who've forgotten too. So please raise your hand and
276
00:20:07,050 --> 00:20:07,975
ask if you're ever wondering what some symbol means.
277
00:20:09,550 --> 00:20:11,525
Any questions you have about any of this?
278
00:20:19,025 --> 00:20:21,925
*Student:*The variable can be anything? [Inaudible]?
279
00:20:22,900 --> 00:20:27,025
*Instructor (Andrew Ng)*:Say that again. *Student:*[inaudible] zero theta one?
280
00:20:27,475 --> 00:20:30,350
*Instructor (Andrew Ng)*:Right, so, well let me
281
00:20:30,750 --> 00:20:34,275
282
00:20:34,600 --> 00:20:40,250
this was going to be next, but the thetas, or the theta_i's, are called the parameters.
283
00:20:41,000 --> 00:20:41,550
284
00:20:43,300 --> 00:20:45,175
285
00:20:45,775 --> 00:20:51,350
The thetas are called the parameters of our learning algorithm and theta zero, theta one, theta two are just real numbers.
286
00:20:52,225 --> 00:20:54,725
And then it is the job of the learning algorithm
287
00:20:55,525 --> 00:20:57,025
to use the training set
288
00:20:57,750 --> 00:21:00,650
to choose or to learn appropriate parameters theta. Okay, are there other questions? *Student:*What does [inaudible]?
289
00:21:06,550 --> 00:21:07,575
*Instructor (Andrew Ng)*:Oh, transpose.
290
00:21:08,250 --> 00:21:11,500
Oh yeah, sorry. When [inaudible]
291
00:21:11,900 --> 00:21:16,325
theta and theta transpose X, theta [inaudible].
292
00:21:17,800 --> 00:21:23,875
*Student:*Is this like a [inaudible]
293
00:21:24,700 --> 00:21:30,325
hypothesis [inaudible], or would you have higher orders? Or would theta [inaudible]?
294
00:21:30,850 --> 00:21:32,850
*Instructor (Andrew Ng)*:All
295
00:21:33,350 --> 00:21:39,400
great questions. So the question was, is this a typical hypothesis, or can theta be a ...
296
00:21:39,975 --> 00:21:41,425
function of other variables and so on.
297
00:21:41,900 --> 00:21:46,400
And the answer is sort of yes. For now, just for this first
298
00:21:47,425 --> 00:21:50,200
learning algorithm we'll talk about using a linear hypothesis class.
299
00:21:50,800 --> 00:21:55,675
A little bit actually later this quarter, we'll talk about much more complicated hypothesis classes,
300
00:21:56,300 --> 00:21:59,900
and we'll actually talk about higher order functions as well, a little bit later today.
301
00:22:03,425 --> 00:22:06,300
Okay,
302
00:22:07,800 --> 00:22:12,300
so for the learning problem then. How do we choose the parameters theta
303
00:22:13,225 --> 00:22:17,000
so that our hypothesis h will make accurate predictions about all the houses?
304
00:22:18,200 --> 00:22:20,975
All right, so one reasonable thing to do
305
00:22:21,375 --> 00:22:24,175
seems to be, well, we have a training set. So
306
00:22:24,525 --> 00:22:29,175
and just on the training set, our hypothesis will make some prediction,
307
00:22:30,000 --> 00:22:32,675
predictions of the housing prices, of
308
00:22:33,800 --> 00:22:39,525
the prices of the houses in the training set. So one thing we could do is just try to make
309
00:22:40,375 --> 00:22:45,900
the predictions of a learning algorithm accurate on a training set. So given
310
00:22:46,875 --> 00:22:51,475
some features, X, and some correct prices, Y, we might want to make
311
00:22:52,075 --> 00:22:56,750
the squared difference between the prediction of the algorithm and the actual price small.
312
00:22:58,575 --> 00:23:05,425
So to choose parameters theta, let's say we want to minimize, over the parameters theta, the squared error
313
00:23:05,925 --> 00:23:07,775
between the predicted price and the actual price.
314
00:23:09,750 --> 00:23:15,075
And so, to fill this in: we have m training examples.
315
00:23:15,575 --> 00:23:19,050
So the sum from i equals one through m of my m training examples,
316
00:23:19,625 --> 00:23:20,250
317
00:23:21,125 --> 00:23:25,850
of price predicted on the i-th house in my training set, minus the actual
318
00:23:26,425 --> 00:23:29,775
target variable, the actual price of the i-th training example.
319
00:23:30,700 --> 00:23:34,750
And by convention, instead of minimizing this
320
00:23:35,650 --> 00:23:38,075
sum of the squared differences, I'm just going to put a one-half there,
321
00:23:38,750 --> 00:23:41,775
which will simplify
322
00:23:42,375 --> 00:23:44,550
some of the math we do later.
323
00:23:45,375 --> 00:23:51,700
Okay, and so let me go ahead and define J of theta to be equal to just the same, one-half
324
00:23:52,375 --> 00:23:56,150
sum from i equals one through m, the number of training examples,
325
00:23:56,550 --> 00:24:01,200
of the value
326
00:24:01,625 --> 00:24:04,675
predicted by my hypothesis minus the actual value, squared.
327
00:24:05,800 --> 00:24:09,125
And so
328
00:24:09,700 --> 00:24:14,225
what we'll do, let's say, is minimize
329
00:24:15,175 --> 00:24:18,325
as a function of the parameters theta, this quantity J of theta.
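So the quantity just defined is:

```latex
% One-half the sum of squared errors over the m training examples
J(\theta) \;=\; \frac{1}{2} \sum_{i=1}^{m} \Bigl( h_\theta\bigl(x^{(i)}\bigr) - y^{(i)} \Bigr)^{2}
```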
330
00:24:20,125 --> 00:24:26,675
I should say, to those of you who have taken sort of linear algebra classes, or maybe
331
00:24:27,500 --> 00:24:32,625
basic statistics classes, some of you may have seen things like these before and
332
00:24:33,150 --> 00:24:36,525
seen least-squares regression or least squares.
333
00:24:37,475 --> 00:24:41,225
Many of you will not have seen this before. I think some of you may have seen it before,
334
00:24:41,700 --> 00:24:45,050
but either way, regardless of whether you've seen it before, let's keep going.
335
00:24:45,925 --> 00:24:49,925
Just for those of you that have seen it before, I should say eventually, we'll actually show that
336
00:24:50,625 --> 00:24:53,750
this algorithm is a special case of a much broader class of algorithms.
337
00:24:54,775 --> 00:24:57,400
But let's keep going. We'll get there eventually.
338
00:24:58,800 --> 00:25:04,375
339
00:25:05,200 --> 00:25:11,425
So I'm going to talk about a couple of different algorithms for performing that minimization over theta of J of theta.
340
00:25:12,000 --> 00:25:15,400
The first algorithm I'm going to talk about is a search algorithm,
341
00:25:15,975 --> 00:25:19,600
where the basic idea is we'll start with some ...
342
00:25:19,825 --> 00:25:22,050
343
00:25:24,375 --> 00:25:27,800
value of my parameter vector theta.
344
00:25:28,300 --> 00:25:34,950
Maybe initialize my parameter vector theta to be the vector of all zeros,
345
00:25:36,175 --> 00:25:37,975
and excuse me,
346
00:25:38,775 --> 00:25:44,875
I have to correct that: I sort of write zero with an arrow on top to denote the vector of all zeros.
347
00:25:45,675 --> 00:25:49,975
And then I'm going to keep changing
348
00:25:51,475 --> 00:25:57,350
my parameter vector theta to reduce
349
00:25:57,925 --> 00:26:02,400
J of theta a little bit, until we hopefully end up at
350
00:26:02,850 --> 00:26:05,625
the minimum with respect to theta of J of theta.
351
00:26:06,550 --> 00:26:11,250
So switch to the laptop, please, and lower the big screen.
352
00:26:12,350 --> 00:26:13,950
So let me go ahead and
353
00:26:17,825 --> 00:26:19,425
show you an animation of this
354
00:26:19,800 --> 00:26:25,275
first algorithm for minimizing J of theta, which is an algorithm called gradient descent.
355
00:26:29,525 --> 00:26:30,625
So here's the idea. You see
356
00:26:31,425 --> 00:26:33,300
on the display a plot
357
00:26:33,775 --> 00:26:38,150
and the axes, the horizontal axes are theta zero and theta one.
358
00:26:38,650 --> 00:26:39,575
We're trying to minimize
359
00:26:40,100 --> 00:26:43,725
J of theta, which is represented by the height of this plot. So the
360
00:26:45,000 --> 00:26:47,050
surface represents the function J of theta
361
00:26:47,525 --> 00:26:52,700
and the axes of this function, or the inputs of this function are the parameters theta zero and theta one, written down here below.
362
00:26:54,050 --> 00:26:56,100
So here's the gradient descent algorithm.
363
00:26:57,525 --> 00:27:01,525
I'm going to choose some initial point. It could be vector of all zeros or some randomly chosen point.
364
00:27:02,050 --> 00:27:07,125
Let's say we start from that point denoted by the star, by the cross,
365
00:27:09,050 --> 00:27:11,525
and now I want you to imagine that
366
00:27:13,000 --> 00:27:17,550
this display actually shows a 3D landscape. Imagine you're all in a hilly park or something,
367
00:27:17,950 --> 00:27:19,425
and this is the 3D shape of,
368
00:27:19,700 --> 00:27:20,875
like, a hill in some park.
369
00:27:22,600 --> 00:27:26,575
So imagine you're actually standing physically at the position of
370
00:27:27,500 --> 00:27:29,050
that star, of that cross,
371
00:27:29,900 --> 00:27:31,600
and imagine
372
00:27:32,075 --> 00:27:37,600
you can stand on that hill, right, and look all 360 degrees around you and ask,
373
00:27:38,250 --> 00:27:41,375
if I were to take a small step, what would allow me to go downhill the most? Okay,
374
00:27:42,400 --> 00:27:47,575
just imagine that this is physically a hill and you're standing there, and would look around ask, "If I take a small step,
375
00:27:48,200 --> 00:27:51,500
what is the direction of steepest descent, that would take me downhill as quickly as possible?"
376
00:27:53,225 --> 00:27:55,225
So the gradient descent algorithm does exactly that.
377
00:27:56,025 --> 00:28:00,800
I'm going to take a small step in this direction of steepest descent, which turns out to be the direction of the negative gradient.
378
00:28:01,775 --> 00:28:05,100
And then you take a small step and you end up at a new point shown there,
379
00:28:06,250 --> 00:28:06,900
and it would keep going.
380
00:28:07,550 --> 00:28:12,025
You're now at a new point on this hill, and again you're going to look around you,
381
00:28:12,350 --> 00:28:14,400
look all 360 degrees around you,
382
00:28:14,925 --> 00:28:18,950
and ask, "What is the direction that would take me downhill as quickly as possible?"
383
00:28:19,350 --> 00:28:21,150
And we want to go downhill as quickly as possible,
384
00:28:21,600 --> 00:28:23,425
because we want to find the minimum of J of theta.
385
00:28:25,350 --> 00:28:26,050
So you do that again.
386
00:28:26,800 --> 00:28:29,950
You can take another step, okay, and you sort of keep going
387
00:28:31,475 --> 00:28:36,200
until you end up at a local minimum of this function, J of theta.
388
00:28:38,200 --> 00:28:41,175
One property of gradient descent is that
389
00:28:41,675 --> 00:28:45,575
where you end up in this case, we ended up at this point on the ...
390
00:28:46,025 --> 00:28:48,750
lower left hand corner of this plot.
391
00:28:49,375 --> 00:28:50,075
But
392
00:28:50,825 --> 00:28:54,800
let's try running gradient descent again from a different position. So that
393
00:28:55,475 --> 00:29:00,950
was where I started gradient descent just now. Let's rerun gradient descent, but using a slightly different initial starting point,
394
00:29:01,750 --> 00:29:03,600
so a point slightly further ...
395
00:29:04,525 --> 00:29:05,425
up and further to the right.
396
00:29:06,750 --> 00:29:09,250
So it turns out if you run gradient descent from that point,
397
00:29:09,800 --> 00:29:12,650
then if you take a steepest descent direction again,
398
00:29:13,825 --> 00:29:17,775
that's your first step. And if you keep going,
399
00:29:19,075 --> 00:29:24,050
it turns out that with a slightly different initial starting point, you can actually end up at a completely different local optimum. Okay,
400
00:29:25,700 --> 00:29:31,275
so this is a property of gradient descent, and we'll come back to it in a second. So be aware that gradient descent
401
00:29:32,325 --> 00:29:35,150
can sometimes depend on where you initialize
402
00:29:35,525 --> 00:29:40,050
your parameters, theta zero and theta one. Switch back to the chalkboard, please.
403
00:29:41,475 --> 00:29:48,500
Let's go ahead and work out the math of the gradient descent algorithm. Then we'll come back and revisit this issue of local optimum.
404
00:29:49,625 --> 00:29:51,450
So
405
00:29:54,975 --> 00:29:56,875
406
00:29:58,375 --> 00:30:04,175
407
00:30:05,775 --> 00:30:08,225
here's the gradient descent algorithm.
408
00:30:09,050 --> 00:30:15,275
We're going to repeatedly take a step in the direction of steepest descent, and it turns out that you can write that as [inaudible],
409
00:30:16,025 --> 00:30:19,400
which is we're going to update the parameters theta as
410
00:30:20,725 --> 00:30:23,700
theta_i minus
411
00:30:24,675 --> 00:30:29,225
the partial derivative with respect to theta_i,
412
00:30:29,875 --> 00:30:35,750
J of theta. Okay, so this is how we're going to update the i-th parameter, theta_i,
413
00:30:36,325 --> 00:30:39,100
how we're going to update theta_i on each iteration of gradient descent.
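The update rule being dictated here, written out (alpha is the learning-rate parameter introduced a little later in the lecture):

```latex
% One gradient descent step, applied to each parameter theta_i
\theta_i \;:=\; \theta_i \;-\; \alpha \, \frac{\partial}{\partial \theta_i} J(\theta)
```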
414
00:30:41,375 --> 00:30:45,750
Just a point of notation, I use this colon equals notation
415
00:30:46,400 --> 00:30:49,975
to denote setting a variable on the left hand side
416
00:30:50,250 --> 00:30:51,375
equal to the variable on the right hand side. All right,
417
00:30:52,475 --> 00:30:55,575
so if I write A colon equals B,
418
00:30:56,150 --> 00:31:00,050
then what I'm saying is, this is part of a computer program, or this is part of an algorithm
419
00:31:00,650 --> 00:31:03,300
where we take the value of B, the value on the right hand side,
420
00:31:03,650 --> 00:31:05,575
and use that to overwrite the value on the left hand side.
421
00:31:06,800 --> 00:31:09,575
In contrast, if I write A equals B,
422
00:31:10,250 --> 00:31:12,950
then this is an assertion of truth.
423
00:31:13,475 --> 00:31:16,400
I'm claiming that the value of A is equal to the value of B,
424
00:31:17,700 --> 00:31:21,425
whereas this is computer operation where we overwrite the value of A.
425
00:31:22,125 --> 00:31:25,100
If I write A equals B then I'm asserting that the values of A and B are equal.
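A minimal illustration of this distinction in Python, where plain = plays the role of the lecture's := and == plays the role of the assertion:

```python
a, b = 7, 3
a = b          # the lecture's "a := b": overwrite a with the value of b
print(a == b)  # the lecture's "a = b" as an assertion: tests equality -> True
```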
426
00:31:26,700 --> 00:31:30,775
So let's see,
427
00:31:31,000 --> 00:31:32,875
this algorithm sort of makes sense well,
428
00:31:36,600 --> 00:31:40,575
actually let's just move on. Let's just go ahead and take this algorithm and apply it to our problem.
429
00:31:41,050 --> 00:31:44,225
430
00:31:44,725 --> 00:31:47,250
431
00:31:48,850 --> 00:31:52,750
And to work out gradient descent,
432
00:31:53,775 --> 00:31:59,350
let's take gradient descent and just apply it to our problem, and this being the first
433
00:31:59,850 --> 00:32:01,825
somewhat mathematical lecture, I'm going to
434
00:32:02,500 --> 00:32:07,000
step through derivations much more slowly and carefully than I will later in this quarter.
435
00:32:07,400 --> 00:32:09,000
We'll work through the steps of these
436
00:32:09,850 --> 00:32:14,600
in much more detail than I will later in this quarter. Let's actually work out what this gradient descent rule is.
437
00:32:15,525 --> 00:32:18,250
So
438
00:32:19,000 --> 00:32:23,425
and I'll do this just for the case of, if we had only one training example. Okay,
439
00:32:24,300 --> 00:32:30,025
so in this case we need to work out the partial derivative with respect to the parameter theta_i of J of theta.
440
00:32:30,750 --> 00:32:35,775
If we have only one training example
441
00:32:36,500 --> 00:32:40,775
then J of theta is going to be one-half, h subscript theta
442
00:32:41,275 --> 00:32:42,950
of x minus y, squared.
443
00:32:43,500 --> 00:32:47,200
So if you have only one training example comprising one pair,
444
00:32:47,775 --> 00:32:50,375
X, Y, then this is what J of theta is going to be.
445
00:32:51,475 --> 00:32:54,375
And so taking derivatives,
446
00:32:55,050 --> 00:32:56,275
you have one-half
447
00:32:56,725 --> 00:32:58,900
something squared. So the two comes down.
448
00:32:59,325 --> 00:33:04,075
So you have two times one-half times h subscript theta of x minus y,
449
00:33:05,025 --> 00:33:06,425
and then by the
450
00:33:07,275 --> 00:33:10,175
chain rule for derivatives, we also
451
00:33:10,900 --> 00:33:14,700
must multiply this by the derivative of what's inside the square.
452
00:33:15,425 --> 00:33:18,700
453
00:33:19,475 --> 00:33:20,725
Right,
454
00:33:21,525 --> 00:33:26,325
the two and the one-half cancel.
455
00:33:28,500 --> 00:33:34,150
So this leaves h subscript theta of x minus y, times the derivative with respect to theta_i of what's inside the square: theta_0 x_0 plus
456
00:33:35,475 --> 00:33:39,350
theta_1 x_1 plus, dot dot dot, plus theta_n x_n, minus y.
457
00:33:40,625 --> 00:33:44,450
Okay, and if you look inside this sum,
458
00:33:45,575 --> 00:33:49,025
we're taking the partial derivative of this sum
459
00:33:49,500 --> 00:33:51,175
with respect to the parameter theta_i.
460
00:33:52,775 --> 00:33:59,025
But all the terms in the sum, except for one, do not depend on theta_i. In this sum,
461
00:33:59,575 --> 00:34:05,175
the only term that depends on theta_i will be this term here, theta_i x_i.
462
00:34:06,175 --> 00:34:09,150
And so when we take the partial derivative with respect to theta_i,
463
00:34:10,050 --> 00:34:13,600
take the partial derivative with respect to theta_i of
464
00:34:14,425 --> 00:34:19,500
this term theta_i, x_i, and so you get that
465
00:34:20,175 --> 00:34:22,275
times x_i.
466
00:34:25,150 --> 00:34:31,125
Okay, and so this gives us our learning rule, right,
467
00:34:31,650 --> 00:34:34,800
of theta_i gets updated as theta_i
468
00:34:35,600 --> 00:34:39,175
minus alpha times that.
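Carried through, the single-example derivation on the board gives the rule just stated:

```latex
% Derivative of J for one training example (x, y), and the resulting update
\frac{\partial}{\partial \theta_i} J(\theta)
  \;=\; \bigl( h_\theta(x) - y \bigr)\, x_i
\qquad\Longrightarrow\qquad
\theta_i \;:=\; \theta_i \;-\; \alpha \,\bigl( h_\theta(x) - y \bigr)\, x_i
```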
469
00:34:39,850 --> 00:34:43,450
470
00:34:46,125 --> 00:34:51,700
Okay, and this ...
471
00:34:53,500 --> 00:34:58,125
Greek letter alpha here is a parameter of the algorithm called the learning rate,
472
00:34:58,975 --> 00:35:02,150
and this parameter alpha controls
473
00:35:02,725 --> 00:35:07,325
how large a step you take. So you're standing on the hill. You decided
474
00:35:07,850 --> 00:35:10,800
what direction to take a step in, and so this parameter alpha
475
00:35:11,500 --> 00:35:16,625
controls how aggressive a step, how large a step, you take in this direction of steepest descent.
476
00:35:18,400 --> 00:35:19,700
And so if you
477
00:35:20,275 --> 00:35:23,550
and this is a parameter of the algorithm that's often set by hand.
478
00:35:24,075 --> 00:35:30,225
If you choose alpha to be too small, then your steepest descent algorithm will take very tiny steps and take a long time to converge.
479
00:35:30,825 --> 00:35:32,550
If alpha is too large then
480
00:35:32,925 --> 00:35:36,275
the steepest descent may actually end up overshooting the minimum,
481
00:35:37,125 --> 00:35:37,725
if you're taking too aggressive a step.
482
00:35:40,300 --> 00:35:42,775
Yeah? *Student:*[Inaudible].
483
00:35:46,700 --> 00:35:48,100
*Instructor (Andrew Ng)*:Say that again?
484
00:35:48,925 --> 00:35:52,100
*Student:*Isn't there a one over two missing somewhere?
485
00:35:53,375 --> 00:35:55,775
*Instructor (Andrew Ng)*:Is there a one-half missing? *Student:*I was [inaudible].
486
00:35:56,450 --> 00:36:00,100
*Instructor (Andrew Ng)*:Thanks. I do make lots of errors in that.
487
00:36:01,775 --> 00:36:03,100
Any questions about this?
488
00:36:05,750 --> 00:36:06,700
489
00:36:08,475 --> 00:36:13,850
490
00:36:14,825 --> 00:36:15,350
491
00:36:15,725 --> 00:36:20,175
492
00:36:22,650 --> 00:36:24,000
All right,
493
00:36:26,275 --> 00:36:27,425
so let me just
494
00:36:29,350 --> 00:36:35,100
wrap this up into an algorithm. So over there I derived the algorithm where you have just one training example,
495
00:36:36,275 --> 00:36:40,100
more generally for M training examples, gradient descent becomes the following.
496
00:36:40,375 --> 00:36:43,250
We're going to repeat
497
00:36:43,475 --> 00:36:44,725
until convergence
498
00:36:45,575 --> 00:36:48,375
the following step.
499
00:36:48,850 --> 00:36:53,175
Okay, theta_i gets updated as theta_i
500
00:36:54,175 --> 00:36:59,775
and I'm just writing out the appropriate equation for M examples rather than one example.
501
00:37:00,500 --> 00:37:03,150
Theta_i gets updated as theta_i minus alpha times
502
00:37:03,425 --> 00:37:04,850
the sum from j equals one to m.
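Written out in full, the m-example batch update being dictated here is:

```latex
% Batch gradient descent: repeat until convergence, updating every theta_i
\theta_i \;:=\; \theta_i \;-\; \alpha \sum_{j=1}^{m}
  \Bigl( h_\theta\bigl(x^{(j)}\bigr) - y^{(j)} \Bigr)\, x_i^{(j)}
```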
503
00:37:05,675 --> 00:37:06,450
504
00:37:06,925 --> 00:37:11,475
505
00:37:12,450 --> 00:37:13,550
506
00:37:14,700 --> 00:37:15,625
507
00:37:17,975 --> 00:37:18,625
508
00:37:20,075 --> 00:37:26,050
509
00:37:29,475 --> 00:37:30,075
Okay, and
510
00:37:30,775 --> 00:37:33,525
I won't bother to show it, but you can
511
00:37:34,100 --> 00:37:37,800
go home and sort of verify for yourself that this summation here,
512
00:37:38,250 --> 00:37:39,975
this is indeed
513
00:37:40,450 --> 00:37:46,325
the partial derivative with respect to theta_i of J of theta, where
514
00:37:46,800 --> 00:37:49,150
if you use the original definition of J of theta
515
00:37:49,775 --> 00:37:51,500
for when you have M training examples.
516
00:37:54,350 --> 00:38:00,475
Okay, so I'm just going to switch back to the laptop display, and I'm going to show you what this looks like when you run the algorithm.
517
00:38:00,900 --> 00:38:02,750
518
00:38:03,650 --> 00:38:05,350
So it turns out that
519
00:38:06,075 --> 00:38:10,400
for the specific problem of linear regression, or ordinary least squares, which is what we're doing today,
520
00:38:11,575 --> 00:38:17,650
the function J of theta actually does not look like the nasty one I showed you just now, with multiple local optima.
521
00:38:18,400 --> 00:38:19,650
In particular, it turns out
522
00:38:20,175 --> 00:38:25,475
for ordinary least squares, the function J of theta is just a quadratic function.
523
00:38:26,000 --> 00:38:28,525
And so it'll always have a nice bowl shape,
524
00:38:29,150 --> 00:38:34,250
like what you see up here, and only have one global minimum with no other local optima.
525
00:38:34,950 --> 00:38:35,925
So
526
00:38:36,775 --> 00:38:40,850
when you run gradient descent, here are actually the contours of the function J. So
527
00:38:41,350 --> 00:38:45,300
the contours of a bowl-shaped function like that are going to be ellipses,
528
00:38:45,950 --> 00:38:46,500
and
529
00:38:48,175 --> 00:38:53,425
if you run gradient descent on this algorithm, here's what you might get. Let's see, so I initialize the parameters.
530
00:38:53,725 --> 00:38:56,550
So let's say randomly at the position of that cross over there,
531
00:38:56,975 --> 00:38:58,925
right, that cross on the upper right.
532
00:38:59,625 --> 00:39:02,075
And so after one iteration of gradient descent,
533
00:39:03,150 --> 00:39:06,950
you move in the space of parameters,
534
00:39:08,100 --> 00:39:10,275
so if that's the result of one step of gradient descent,
535
00:39:11,075 --> 00:39:15,575
two steps, three steps, four steps, five steps, and so on, and it, you know,
536
00:39:16,550 --> 00:39:20,225
converges easily, rapidly to the global minimum of this function J of theta.
537
00:39:22,475 --> 00:39:26,275
Okay, and this is a property of ordinary least squares regression
538
00:39:26,750 --> 00:39:31,200
with a linear hypothesis class. The function J of theta has no local optima. Yes, question?
539
00:39:34,350 --> 00:39:35,225
*Student:*Is the alpha changing
540
00:39:35,825 --> 00:39:41,425
every time? Because the step is not [inaudible]. *Instructor (Andrew Ng)*:So it turns out that
541
00:39:42,425 --> 00:39:47,350
yes, so it turns out this was done with a fixed value of alpha,
542
00:39:47,825 --> 00:39:52,000
and one of the properties of gradient descent is that as you approach the local minimum,
543
00:39:52,925 --> 00:39:58,025
it actually takes smaller and smaller steps, so it'll converge. And the reason is, the update is
544
00:39:59,550 --> 00:40:04,175
you update theta by subtracting alpha times the gradient.
545
00:40:04,800 --> 00:40:07,100
And so as you approach the local minimum,
546
00:40:07,600 --> 00:40:12,225
the gradient also goes to zero. As you approach the local minimum,
547
00:40:12,850 --> 00:40:18,050
at the local minimum the gradient is zero, and as you approach the local minimum, the gradient also gets smaller and smaller.
548
00:40:18,750 --> 00:40:23,025
And so gradient descent will automatically take smaller and smaller steps
549
00:40:23,750 --> 00:40:24,500
as you approach the local minimum. Make sense?
550
00:40:29,850 --> 00:40:30,550
And
551
00:40:31,200 --> 00:40:36,125
here's the same plot here's actually a plot of the housing prices data.
552
00:40:37,450 --> 00:40:44,175
So here, let's say you initialize the parameters to the vector of all zeros, and so this blue line at the bottom
553
00:40:44,775 --> 00:40:49,500
shows the hypothesis with the parameters at initialization. So initially
554
00:40:50,050 --> 00:40:56,325
theta zero and theta one are both zero, and so your hypothesis predicts that all prices are equal to zero.
555
00:40:57,050 --> 00:40:59,525
After one iteration of gradient descent, that's
556
00:41:00,425 --> 00:41:02,475
the blue line you get. After two iterations,
557
00:41:03,750 --> 00:41:08,900
three, four, five, and after a few more iterations, excuse me, it converges,
558
00:41:09,625 --> 00:41:11,825
and you've now found the least square fit for the data.
559
00:41:19,175 --> 00:41:23,350
Okay, let's switch back to the chalkboard. Are there questions about this? Yeah?
560
00:41:28,000 --> 00:41:36,800
*Student:*[Inaudible] iteration, do we mean that we run each sample all the sample cases [inaudible] the new values? *Instructor (Andrew Ng)*:Yes, right.
561
00:41:37,025 --> 00:41:40,300
*Student:*And converged means that the value will be the same
562
00:41:40,625 --> 00:41:47,225
[inaudible] roughly the same?
563
00:41:48,225 --> 00:41:52,375
*Instructor (Andrew Ng)*:Yeah, so this is sort of a question of how do you test the convergence. And there's different ways
564
00:41:52,900 --> 00:41:55,025
of testing for convergence. One is you can look at
565
00:41:55,500 --> 00:41:57,775
two different iterations and see if theta has changed a lot,
566
00:41:58,075 --> 00:42:02,275
and if theta hasn't changed much within two iterations, you may say it's sort of more or less converged.
567
00:42:03,900 --> 00:42:08,925
Something that's done maybe slightly more often is look at the value of J of theta, and if J of theta
568
00:42:09,425 --> 00:42:14,025
if the quantity you're trying to minimize is not changing much anymore, then you might be
569
00:42:14,600 --> 00:42:16,750
inclined to believe it's converged. So these are sort of
570
00:42:17,475 --> 00:42:22,875
standard heuristics, or standard rules of thumb that are often used to decide if gradient descent has converged. Yeah?
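A minimal sketch of that second heuristic in Python; the function name and the tolerance value are illustrative choices, not anything from the lecture:

```python
def has_converged(J_prev, J_curr, tol=1e-6):
    # Declare convergence once the quantity being minimized, J(theta),
    # is no longer changing much between two successive iterations.
    return abs(J_prev - J_curr) < tol
```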
571
00:42:27,175 --> 00:42:28,000
572
00:42:28,450 --> 00:42:32,650
*Student:*I may have missed something, but especially in [inaudible] descent.
573
00:42:33,300 --> 00:42:35,050
574
00:42:35,625 --> 00:42:36,475
575
00:42:36,825 --> 00:42:42,975
So one feature [inaudible] curve and can either go this way or that way.
576
00:42:43,550 --> 00:42:48,075
But the math at incline [inaudible] where that comes in. When do you choose whether you go left,
577
00:42:48,525 --> 00:42:53,375
whether you're going this way or that way? *Instructor (Andrew Ng)*:I see. It just turns out that so the question is,
578
00:42:54,200 --> 00:42:59,000
how is gradient descent looking 360 degrees around you and choosing the direction of steepest descent?
579
00:42:59,575 --> 00:43:04,375
So it actually turns out I'm not sure I'll answer the second part, but it turns out that
580
00:43:04,900 --> 00:43:08,500
if you stand on the hill and if you ...
581
00:43:10,575 --> 00:43:14,625
it turns out that when you compute the gradient of the function, when you compute the derivative of the function,
582
00:43:15,025 --> 00:43:16,900
then it just turns out that that is indeed
583
00:43:17,175 --> 00:43:18,425
the direction of steepest descent.
584
00:43:18,850 --> 00:43:25,575
By the way, I just want to point out, you would never want to go in the opposite direction because the opposite direction would actually be the direction of steepest ascent,
585
00:43:26,100 --> 00:43:26,725
right.
586
00:43:27,550 --> 00:43:28,275
So as it turns out
587
00:43:29,450 --> 00:43:34,325
maybe the TAs can talk a bit more about this at the section if there's interest.
588
00:43:34,925 --> 00:43:38,500
It turns out, when you take the derivative of a function, the derivative of a function
589
00:43:39,075 --> 00:43:42,250
sort of turns out to just give you the direction of steepest descent.
590
00:43:43,200 --> 00:43:47,400
And so you don't explicitly look all 360 degrees around you.
591
00:43:47,950 --> 00:43:50,550
You sort of just compute the derivative and that turns out to be the direction of steepest descent.
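For reference, the fact being appealed to here: the gradient of J at theta collects all of its partial derivatives, and the direction of steepest descent is the negative gradient, so no explicit 360-degree search is needed:

```latex
% Direction of steepest descent at theta: the negative gradient of J
-\nabla_\theta J(\theta) \;=\;
  -\left( \frac{\partial J}{\partial \theta_0},\; \dots,\; \frac{\partial J}{\partial \theta_n} \right)
```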
592
00:43:54,350 --> 00:43:58,050
Yeah, maybe the TAs can talk a bit more about this on Friday.
593
00:44:01,425 --> 00:44:05,550
594
00:44:08,800 --> 00:44:12,075
595
00:44:13,800 --> 00:44:14,775
596
00:44:16,625 --> 00:44:18,250
597
00:44:20,575 --> 00:44:22,450
Okay, let's see, so
598
00:44:24,400 --> 00:44:26,600
let me go ahead and give this algorithm
599
00:44:27,100 --> 00:44:30,200
a specific name. So this algorithm here is actually called
600
00:44:30,575 --> 00:44:34,350
batch gradient descent,
601
00:44:35,075 --> 00:44:38,175
602
00:44:38,700 --> 00:44:43,300
and the term batch isn't a great term, but the term batch refers to the fact that
603
00:44:43,725 --> 00:44:46,025
on every step of gradient descent you're going to
604
00:44:46,375 --> 00:44:48,550
look at your entire training set. You're going to
605
00:44:49,025 --> 00:44:52,225
perform a sum over your M training examples.
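A minimal sketch of batch gradient descent in Python, matching the update above. The function name, the data layout (a leading column of ones implementing the x_0 = 1 convention), and the values of alpha and the iteration count are illustrative assumptions:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=500):
    """X: (m, n+1) array whose first column is all ones (the x_0 = 1 convention);
    y: (m,) array of target values. Returns the fitted parameter vector theta."""
    theta = np.zeros(X.shape[1])         # initialize at the vector of all zeros
    for _ in range(iters):               # "repeat until convergence"
        errors = X @ theta - y           # h_theta(x^(j)) - y^(j) for every example j
        theta -= alpha * (X.T @ errors)  # one step looks at the entire training set
    return theta
```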
606
00:44:54,525 --> 00:44:59,550
So it turns out that batch gradient descent often works very well. I use it very often, and
607
00:45:00,100 --> 00:45:05,075
it turns out that sometimes if you have a really, really large training set, imagine that
608
00:45:05,525 --> 00:45:10,775
instead of having 47 houses from Portland, Oregon in our training set, if you had, say, the U.S. Census Database or something,
609
00:45:12,125 --> 00:45:16,700
with U.S. census size databases you often have hundreds of thousands or millions of training examples.
610
00:45:18,825 --> 00:45:21,125
So if M is a few million
611
00:45:22,000 --> 00:45:27,250
then if you're running batch gradient descent, this means that to perform every step of gradient descent
612
00:45:27,925 --> 00:45:29,525
you need to perform a sum from
613
00:45:29,925 --> 00:45:31,225
J equals one to a million.
614
00:45:31,600 --> 00:45:35,850
That's sort of a lot of training examples that your computer program has to look at,
615
00:45:36,400 --> 00:45:39,700
before you can even take one step downhill on the function J of theta.
616
00:45:40,400 --> 00:45:41,175
So
617
00:45:41,875 --> 00:45:46,150
it turns out that when you have very large training sets,
618
00:45:46,750 --> 00:45:49,350
you may want to use an alternative algorithm
619
00:45:50,025 --> 00:45:54,800
that is called stochastic gradient descent.
620
00:45:55,775 --> 00:46:02,200
Sometimes I'll also call it incremental gradient descent,
621
00:46:02,975 --> 00:46:05,325
but the algorithm is as follows.
622
00:46:05,700 --> 00:46:09,750
Again, it will repeat until convergence
623
00:46:10,200 --> 00:46:12,575
624
00:46:13,375 --> 00:46:15,500
and will iterate for j equals one to m,
625
00:46:16,950 --> 00:46:17,825
626
00:46:19,125 --> 00:46:24,850
and will perform one of these sort of gradient descent updates
627
00:46:25,675 --> 00:46:28,100
using just the j-th training example.
628
00:46:29,175 --> 00:46:30,225
629
00:46:31,400 --> 00:46:32,275
630
00:46:33,875 --> 00:46:39,250
631
00:46:46,975 --> 00:46:51,800
632
00:46:53,925 --> 00:46:55,000
633
00:46:55,450 --> 00:47:01,550
Oh, and as usual, you really update all the parameters theta. You perform this update
634
00:47:01,825 --> 00:47:03,325
for all values of i.
635
00:47:05,025 --> 00:47:10,925
For i indexing into the parameter vector, you just perform this update for all of your parameters simultaneously.
636
00:47:13,575 --> 00:47:16,325
And the advantage of this algorithm is that
637
00:47:17,350 --> 00:47:18,050
in order to
638
00:47:18,475 --> 00:47:22,075
in order to start learning, in order to start modifying the parameters,
639
00:47:22,425 --> 00:47:26,825
you only need to look at your first training example. You look at your first training example
640
00:47:27,325 --> 00:47:28,675
and perform an update using
641
00:47:29,275 --> 00:47:32,700
the derivative of the error with respect to just your first training example,
642
00:47:33,150 --> 00:47:36,200
and then you look at your second training example and perform another update.
643
00:47:36,700 --> 00:47:39,225
And you sort of keep adapting your parameters much more quickly
644
00:47:40,125 --> 00:47:41,700
without needing to
645
00:47:42,300 --> 00:47:46,950
scan over your entire U.S. Census database before you can even start adapting parameters.
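For contrast, a matching sketch of stochastic (incremental) gradient descent, under the same assumed data layout as the batch version above; here each single example triggers an update:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, passes=10):
    """Same assumed layout as before: X is (m, n+1) with a leading column of
    ones, y is (m,). Takes one gradient step per training example instead of
    summing over all m examples before each step."""
    theta = np.zeros(X.shape[1])
    for _ in range(passes):                # "repeat until convergence"
        for j in range(X.shape[0]):        # "for j = 1 to m"
            error = X[j] @ theta - y[j]    # h_theta(x^(j)) - y^(j)
            theta -= alpha * error * X[j]  # update every theta_i simultaneously
    return theta
```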
646
00:47:49,675 --> 00:47:50,500
So
647
00:47:51,250 --> 00:47:56,925
let's see, for large data sets, stochastic gradient descent is often much faster,
648
00:47:57,650 --> 00:47:58,450
and
649
00:47:59,275 --> 00:48:05,575
what happens with stochastic gradient descent is that it won't actually converge to the global minimum exactly,
650
00:48:06,325 --> 00:48:11,900
but if these are the contours of your function,
651
00:48:12,725 --> 00:48:16,400
then as you run stochastic gradient descent, you sort of tend to wander around.
652
00:48:17,250 --> 00:48:21,375
And you may actually end up going uphill occasionally, but your parameters will sort of
653
00:48:22,050 --> 00:48:25,400
tend to wander to the region closest to the global minimum,
654
00:48:25,775 --> 00:48:28,200
but sort of keep wandering around a little bit near the region of the global minimum.
655
00:48:29,375 --> 00:48:30,675
And often that's just fine
656
00:48:32,175 --> 00:48:34,025
to have a parameter
657
00:48:35,125 --> 00:48:39,025
that wanders around a little bit near the global minimum.
658
00:48:40,175 --> 00:48:45,750
And in practice, this often works much faster than batch gradient descent, especially if you have a large training set.
659
00:48:48,850 --> 00:48:55,575
Okay,
660
00:48:55,925 --> 00:49:02,950
I'm going to clean a couple of boards. While I do that, why don't you take a look at the equations, and after I'm done cleaning the boards, I'll ask what questions you have.
661
00:49:06,575 --> 00:49:07,800
662
00:49:09,750 --> 00:49:11,000
663
00:49:14,125 --> 00:49:19,800
664
00:49:20,850 --> 00:49:23,150
665
00:49:28,275 --> 00:49:28,875
666
00:49:30,500 --> 00:49:36,300
667
00:49:36,675 --> 00:49:38,125
668
00:49:42,900 --> 00:49:49,000
Okay, so what questions do you have about all of this?
669
00:49:50,050 --> 00:49:51,625
*Student:*[Inaudible] is it true that you're just sort of rearranging the order in which
670
00:49:52,350 --> 00:49:53,475
you do the ...
671
00:49:53,775 --> 00:49:59,925
computation? So do you just use the first training example and update all of the theta i's and then step,
672
00:50:01,875 --> 00:50:07,350
and then update with the second training example, and update all the theta i's, and then step? And is that why you get sort of this really?
673
00:50:08,175 --> 00:50:14,575
*Instructor (Andrew Ng)*:Let's see, right. So I'm going to look at my first training example and then I'm going to take a step, and then I'm going to
674
00:50:15,975 --> 00:50:22,000
perform the second gradient descent update using my new parameter vector that has already been modified using my
675
00:50:22,275 --> 00:50:22,850
first training example.
676
00:50:23,900 --> 00:50:24,575
And then I keep going. Make sense? Yeah?
677
00:50:28,475 --> 00:50:29,825
*Student:*So in each update of all the theta_i's, you're only using ...
678
00:50:30,400 --> 00:50:34,875
*Instructor (Andrew Ng)*:One training example. *Student:*One training example.
679
00:50:36,875 --> 00:50:39,025
*Student:*[Inaudible]?
680
00:50:39,700 --> 00:50:43,150
*Instructor (Andrew Ng)*:Let's see, it's definitely empirical.
681
00:50:43,675 --> 00:50:47,100
I believe there's theory that sort of supports that as well.
682
00:50:48,500 --> 00:50:53,350
Yeah, the theory that supports that, the name of the theorem, I don't remember. Okay,
683
00:50:59,700 --> 00:51:03,150
cool. So
684
00:51:05,050 --> 00:51:06,750
in what I've done so far,
685
00:51:07,975 --> 00:51:10,700
I've talked about an iterative algorithm
686
00:51:11,925 --> 00:51:15,750
for performing this minimization in terms of J of theta.
687
00:51:16,925 --> 00:51:23,075
And it turns out that there's another way for this specific problem of least squares regression, of ordinary least squares.
688
00:51:23,600 --> 00:51:28,425
It turns out there's another way to perform this minimization of J of theta that allows you to
689
00:51:29,025 --> 00:51:35,300
solve for the parameters theta in closed form, without needing to run an iterative algorithm.
690
00:51:35,850 --> 00:51:40,575
And I know some of you may have seen some of what I'm about to do before, in
691
00:51:41,125 --> 00:51:44,500
like an undergraduate linear algebra course, and the way
692
00:51:45,500 --> 00:51:47,900
it's typically done requires
693
00:51:48,375 --> 00:51:52,800
orthogonal projections, or taking lots of derivatives and writing lots of algebra.
694
00:51:53,075 --> 00:51:55,100
What I'd like to do is
695
00:51:55,650 --> 00:51:58,425
show you a way to derive the closed form
696
00:51:58,825 --> 00:52:02,075
solution of theta in just a few lines of algebra.
697
00:52:02,550 --> 00:52:06,900
But to do that, I'll need to introduce a new notation for matrix derivatives,
698
00:52:07,575 --> 00:52:10,900
and it turns out that, sort of, the notation I'm about to
699
00:52:12,950 --> 00:52:13,650
define here
700
00:52:14,175 --> 00:52:19,500
just in my own personal work has turned out to be one of the most useful things that I actually use all the time,
701
00:52:19,950 --> 00:52:23,925
to have a notation for how to take derivatives with respect to matrices, so that you can
702
00:52:24,400 --> 00:52:30,025
solve for the minimum of J of theta with, like, a few lines of algebra rather than writing out pages and pages of matrices and derivatives.
703
00:52:31,350 --> 00:52:37,400
So then we're going to define this new notation first and then we'll go ahead and work out the minimization.
704
00:52:38,225 --> 00:52:40,075
Given a function J,
705
00:52:41,000 --> 00:52:44,400
since J is a function of a vector of parameters theta, right,
706
00:52:45,075 --> 00:52:45,950
I'm going to define
707
00:52:46,450 --> 00:52:49,650
the derivative, or the gradient, of J with respect to theta,
708
00:52:50,425 --> 00:52:52,350
as itself a vector.
709
00:52:53,125 --> 00:52:56,975
710
00:52:57,550 --> 00:52:59,375
711
00:53:02,750 --> 00:53:07,950
Okay, and so this is going to be an (n+1) dimensional vector.
712
00:53:08,600 --> 00:53:12,200
Theta is an (n+1) dimensional vector with indices ranging from zero to n.
713
00:53:12,875 --> 00:53:16,000
And so I'm going to define this derivative to be equal to that.
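[Board, reconstructed: $\nabla_\theta J = \left[ \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_n} \right]^T$, an $(n+1)$-dimensional vector of partial derivatives.]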
714
00:53:17,525 --> 00:53:22,875
Okay, and so we can actually rewrite
715
00:53:23,400 --> 00:53:27,900
the gradient descent algorithm as follows. This is batch gradient descent,
716
00:53:28,325 --> 00:53:34,350
and we write gradient descent as updating the parameter vector theta, notice there's no subscript i now,
717
00:53:35,125 --> 00:53:42,675
updating the parameter vector theta as the previous parameter minus alpha times the gradient.
718
00:53:44,775 --> 00:53:51,275
Okay, and so in this equation all of these quantities, theta,
719
00:53:52,050 --> 00:53:58,975
and this gradient vector, all of these are (n+1) dimensional vectors.
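[Board, reconstructed: in this vector notation, batch gradient descent is the single update $\theta := \theta - \alpha \nabla_\theta J(\theta)$.]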
720
00:54:00,475 --> 00:54:03,575
I was using the boards out of order, wasn't I?
721
00:54:04,375 --> 00:54:08,975
722
00:54:09,725 --> 00:54:13,875
723
00:54:14,425 --> 00:54:15,100
724
00:54:15,700 --> 00:54:17,900
725
00:54:20,800 --> 00:54:23,350
So more generally,
726
00:54:24,675 --> 00:54:27,300
if you have a function f
727
00:54:27,650 --> 00:54:31,125
that maps from the space of matrices A,
728
00:54:32,600 --> 00:54:33,925
729
00:54:36,650 --> 00:54:37,850
that maps from,
730
00:54:38,350 --> 00:54:42,475
say, the space of N by N matrices to the space of real numbers.
731
00:54:42,900 --> 00:54:46,525
So if you have a function, f of A,
732
00:54:47,000 --> 00:54:49,525
where A ...
733
00:54:50,025 --> 00:54:51,400
is an N by N matrix.
734
00:54:52,100 --> 00:54:56,600
So this function maps from matrices to real numbers; it's a function that takes a matrix as its input.
735
00:54:57,675 --> 00:55:01,550
Let me define the derivative of f
736
00:55:02,150 --> 00:55:03,425
with respect to the matrix A.
737
00:55:04,375 --> 00:55:08,100
Now, I'm just taking the gradient of f with respect to its input, which is the matrix.
738
00:55:08,950 --> 00:55:12,925
I'm going to define this itself to be a matrix.
739
00:55:13,350 --> 00:55:15,575
740
00:55:16,600 --> 00:55:21,550
741
00:55:22,600 --> 00:55:25,300
742
00:55:26,075 --> 00:55:31,200
743
00:55:32,250 --> 00:55:33,825
744
00:55:36,475 --> 00:55:42,500
Okay, so the derivative of f with respect to A is itself a matrix, and the matrix contains
745
00:55:43,475 --> 00:55:44,950
all the partial derivatives of f
746
00:55:45,950 --> 00:55:47,075
with respect to the elements of A.
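[Board, reconstructed: $\nabla_A f(A)$ is the matrix whose $(i,j)$ entry is $\frac{\partial f}{\partial A_{ij}}$.]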
747
00:55:58,575 --> 00:56:00,825
One more definition is
748
00:56:01,275 --> 00:56:05,875
if A is a square matrix, so if A
749
00:56:06,525 --> 00:56:10,750
is an n by n matrix, number of rows equals number of columns,
750
00:56:11,325 --> 00:56:13,900
let me define the trace of A
751
00:56:14,450 --> 00:56:17,325
to be equal to the sum of A's diagonal elements.
752
00:56:18,475 --> 00:56:24,850
So this is just the sum over i of A_{ii}.
753
00:56:26,300 --> 00:56:31,900
For those of you that haven't seen this sort of operator notation before, you can think of trace of A as
754
00:56:32,175 --> 00:56:35,175
the trace operator applied to the square matrix A,
755
00:56:35,600 --> 00:56:39,350
but it's more commonly written without the parentheses. So I usually write trace of A like this,
756
00:56:40,025 --> 00:56:41,650
and this just means the sum of diagonal elements.
757
00:56:44,175 --> 00:56:46,750
So
758
00:56:47,575 --> 00:56:48,400
759
00:56:49,225 --> 00:56:51,750
here are some facts about the trace operator and about derivatives,
760
00:56:52,125 --> 00:56:56,775
and I'm just going to write these without proof. You can also have the TAs prove some of them in the discussion section,
761
00:56:57,300 --> 00:57:00,575
or you can actually go home and
762
00:57:01,125 --> 00:57:02,550
verify the proofs of all of these.
763
00:57:03,550 --> 00:57:07,675
It turns out that given two matrices,
764
00:57:08,350 --> 00:57:14,350
A and B, the trace of the matrix A times B is equal to
765
00:57:14,875 --> 00:57:20,275
the trace of B, A. Okay, I'm not going to prove this, but you should be able to go home and prove this yourself
766
00:57:20,950 --> 00:57:21,925
without too much difficulty.
767
00:57:23,425 --> 00:57:24,725
And similarly,
768
00:57:25,100 --> 00:57:26,125
769
00:57:26,875 --> 00:57:31,375
the trace of a product of three matrices: you can take the matrix at the end and
770
00:57:31,725 --> 00:57:35,925
cyclically permute it to the front. So trace of A times B, times C,
771
00:57:36,275 --> 00:57:38,125
is equal to the trace of C, A, B.
772
00:57:38,775 --> 00:57:40,700
So take the matrix C at the back and move it to the front,
773
00:57:41,200 --> 00:57:44,550
and this is also equal to the trace of
774
00:57:45,050 --> 00:57:47,375
B, C, A. Take the matrix B and move it to the front. Okay,
775
00:57:54,850 --> 00:57:57,900
also,
776
00:57:58,750 --> 00:58:05,075
suppose you have a function f of A which is defined as a trace of A, B.
777
00:58:06,425 --> 00:58:07,150
Okay, so this is,
778
00:58:07,900 --> 00:58:14,600
right, the trace is a real number. So the trace of A, B is a function that takes as input the matrix A and outputs a real number.
779
00:58:16,250 --> 00:58:22,175
So then the derivative with respect to the matrix A of this function of trace A, B,
780
00:58:25,650 --> 00:58:26,550
781
00:58:27,050 --> 00:58:31,975
is going to be B transposed. And this is just another fact that you can
782
00:58:32,875 --> 00:58:36,525
prove by yourself by going back and referring to the definitions of traces and matrix derivatives.
783
00:58:37,000 --> 00:58:39,100
I'm not going to prove it. You should work it out.
784
00:58:40,875 --> 00:58:43,175
Okay, and lastly a couple of easy ones.
785
00:58:43,550 --> 00:58:48,925
The trace of A is equal to the trace of A transposed because the trace is just the sum of diagonal elements.
786
00:58:49,800 --> 00:58:54,675
And so if you transpose the matrix, the diagonal elements don't change.
787
00:58:55,450 --> 00:58:59,125
And if lower case A is a real number, then
788
00:58:59,825 --> 00:59:05,150
the trace of a real number is just itself. So think of a real number as a one by one matrix.
789
00:59:05,750 --> 00:59:09,475
So the trace of a one by one matrix is just whatever that real number is.
790
00:59:12,375 --> 00:59:13,975
791
00:59:15,575 --> 00:59:20,825
And lastly, this is a somewhat tricky one.
792
00:59:21,850 --> 00:59:23,275
793
00:59:24,850 --> 00:59:27,750
794
00:59:28,825 --> 00:59:33,525
The derivative with respect to the matrix A of the trace of A, B, A, transpose C
795
00:59:34,150 --> 00:59:36,125
is equal to C, A, B
796
00:59:36,625 --> 00:59:40,925
plus C transposed A,
797
00:59:41,625 --> 00:59:47,075
B transposed. And I won't prove that either. This is sort of just algebra. Work it out yourself.
798
00:59:48,500 --> 00:59:49,325
799
00:59:50,800 --> 00:59:53,325
800
00:59:55,325 --> 00:59:59,200
Okay, and so, I guess the key equation ...
801
01:00:01,325 --> 01:00:03,050
the key facts I'm going to use again
802
01:00:06,550 --> 01:00:08,550
about traces and matrix derivatives, are these five.
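[Board, reconstructed; the five facts as spoken: (1) $\operatorname{tr} AB = \operatorname{tr} BA$; (2) $\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$; (3) $\nabla_A \operatorname{tr} AB = B^T$; (4) $\operatorname{tr} A = \operatorname{tr} A^T$, and $\operatorname{tr} a = a$ for a real number $a$; (5) $\nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T$.]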
803
01:00:14,050 --> 01:00:14,975
Okay,
804
01:00:16,625 --> 01:00:17,150
so
805
01:00:18,225 --> 01:00:22,650
armed with these things I'm going to figure out
806
01:00:23,175 --> 01:00:24,675
let's try to come up with a quick derivation
807
01:00:25,450 --> 01:00:28,100
for how to minimize J of theta as a function of
808
01:00:28,550 --> 01:00:31,850
theta in closed form, and without needing to use an iterative algorithm.
809
01:00:35,850 --> 01:00:37,525
So let's work this out. Let me define
810
01:00:37,875 --> 01:00:41,450
the matrix X, which is called the design matrix,
811
01:00:41,975 --> 01:00:45,600
to be the matrix containing all the inputs from my training set.
812
01:00:47,125 --> 01:00:53,075
So x_1 was the vector of inputs, the vector of features, for my first training example.
813
01:00:53,700 --> 01:00:54,900
So I'm going to
814
01:00:56,675 --> 01:01:00,100
set x_1 transpose to be the first row of this matrix X,
815
01:01:00,900 --> 01:01:02,725
816
01:01:03,300 --> 01:01:04,300
817
01:01:04,850 --> 01:01:09,125
set my second training example, x_2 transpose, to be the second row, and so on.
818
01:01:09,350 --> 01:01:12,275
And I have M training examples, and so
819
01:01:13,600 --> 01:01:15,875
that's going to be my
820
01:01:16,925 --> 01:01:22,350
design matrix X. Okay, so that defines the matrix capital X, and so
821
01:01:24,500 --> 01:01:28,875
now, let me take this matrix X and multiply it by my parameter vector theta.
822
01:01:29,375 --> 01:01:33,250
This derivation will just take two or three steps. So X times theta
823
01:01:34,475 --> 01:01:40,950
remember how matrix vector multiplication goes. You take this vector and you multiply it by each of the rows of the matrix.
824
01:01:42,100 --> 01:01:44,725
So X times theta is just going to be
825
01:01:45,150 --> 01:01:48,550
x_1 transposed theta,
826
01:01:48,925 --> 01:01:51,225
dot, dot, dot, down to x_m,
827
01:01:53,975 --> 01:01:54,825
transposed theta.
828
01:01:55,725 --> 01:01:57,400
And this is, of course, just
829
01:01:58,125 --> 01:02:00,350
830
01:02:00,975 --> 01:02:06,800
831
01:02:07,650 --> 01:02:12,325
the predictions of your hypothesis on each of your M training examples.
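[Board, reconstructed: the design matrix $X$ has $x^{(i)T}$ as its $i$-th row, so $X\theta = \left[ x^{(1)T}\theta, \ldots, x^{(m)T}\theta \right]^T = \left[ h_\theta(x^{(1)}), \ldots, h_\theta(x^{(m)}) \right]^T$.]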
832
01:02:14,900 --> 01:02:15,650
833
01:02:16,100 --> 01:02:18,650
Let me also define the y vector
834
01:02:19,925 --> 01:02:24,500
to be the vector of all the target values y_1 through y_m
835
01:02:25,175 --> 01:02:29,500
in my training set. Okay, so y vector is an m dimensional vector.
836
01:02:31,250 --> 01:02:35,325
837
01:02:36,100 --> 01:02:38,000
838
01:02:39,425 --> 01:02:42,325
839
01:02:43,825 --> 01:02:45,325
840
01:02:48,525 --> 01:02:51,525
So
841
01:02:53,075 --> 01:02:55,100
X theta minus Y
842
01:02:55,675 --> 01:02:58,575
combined with the math from the previous board, this is going to be,
843
01:02:59,075 --> 01:03:02,750
844
01:03:03,475 --> 01:03:04,025
845
01:03:04,925 --> 01:03:10,425
846
01:03:14,875 --> 01:03:21,875
right, and now, X theta minus y, this is a vector. This is an m dimensional vector if you have m training examples,
847
01:03:22,925 --> 01:03:27,025
and so I'm actually going to take this vector and take this inner product with itself.
848
01:03:27,425 --> 01:03:32,200
Okay, so recall that if z is a vector
849
01:03:32,875 --> 01:03:36,450
then z transpose z is just the sum over i of
850
01:03:36,675 --> 01:03:40,325
z_i squared. Right, that's how you take the inner product of a vector with itself.
851
01:03:41,325 --> 01:03:44,600
So you want to take this vector, X theta minus Y,
852
01:03:44,950 --> 01:03:49,200
and take the inner product
853
01:03:49,650 --> 01:03:50,975
of this vector with itself,
854
01:03:51,925 --> 01:03:52,900
and so that gives me
855
01:03:54,625 --> 01:03:55,600
856
01:03:56,250 --> 01:03:57,425
857
01:03:57,850 --> 01:04:04,975
the sum from i equals one to m of (h_theta(x^(i)) - y) squared.
858
01:04:06,075 --> 01:04:11,000
Okay, since it's just the sum of the squares of the elements of this vector.
859
01:04:12,425 --> 01:04:14,175
And
860
01:04:14,900 --> 01:04:18,700
putting the one-half in front,
861
01:04:19,875 --> 01:04:23,625
this is our previous definition of J of theta. Okay, yeah?
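[Board, reconstructed: $J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.]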
862
01:04:30,025 --> 01:04:32,850
*Student:*[Inaudible]?
863
01:04:33,350 --> 01:04:36,750
*Instructor (Andrew Ng)*:Yeah, I threw a lot of notations at you today.
864
01:04:37,175 --> 01:04:39,700
So M is the number of training examples and
865
01:04:40,250 --> 01:04:42,875
the number of training examples runs from one through M,
866
01:04:44,000 --> 01:04:47,225
and then the feature vector indices run from zero through N. Does that make sense?
867
01:04:49,575 --> 01:04:52,775
So this is the sum from one through M.
868
01:04:53,450 --> 01:04:55,175
It's sort of
869
01:04:57,050 --> 01:05:02,875
theta transpose x, and that's equal to the sum from j equals zero through N of
870
01:05:03,250 --> 01:05:05,100
theta_j x_j. Does that make sense?
871
01:05:08,425 --> 01:05:10,775
It's the feature vectors that
872
01:05:11,200 --> 01:05:14,425
index from zero through N where X_0 is equal to one,
873
01:05:15,100 --> 01:05:16,950
whereas the training examples are actually indexed from one through M.
874
01:05:21,875 --> 01:05:26,625
So let me clean a few more boards and take another look at this, make sure it all makes sense.
875
01:05:27,700 --> 01:05:31,575
876
01:05:33,425 --> 01:05:37,525
877
01:05:39,100 --> 01:05:41,075
878
01:05:43,800 --> 01:05:46,250
879
01:05:46,650 --> 01:05:49,625
880
01:05:50,550 --> 01:05:55,875
881
01:05:58,425 --> 01:05:59,450
882
01:06:02,175 --> 01:06:03,550
Okay, yeah?
883
01:06:08,600 --> 01:06:11,950
*Student:*[Inaudible] the Y inside the parentheses, shouldn't that be [inaudible]? *Instructor (Andrew Ng)*:Oh, yes, thank you. Oh, is that what you meant?
884
01:06:13,800 --> 01:06:17,875
Yes, thank you. Great, i-th training example. Anything else? Cool.
885
01:06:26,850 --> 01:06:33,425
So we're actually nearly done with this derivation. We would like to minimize J of theta
886
01:06:34,300 --> 01:06:35,750
with respect to theta
887
01:06:36,500 --> 01:06:41,050
and we've written J of theta fairly compactly using this matrix vector notation.
888
01:06:42,750 --> 01:06:45,900
So in order to minimize J of theta with respect to theta,
889
01:06:47,425 --> 01:06:51,225
what we're going to do is take the derivative with respect to theta of J of theta,
890
01:06:52,375 --> 01:06:57,725
and set this to zero, and solve for theta. Okay,
891
01:07:02,150 --> 01:07:05,475
so we have
892
01:07:06,625 --> 01:07:11,500
derivative with respect to theta of
893
01:07:12,425 --> 01:07:18,675
that is equal to
894
01:07:21,075 --> 01:07:21,775
I should mention
895
01:07:22,025 --> 01:07:28,600
there will be some steps here that I'm just going to do fairly quickly without proof. So, is it really true that the derivative of half of that is half of the derivative?
896
01:07:29,000 --> 01:07:29,825
I just exchanged
897
01:07:30,425 --> 01:07:33,475
the derivative and the one-half. In terms of the answers, yes, but
898
01:07:34,000 --> 01:07:40,200
later on you should go home and look through the lecture notes and make sure you understand and believe why every step is correct.
899
01:07:40,625 --> 01:07:42,575
I'm going to do things relatively quickly here
900
01:07:43,025 --> 01:07:46,450
and you can work through every step yourself more slowly by referring to the lecture notes.
901
01:07:48,650 --> 01:07:53,325
Okay, so that's equal to ... I'm going to expand out this quadratic function now. So
902
01:07:53,725 --> 01:07:55,225
this is going to be,
903
01:07:56,125 --> 01:07:57,750
904
01:07:58,700 --> 01:08:01,950
905
01:08:02,875 --> 01:08:06,500
906
01:08:07,050 --> 01:08:10,050
okay,
907
01:08:11,150 --> 01:08:15,975
and this is just sort of taking a quadratic function and expanding it out by multiplying out the terms. And again,
908
01:08:16,600 --> 01:08:18,000
work through the steps later yourself
909
01:08:19,400 --> 01:08:22,650
if you're not quite sure how I did that. So
910
01:08:24,175 --> 01:08:27,200
this thing, this vector-vector product, right,
911
01:08:27,925 --> 01:08:32,475
this quantity here, this is just J of theta and so it's just a real number,
912
01:08:33,500 --> 01:08:35,100
and the trace of a real number is just itself. *Student:*[Inaudible].
913
01:08:38,225 --> 01:08:39,925
*Instructor (Andrew Ng)*:Oh, thanks, Dan.
914
01:08:41,275 --> 01:08:46,525
Cool, great. So this quantity in parentheses, this is J of theta and it's just a real number.
915
01:08:47,425 --> 01:08:53,175
And so the trace of a real number is just the same real number. And so you can sort of take a trace operator without changing anything.
916
01:08:57,625 --> 01:09:03,600
And this is equal to one-half derivative with respect to theta of the trace of
917
01:09:05,700 --> 01:09:09,650
by the second fact, the cyclic permutation property of the trace. You can take this theta at the end
918
01:09:10,325 --> 01:09:14,050
and move it to the front. So this is going to be trace of theta
919
01:09:15,000 --> 01:09:19,025
times theta transposed, X transpose X
920
01:09:22,250 --> 01:09:24,425
minus
921
01:09:26,125 --> 01:09:31,600
derivative with respect to theta of the trace of, again, I'm going to
922
01:09:32,325 --> 01:09:33,825
take that and bring it to the ...
923
01:09:35,475 --> 01:09:37,800
oh, sorry.
924
01:09:39,575 --> 01:09:45,875
Actually, this thing here is also a real number and the transpose of a real number is just itself. Right,
925
01:09:46,775 --> 01:09:49,600
so take the transpose of a real number without changing anything.
926
01:09:50,450 --> 01:09:52,550
So let me go ahead and just take the transpose of this.
927
01:09:53,025 --> 01:09:56,025
A real number transposed is just the same real number. So
928
01:09:57,200 --> 01:10:02,650
this is minus the trace of, taking the transpose of that, Y transpose X theta, then
929
01:10:04,400 --> 01:10:09,175
minus Y transpose X theta. Okay,
930
01:10:11,150 --> 01:10:15,100
and this last quantity, Y transpose Y. It doesn't actually depend on theta.
931
01:10:15,350 --> 01:10:18,550
So when I take the derivative of this last term with respect to theta, it's just zero. So just drop that term.
932
01:10:21,550 --> 01:10:26,125
And lastly,
933
01:10:29,550 --> 01:10:35,575
well, the derivative with respect to theta of the trace of theta, theta transposed,
934
01:10:37,125 --> 01:10:38,900
X transpose X.
935
01:10:39,450 --> 01:10:45,250
I'm going to use one of the facts I wrote down earlier without proof, and I'm going
936
01:10:46,400 --> 01:10:52,325
to let this be A. There's an identity matrix there, so this is A, B, A transpose
937
01:10:53,100 --> 01:10:58,350
C, and using a rule that I've written down previously that you'll find in lecture notes, because
938
01:10:59,025 --> 01:11:01,675
it's still on one of the boards that I had up previously,
939
01:11:02,825 --> 01:11:05,050
this is just equal to
940
01:11:06,125 --> 01:11:09,550
X transpose X theta.
941
01:11:10,925 --> 01:11:14,050
942
01:11:15,600 --> 01:11:20,150
So this is C,
943
01:11:21,150 --> 01:11:22,850
A,
944
01:11:24,075 --> 01:11:29,700
B, which is sort of just the identity matrix, which you can ignore, plus X transpose X
945
01:11:31,600 --> 01:11:36,575
theta where this is now C transpose A,
946
01:11:37,425 --> 01:11:41,425
again times the identity which we're going to ignore, times B transposed.
947
01:11:42,400 --> 01:11:45,375
And the matrix X transpose X is symmetric, so C transpose is equal to C.
948
01:11:49,350 --> 01:11:55,375
Similarly, the derivative with respect to theta of the trace of
949
01:11:56,475 --> 01:11:57,525
950
01:12:00,075 --> 01:12:02,200
Y transpose X theta,
951
01:12:03,325 --> 01:12:06,750
this is the ...
952
01:12:07,900 --> 01:12:08,600
953
01:12:09,400 --> 01:12:13,925
derivative with respect to matrix A of the trace of B, A
954
01:12:14,600 --> 01:12:19,425
and this is just X transpose Y. This is just
955
01:12:20,175 --> 01:12:23,675
B transposed, again, by one of the rules that I wrote down earlier.
956
01:12:26,150 --> 01:12:26,675
And so
957
01:12:27,550 --> 01:12:33,900
958
01:12:35,425 --> 01:12:39,825
959
01:12:41,500 --> 01:12:42,575
960
01:12:45,575 --> 01:12:48,100
if you plug this back in, we find, therefore, that
961
01:12:48,625 --> 01:12:50,750
the derivative ... wow, this board's really bad.
962
01:12:52,175 --> 01:12:57,125
963
01:12:58,450 --> 01:13:03,075
964
01:13:03,875 --> 01:13:05,250
965
01:13:06,250 --> 01:13:12,225
So if you plug this back into our formula for the derivative of J,
966
01:13:12,750 --> 01:13:17,475
you find that the derivative with respect to theta of J of theta is equal to
967
01:13:18,825 --> 01:13:23,100
one-half times (X transpose X theta,
968
01:13:23,775 --> 01:13:26,450
plus X transpose X theta,
969
01:13:27,500 --> 01:13:31,900
minus X transpose Y, minus X transpose Y),
970
01:13:32,825 --> 01:13:38,075
which is just X transpose X theta minus X transpose Y.
971
01:13:41,325 --> 01:13:43,250
972
01:13:44,175 --> 01:13:50,200
Okay, so we set this to zero and we get
973
01:13:51,400 --> 01:13:52,650
X transpose X theta equals X transpose Y,
974
01:13:53,575 --> 01:13:56,375
975
01:13:57,150 --> 01:14:01,550
976
01:14:02,425 --> 01:14:04,600
which is called the normal equation,
977
01:14:05,700 --> 01:14:07,250
and
978
01:14:08,100 --> 01:14:09,275
we can now solve this
979
01:14:10,050 --> 01:14:12,200
equation for theta
980
01:14:12,975 --> 01:14:18,075
981
01:14:19,575 --> 01:14:24,325
in closed form. That's theta equals X transpose X, inverse, times X transpose Y.
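[Board, reconstructed: setting $\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y} = 0$ gives the normal equation $X^T X \theta = X^T \vec{y}$, and hence $\theta = (X^T X)^{-1} X^T \vec{y}$.]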
982
01:14:24,925 --> 01:14:26,200
And so this gives us a way
983
01:14:29,175 --> 01:14:32,125
for solving for the least-squares fit to the parameters in closed form,
984
01:14:32,450 --> 01:14:34,375
without needing to use an iterative algorithm like gradient descent.
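[For reference, a minimal sketch of this closed-form solution in Python/NumPy; the function name normal_equation and the variables X, y are illustrative, not from the lecture. Solving the linear system is preferable to forming an explicit inverse, and the pseudo inverse covers the non-invertible case discussed just below.]

```python
import numpy as np

def normal_equation(X, y):
    """Return theta solving X^T X theta = X^T y, for design matrix X and targets y."""
    XtX = X.T @ X
    Xty = X.T @ y
    try:
        return np.linalg.solve(XtX, Xty)   # theta = (X^T X)^{-1} X^T y
    except np.linalg.LinAlgError:
        # X^T X singular: features are linearly dependent,
        # so fall back to the pseudo inverse.
        return np.linalg.pinv(XtX) @ Xty
```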
985
01:14:36,550 --> 01:14:41,100
Okay, and using this matrix vector notation, I think,
986
01:14:42,125 --> 01:14:48,075
I don't know, I think we did this whole thing in about ten minutes, which we couldn't have done if I was writing out reams of algebra.
987
01:14:52,300 --> 01:14:58,100
Okay, some of you look a little bit dazed, but this is our first learning algorithm. Aren't you excited?
988
01:14:59,550 --> 01:15:03,000
Any quick questions about this before we close for today?
989
01:15:06,900 --> 01:15:07,875
*Student:*[Inaudible].
990
01:15:08,500 --> 01:15:11,350
*Instructor (Andrew Ng)*:Say that again. *Student:*What you derived, wasn't that just [inaudible] of X?
991
01:15:11,575 --> 01:15:15,550
*Instructor (Andrew Ng)*:What inverse? *Student:*Pseudo inverse. *Instructor (Andrew Ng)*:Pseudo inverse? *Student:*Pseudo inverse.
992
01:15:17,250 --> 01:15:17,825
*Instructor (Andrew Ng)*:Yeah, I
993
01:15:18,200 --> 01:15:24,700
it turns out that in cases where X transpose X is not invertible, then you use the pseudo inverse to solve this minimization.
994
01:15:25,950 --> 01:15:31,150
But if it turns out X transpose X is not invertible, that usually means your features were linearly dependent.
995
01:15:31,750 --> 01:15:36,525
It usually means you did something like repeat the same feature twice in your training set.
996
01:15:37,175 --> 01:15:41,075
So if this is not invertible, it turns out the minimum is obtained by using the pseudo inverse in place of the inverse.
997
01:15:41,950 --> 01:15:45,425
If you don't know what I just said, don't worry about it. It usually won't be a problem.
998
01:15:46,525 --> 01:15:47,050
Anything else?
999
01:15:48,175 --> 01:15:49,125
*Student:*On the second board [inaudible]?
1000
01:15:49,700 --> 01:15:50,275
1001
01:15:50,575 --> 01:15:59,950
*Instructor (Andrew Ng)*:Let me take that off. We're running over. Let's close for today, and if there are other questions, I'll take them after.