This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem a little bit like I'm jumping from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture we talked about linear regression, and today I want to talk about, sort of, an adaptation of that called locally weighted regression. It's a very popular algorithm that's actually one of my former mentor's probably favorite machine learning algorithms. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to, again, later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So, to recap what we were talking about in the previous lecture, remember the notation I defined: I used X superscript I, Y superscript I to denote the Ith training example. And when we're talking about linear regression, or linear least squares, we use H of X to denote the value predicted by my hypothesis H on the input XI. And my hypothesis was parameterized by the vector of parameters theta, and so we said that this was equal to the sum over J of theta J times XJ, or, more compactly, theta transpose X. And we had the convention that X subscript zero is equal to one, so this accounts for the intercept term in our linear regression model. And lowercase n here was the notation I was using for the number of features in my training set. Okay?
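In standard notation, the hypothesis recapped above can be written (a reconstruction of the board notation) as:

    h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x, \qquad x_0 = 1.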
So in the example of trying to predict housing prices, we had two features: the size of the house and the number of bedrooms. We had two features, and little n was equal to two.

So, just to finish recapping the previous lecture, we defined this quadratic cost function J of theta equals one-half times the sum from I equals one to m of H theta of XI minus YI, squared, where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture we derived the value of theta that minimizes this in closed form, which was X transpose X, inverse, times X transpose Y. Okay?

So as we move on in today's lecture I'll continue to use this notation and, again, I realize this is a fair amount of notation to remember, so if partway through this lecture you're having trouble remembering what lowercase m is or what lowercase n is or something, please raise your hand and ask.

When we talked about linear regression last time we used two features. One of the features was the size of the house in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how it actually does. So just for example, the choice we made last time was X1 equals the size. And let's leave aside the feature of the number of bedrooms for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually - let's draw this out. So say that axis is the size of the house and that's the price of the house. So if you use this as a feature maybe you get theta zero plus theta one X1, this, sort of, linear model.
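To make that recap concrete, here is a minimal sketch of fitting that straight line with the closed-form solution above, theta equals X transpose X, inverse, times X transpose Y. The tiny data set and the NumPy code are purely illustrative assumptions, not the Portland data from the lecture:

    # Minimal sketch: closed-form (normal equation) fit of price ~ theta0 + theta1 * size.
    # The five (size, price) pairs below are made up for illustration only.
    import numpy as np

    sizes = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])   # living area, sq ft
    prices = np.array([400.0, 330.0, 369.0, 232.0, 540.0])       # price, in $1000s

    X = np.column_stack([np.ones_like(sizes), sizes])  # x_0 = 1 handles the intercept term
    y = prices

    # theta = (X^T X)^{-1} X^T y, computed with solve() rather than an explicit inverse
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print("theta:", theta)
    print("prediction for an 1800 sq ft house:", np.array([1.0, 1800.0]) @ theta)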
If you choose - let me just copy the same data set over, right? You can define a set of features where X1 is equal to the size of the house and X2 is the square of the size of the house. Okay? So X1 is the size of the house in, say, square footage, and X2 is just whatever the square footage of the house is, squared. That would be another way to come up with a feature, and if you do that then the same algorithm will end up fitting a quadratic function for you: theta zero plus theta one X1 plus theta two X1 squared. Okay? Because this is actually X2. And depending on what the data looks like, maybe this is a slightly better fit to the data.

You can actually take this even further, right? Let's see - I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can fit a model theta zero plus theta one X plus theta two X squared plus, up to, theta six X to the power of six, a sixth-order polynomial, to these seven data points. And if you do that you find that you come up with a model that fits your data exactly. In this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly, and you'll probably find that the curve you get will look something like that.

And on the one hand, this is a great model in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model in the sense that none of us seriously thinks this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. Of the models we have here, I feel like maybe the quadratic model fits the data best.
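To illustrate the feature-choice point numerically, here is a rough sketch (again with made-up numbers, not the seven houses on the board) showing that the same least-squares machinery fits a line, a quadratic, or a sixth-order polynomial, depending only on which powers of x we hand it as features:

    # Sketch: least-squares fits with features x, x^2, ..., x^k for k = 1, 2, 6.
    # Seven invented (size, price) points; the degree-6 fit passes through all of them.
    import numpy as np

    sizes = np.array([1000., 1400., 1800., 2200., 2600., 3000., 3400.])
    prices = np.array([200., 260., 290., 350., 360., 420., 430.])

    def fit_poly(x, y, degree):
        x = x / x.max()                                   # rescale for numerical conditioning
        X = np.vander(x, degree + 1, increasing=True)     # columns 1, x, x^2, ..., x^degree
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares solution for theta
        return theta

    for degree in (1, 2, 6):
        print(degree, np.round(fit_poly(sizes, prices, degree), 2))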
The linear model, on the other hand - it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing. So we'll actually come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features.

Just to give these a name, we call this the problem of underfitting and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit. And this problem here we refer to as overfitting and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of the specific data set, right? It just so happens that, of the seven houses we sampled in Portland, or wherever you collect data from, that house happens to be a bit more expensive, that house happened to be a bit less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set rather than the true underlying trend of how housing prices vary as a function of the size of the house. Okay? So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features. So if you want to address this in a learning problem there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate, somewhat, the need for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.
And just to define the term: linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is defined as an algorithm that has a fixed number of parameters that are fit to the data. Okay? So in linear regression we have a fixed set of parameters theta, right, that must be fit to the data.

In contrast, what I'm going to talk about now is our first non-parametric learning algorithm. The formal definition is not very intuitive, so I'll also give a second, say, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set; usually it's defined as the number of parameters growing linearly with the size of the training set. This is the formal definition. A slightly less formal definition is that the amount of stuff your learning algorithm needs to keep around will grow linearly with the training set or, another way of saying it, this is an algorithm that will need to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition. But what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression, which also goes by a couple of other names - which also goes by the name of Loess, for historical reasons. Loess is usually spelled L-O-E-S-S, sometimes spelled like that, too. I just call it locally weighted regression.

So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So for my motivating example, let's say that I have a training set that looks like this, okay? So this is X and that's Y.
If you run linear regression on this and you fit maybe a linear function to it, you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether the features are right. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as X plus X squared plus maybe some function of sine of X or something. You can actually sit around and fiddle with features, and after a while you can probably come up with a set of features for which the model is okay, but let's talk about an algorithm that you can use without needing to do that.

So suppose you want to evaluate your hypothesis H at a certain query point, lowercase x. Okay? And let's say you want to know what's the predicted value of Y at this position of X, right? So for linear regression, what we were doing was we would fit theta to minimize the sum over I of YI minus theta transpose XI, squared, and return theta transpose X. Okay? So that was linear regression.

In contrast, in locally weighted linear regression you're going to do things slightly differently. You're going to look at this point X, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of X. Okay? So I'll look at where I want to evaluate my hypothesis - I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis - and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using the term "subset" loosely here - well, let's come back to that later. So I take this data set and I fit a straight line to it and maybe I get a straight line like that.
And what I'll do is then evaluate this straight line at that particular value, and that will be the value my algorithm returns. This will be the predicted value - this will be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize the sum over I of WI times YI minus theta transpose XI, squared, where these terms W superscript I are called weights. There are many possible choices for the weights; I'm just going to write one down. So this is E to the minus, XI minus X, squared, over two.

So let's look at what these weights really are, right? So notice that - suppose you have a training example XI that is very close to X, so that this is small. Then if XI minus X is small, so if XI minus X is close to zero, this is E to the minus zero, and E to the zero is one. So if XI is close to X, then WI will be close to one. In other words, the weight associated with the Ith training example will be close to one if XI and X are close to each other. Conversely, if XI minus X is large then - I don't know, what would WI be? Zero. Zero, right. Close to zero. Right. So if XI is very far from X then this is E to the minus of some large number, and E to the minus some large number will be close to zero. Okay?

So the picture is: if I'm querying at a certain point X, shown on the X axis, and if my data set, say, looks like that, then I'm going to give the points close to this a large weight and give the points far away a small weight. So for the points that are far away, WI will be close to zero, and so the points that are far away will not contribute much at all to this summation, right?
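Written out (reconstructing what is on the board; the bandwidth parameter comes a little later), locally weighted regression fits

    \min_\theta \; \sum_{i=1}^{m} w^{(i)} \big( y^{(i)} - \theta^T x^{(i)} \big)^2,
    \qquad
    w^{(i)} = \exp\!\left( -\frac{\big(x^{(i)} - x\big)^2}{2} \right),

and then returns theta transpose x at the query point x.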
So in effect this is the sum over I of roughly one times the squared term for the nearby points, plus roughly zero times the squared term for the faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters theta, paying much more attention to fitting the points close by accurately, while ignoring the contribution from faraway points. Okay? Yeah?

Student: Why is the weight exponential [inaudible]?

Instructor: Yeah. Let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to choose different choices by default, and there is somewhat of a literature debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible] is?

For those of you who are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution, okay? But this actually has absolutely nothing to do with the Gaussian distribution. So it's not that we're assuming XI is Gaussian or anything like that; there is no such interpretation. This is just a convenient function that happens to be bell-shaped, so don't endow it with any Gaussian semantics. Okay?

So, in fact - well, if you remember the familiar bell-shaped Gaussian - the way of assigning weights to these points is that, if you imagine putting a bell-shaped bump centered around the position where you want to evaluate your hypothesis H, then the weight I'll give to this point here is proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point.
And the weight given to this point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.

One last small generalization of this is that normally there's one other parameter to this algorithm, which I'll denote as tau. Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; it's just a convenient form of function. This parameter tau is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Okay? So let me just copy my diagram from the other side, I guess. So if tau is very small, and that's a query X, then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points that are far away fall off rapidly. Whereas if tau is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?

So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis outputs at a point like this, you end up fitting a straight line and making that prediction. To ask what the hypothesis outputs at that other value, you fit a straight line there and you predict that value. It turns out that every time you vary your query point - every time you ask your learning algorithm to make a prediction for how much a new house costs or whatever - you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of X, the position of the query where you're trying to make a prediction. Okay?
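Putting the pieces together, here is a minimal sketch of that procedure for one-dimensional inputs, with the bell-shaped weights written with the bandwidth tau in the exponent's denominator. The data, the value of tau, and the code itself are illustrative assumptions, not the lecture's own implementation:

    # Sketch: locally weighted linear regression, refit from scratch at each query point.
    import numpy as np

    def lwr_predict(x_query, x_train, y_train, tau=0.8):
        # w^(i) = exp(-(x^(i) - x_query)^2 / (2 * tau^2)); tau controls how fast weights fall off
        w = np.exp(-(x_train - x_query) ** 2 / (2.0 * tau ** 2))
        X = np.column_stack([np.ones_like(x_train), x_train])   # intercept term plus x
        W = np.diag(w)
        # weighted normal equations: theta = (X^T W X)^{-1} X^T W y
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
        return np.array([1.0, x_query]) @ theta                 # evaluate the local line at the query

    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 10.0, 60)
    y_train = np.sin(x_train) + 0.1 * rng.standard_normal(60)   # made-up non-linear data

    print(lwr_predict(2.0, x_train, y_train))   # roughly sin(2.0)
    print(lwr_predict(7.5, x_train, y_train))   # roughly sin(7.5)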
If you do this for every point along the X axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay? So in the problem set we're actually going to let you play around more with this algorithm, so I won't say too much more about it here. But to finally move on to the next topic, let me check what questions you have. Yeah?

Student: It seems like you still have the same problem of overfitting and underfitting, like when you choose tau. Like, if you make it too small in your -

Instructor: Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. What you just said about - some of these things I'll leave you to discover for yourself in the homework problem. You'll actually see what you just mentioned. Yeah?

Student: It almost seems like you're not even thoroughly [inaudible] with this locally weighted - you had all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points.

Instructor: Right. So the question is, sort of - it's almost as if you're not building a model, because you need the entire data set. And the other way of saying that is that this is a non-parametric learning algorithm. So this - I don't know, I won't debate whether, you know, we're really building a model or not. But this is a perfectly fine - so I think, when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model. So it actually - we've actually used this quite successfully to model, sort of, the dynamics of this autonomous helicopter. Yeah?

Student: Does this algorithm learn the weights based on the data?

Instructor: Learn what weights?

Student: Oh, the weights WI, instead of using [inaudible].

Instructor: I see, yes.
So it turns out there are a few things you can do. One thing that is quite common is how to choose this bandwidth parameter tau, right, using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question.

Student: [inaudible] Gaussian sometimes if you [inaudible] Gaussian and then -

Instructor: Oh, I guess. Let's see. Boy. The weights are not random variables and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights; it's actually fine to choose a function that doesn't even integrate to one, that integrates to infinity, say, as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay? That's the last question, and let's move on.

Student: Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict [inaudible] for each house, and so should the end result for each input - I'm seeing this very constantly for -

Instructor: Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again. So you're actually right: if you have a very large training set, then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets as well, but that's not something I want to talk about today. If you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on.
Let's take more questions later. So, okay, so that's locally weighted regression.

Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And in particular it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm.

So let's put aside locally weighted regression for now; we'll just talk about ordinary, unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the observed values Y? So why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay? It turns out that there are many sets of assumptions that are sufficient to justify why we do least squares, and this is just one of them. So it's not that, just because I present one set of assumptions under which least squares regression makes sense, this is the only set of assumptions; even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will, sort of, help give one rationalization, like, one reason, for doing least squares regression.

And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics. So let's assume, in our example of predicting housing prices, that the price the house is sold for, YI, is equal to some linear function of the features, theta transpose XI, plus some term epsilon I. Okay? And epsilon I will be our error term.
You can think of the error term as capturing unmodeled effects - like maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever, additional features that we just failed to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both of these: unmodeled effects, things we forgot to model, maybe the function isn't quite linear or something, as well as random noise - like maybe that day the seller was in a really bad mood and just refused to sell at a reasonable price or something.

And now I will assume that the errors have a probability distribution. I'll assume that the errors epsilon I are distributed - I'll write this as epsilon I is distributed according to a Gaussian distribution with mean zero and variance sigma squared. Okay? So the script N here stands for normal, right, to denote a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared.

Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you. Great, almost everyone. So, in other words, the density for the Gaussian is what you've seen before: the density for epsilon I would be one over root two pi sigma, times E to the negative epsilon I squared over two sigma squared, right? And the density of our epsilon I will be this bell-shaped curve, with one standard deviation being, sort of, sigma. Okay? This is the formula for that bell-shaped curve. So, let's see, I can erase that. Can I erase the board?
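For reference, the density just written on the board, and the conditional density it implies for the next step, are (in standard notation):

    \varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2), \qquad
    p\big(\varepsilon^{(i)}\big) = \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left( -\frac{\big(\varepsilon^{(i)}\big)^2}{2\sigma^2} \right),

    p\big( y^{(i)} \mid x^{(i)}; \theta \big) = \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left( -\frac{\big( y^{(i)} - \theta^T x^{(i)} \big)^2}{2\sigma^2} \right).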
So this implies that the probability distribution of the price of a house, given XI and the parameters theta, is going to be Gaussian with that density. Okay? In other words, this is saying that the price of a house, given the features of the house and my parameters theta, is going to be a random variable that's distributed Gaussian with mean theta transpose XI and variance sigma squared, right? Because we imagine that the way the housing prices are generated is that the price of a house is equal to theta transpose XI plus some random Gaussian noise with variance sigma squared. So the price of a house is going to have mean theta transpose XI and variance sigma squared, right? Does this make sense? Raise your hand if this makes sense. Yeah, okay, lots of you.

One point of notation - oh, yes?

Student: Assuming we don't know anything about the error, why do you assume here that the error is Gaussian?

Instructor: Right. So, boy, why do I say the error is Gaussian? Two reasons, right? One is that it turns out to be mathematically convenient to do so, and the other is, I don't know, I can also mumble about justifications such as the central limit theorem. It turns out that, for the vast majority of problems, if you apply a linear regression model like this and try to measure the distribution of the errors, not all the time but very often you find that the errors really are Gaussian - that this Gaussian model is a good assumption for the error in regression problems like these. Some of you may have heard of the central limit theorem, which says that the sum of many independent random variables will tend towards a Gaussian.
So if the error is caused by many effects, like the mood of the seller, the mood of the buyer, some other features that we missed, whether the place has a garden or not, and if all of these effects are independent, then by the central limit theorem you might be inclined to believe that the sum of all these effects will be approximately Gaussian. In practice, I guess, the two real answers are that, one, in practice this is actually a reasonably accurate assumption, and, two, it turns out to be mathematically convenient to do so. Okay? Yeah?

Student: It seems like we're saying that if we assume the error around our model has zero mean, then the error is centered around our model. It seems almost like we're trying to assume what we're trying to prove.

Instructor: That's the [inaudible], but, yes, you are assuming that the error has zero mean. Which is, yeah, right. I think later this quarter we'll get to some of the other things, but for now just think of this as a mathematically - it's actually not an unreasonable assumption. I guess, in machine learning, almost none of the assumptions we make are true in an absolute sense, right? Because, for instance, housing prices are priced in dollars and cents, so the errors in prices are not continuous-valued random variables, because houses can only be priced at a certain number of dollars and a certain number of cents and you never have fractions of cents in housing prices, whereas a Gaussian random variable would. So in that sense the assumptions we make are never "absolutely true," but for practical purposes this is an accurate enough assumption that it'll be useful to make. Okay? I think in a week or two we'll actually come back and say a bit more about the assumptions we make and when they help our learning algorithms and when they hurt them. We'll say a bit more about it when we talk about generative and discriminative learning algorithms, like, in a week or two. Okay?
So let me point out one bit of notation, which is that when I wrote this down I actually wrote P of YI given XI and then semicolon theta, and I'm going to use this notation when we are not thinking of theta as a random variable. So in statistics this is sometimes called the frequentist point of view, where you think of there as being some, sort of, true value of theta that's out there that's generating the data, but we don't know what theta is. Theta is not a random variable, right? It's not like there's some random value of theta out there; it's that there's some true value of theta out there, it's just that we don't know what the true value of theta is. So if theta is not a random variable, then I'm going to avoid writing P of YI given XI comma theta, because that would mean the probability of YI conditioned on X and theta, and you can only condition on random variables. So in this part of the class, where we're taking, sort of, the frequentist viewpoint rather than the Bayesian viewpoint, we're thinking of theta not as a random variable but just as something we're trying to estimate, and we use the semicolon notation. So the way to read this is: this is the probability of YI given XI and parameterized by theta. Okay? So you read the semicolon as "parameterized by." And in the same way here, I'll say YI given XI, parameterized by theta, is distributed Gaussian with that density. All right.

So we're going to make one more assumption. Let's assume that the error terms are IID, okay? Which stands for independently and identically distributed. So I'm going to assume that the error terms are independent of each other, right? The identically distributed part just means that they all come from the same Gaussian distribution with the same variance, but the more important part is that I'm assuming the epsilon I's are independent of each other.

Now, let's talk about how to fit a model.
The probability of Y given X, parameterized by theta - I'm actually going to give this another name. I'm going to write this down and call this the likelihood of theta: the likelihood of theta is the probability of Y given X, parameterized by theta. And so this is going to be the product over my training set, like that, which is, in turn, going to be a product of those Gaussian densities that I wrote down just now, right? Okay?

So in terms of notation, I guess, I define this term here to be the likelihood of theta, and the likelihood of theta is just the probability of the data Y, right, given X and parameterized by theta. The terms likelihood and probability are often confused. So the likelihood of theta is the same thing as the probability of the data you saw, so likelihood and probability are, sort of, the same thing, except that when I use the term likelihood I'm trying to emphasize that I'm taking this thing and viewing it as a function of theta. Okay? So likelihood and probability are really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, I then call it the likelihood. Okay? So hopefully you'll hear me say "the likelihood of the parameters" and "the probability of the data," right, rather than "the likelihood of the data" or "the probability of the parameters." So try to be consistent in that terminology.

So given that the probability of the data is this, and this is also the likelihood of the parameters, how do you estimate the parameters theta? So given a training set, what parameters theta do you want to choose for your model? Well, the principle of maximum likelihood estimation says, right, that you choose the value of theta that makes the data as probable as possible. So choose theta to maximize the likelihood, or, in other words, choose the parameters that make the data as probable as possible, right? So this is maximum likelihood estimation.
So we choose the parameters that make it as likely, as probable, as possible for us to have seen the data we actually did. For mathematical convenience, let me define lowercase l of theta; this is called the log likelihood function, and it's just the log of capital L of theta. So this is the log of that product over i of one over root two pi sigma times e to that exponent; I won't bother to write out what's in the exponent for now, it's just the expression from the previous board. The log of a product is the same as the sum of the logs, so this is a sum of logs, which simplifies to m times the log of one over root two pi sigma, plus - and then the log and the exponential cancel each other, right? The log of e to something is just whatever's inside the exponent. So, you know what, let me write this on the next board.

Okay. So maximizing the likelihood, or maximizing the log likelihood, is the same as minimizing that term over there. You see it, right? Because of the minus sign, maximizing this is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta. So what we've just shown is that the ordinary least squares algorithm that we worked out in the previous lecture is just maximum likelihood, assuming this probabilistic model, assuming IID Gaussian errors on our data. Okay?
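(For reference, the simplification just walked through works out to the following, again a reconstruction in the same notation:)

$$
\ell(\theta) \;=\; \log L(\theta)
\;=\; m\,\log\frac{1}{\sqrt{2\pi}\,\sigma}
\;-\; \frac{1}{\sigma^{2}}\cdot\frac{1}{2}\sum_{i=1}^{m}\bigl(y^{(i)}-\theta^{T}x^{(i)}\bigr)^{2}.
$$

The first term does not depend on theta, so maximizing the log likelihood over theta is exactly minimizing the quadratic cost J of theta, one-half the sum of (theta transpose x(i) minus y(i)) squared.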
One thing that we'll actually use in the next lecture: notice that the value of sigma squared doesn't matter. No matter what sigma squared is - it has to be a positive number, since it's the variance of a Gaussian - the value of theta we end up with will be the same, because minimizing this gives you the same theta regardless of sigma squared. So it's as if, in this model, the value of sigma squared doesn't really matter. Just remember that for the next lecture; we'll come back to this again.

Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah? [A student asks whether judging theta by the likelihood alone is enough, since it also depends on the family of models and the number of parameters, so couldn't it overfit?] Yeah, I mean, you're asking about overfitting and whether this is a good model. I think the things you're mentioning are deeper questions about learning algorithms that we'll come back to later, so I don't really want to get into that right now. Any more questions? Okay.

So this endows linear regression with a probabilistic interpretation, and I'm actually going to use this probabilistic interpretation to derive our next learning algorithm, which will be our first classification algorithm. You'll recall that I said regression problems are ones where the variable Y that you're trying to predict is continuous-valued. Now I'm going to talk about our first classification problem, where the value Y you're trying to predict is discrete: it can take on only a small number of discrete values. In this case I'll talk about binary classification, where Y takes on only two values. You run into classification problems if you're trying to do, say, medical diagnosis, and want to decide, based on some features, whether a patient has a disease or does not have a disease. Or, in the housing example, maybe you're trying to decide whether this house will sell in the next six months or not, and the answer is either yes or no.
It'll either be sold in the next six months or it won't be. Other standard examples: if you want to build a spam filter, is this e-mail spam or not? It's yes or no. Or, you know, some of my colleagues work on predicting whether a computer system will crash, so you have a learning algorithm to predict: will this computing cluster crash over the next 24 hours? And, again, it's a yes-or-no answer.

So there's X, there's Y, and in a classification problem Y takes on two values, zero and one; that's binary classification. So what can you do? Well, one thing you could do is take linear regression, as we've described it so far, and apply it to this problem. Given this data set, you can fit a straight line to it; maybe you get that straight line there. But this data set I've drawn is an amazingly easy classification problem: it's pretty obvious to all of us what the relationship between X and Y is. You just look at a value around here; to the right of it Y is one, to the left of it Y is zero. So you apply linear regression to this data set, you get a reasonable fit, and you can then take your linear regression hypothesis, this straight line, and threshold it at 0.5. If you do that, you'll certainly get the right answer: you predict that if X is to the right of, sort of, the midpoint here then Y is one, and if it's to the left of that midpoint then Y is zero. So some people actually do this, apply linear regression to classification problems, and sometimes it'll work okay, but in general it's actually a pretty bad idea to apply linear regression to classification problems like these, and here's why. Let's say I change my training set by giving you just one more training example, all the way up there. Given this training set, it's actually still entirely obvious what the relationship between X and Y is: above some value Y is one, and below it Y is zero.
Giving you this additional training example really shouldn't change anything; I didn't really convey much new information, and there's no surprise that this point corresponds to Y equals one. But if you now fit linear regression to this data set, you end up with a line that, I don't know, maybe looks like that, and now the predictions of your hypothesis have changed completely if you threshold your hypothesis at 0.5. Okay? [A student asks whether, with that far-off point, there might now be an interval in between where the prediction is zero.] Oh, you mean like that? Right, yeah, sure, with a theta set like that. Yes, you're right, but this is an example, and for this example it works. [Another student asks whether adding more far-off points would change it even more.] Yeah, I think that would actually make it even worse; you'd get a line that's pulled out even further. So this is my example, and I get to make it whatever I want, right? But there's no deep meaning to this. The point is just that it can be a really bad idea to apply linear regression to a classification problem. Sometimes it works fine, but usually I wouldn't do it.
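(To illustrate that point numerically, here is a small sketch, my own rather than the lecture's, that fits least squares to binary labels, thresholds at 0.5, and shows how a single far-out positive example shifts the decision boundary. The data are made up.)

```python
import numpy as np

def fit_least_squares(x, y):
    """Closed-form least squares with intercept: theta = (X^T X)^{-1} X^T y."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

def boundary(theta):
    """The x where theta_0 + theta_1 * x = 0.5, i.e. the 0.5-threshold boundary."""
    return (0.5 - theta[0]) / theta[1]

# An easy 1-D classification problem: y = 0 on the left, y = 1 on the right.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1  ], dtype=float)
print("boundary without outlier:", boundary(fit_least_squares(x, y)))   # 5.0

# Add one more positive example far to the right; the label is no surprise,
# but the fitted line tilts and the 0.5 crossing moves noticeably (~6.2),
# so a point like x = 6 would now be classified as 0.
x2 = np.append(x, 30.0)
y2 = np.append(y, 1.0)
print("boundary with outlier:   ", boundary(fit_least_squares(x2, y2)))
```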
So what do you want to do for classification? If you know the value of Y lies between zero and one, then to kind of fix this problem let's start by changing the form of our hypothesis so that my hypothesis always lies in the unit interval between zero and one. Okay? If I know Y is either zero or one, then let's at least not have my hypothesis predict values much larger than one or much smaller than zero. So instead of choosing a linear function for my hypothesis, I'm going to choose something slightly different. In particular, I'm going to choose this function: H subscript theta of X is going to equal G of theta transpose X, where G of Z is one over one plus e to the negative Z, so this becomes one over one plus e to the negative theta transpose X. G of Z is called the sigmoid function, and it is often also called the logistic function; it goes by either of these names. And what G of Z looks like is the following: on the horizontal axis I'm going to plot Z, and G of Z will look like this. (I didn't draw that very well.) So G of Z tends toward zero as Z becomes very negative, and G of Z tends toward one as Z becomes large, and it crosses the vertical axis at 0.5. So that's the sigmoid function, also called the logistic function. Yeah, question? [A student asks why not just use a step function at 0.5 instead, since the output is binary.] Let me come back to that later. So where did I get this function from? I just wrote this function down, but it actually turns out that there are two reasons for using it that we'll come to. One is that when we talk about generalized linear models, we'll see that this falls out naturally as part of that broader class of models. And another reason, which we'll talk about next week, is that there are a couple of, I think, very beautiful reasons for choosing the logistic function; we'll see that in a little bit. For now, let me just define it, and take my word for it that this is a reasonable choice. Okay? But notice that the values output by my hypothesis will now always be between zero and one.
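(A minimal sketch of the hypothesis just defined; the function names are mine and the example values are arbitrary.)

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); the output always lies in (0, 1)."""
    return sigmoid(theta @ x)

# g(0) = 0.5, g(z) -> 0 as z -> -inf, g(z) -> 1 as z -> +inf
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[0.0000454, 0.5, 0.99995]
```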
Furthermore, just like we did for linear regression, I'm going to endow the outputs of my hypothesis with a probabilistic interpretation. So I'm going to assume that the probability that Y is equal to one, given X and parameterized by theta, is equal to H subscript theta of X. In other words, I'm going to imagine that my hypothesis is outputting all these numbers that lie between zero and one, and I'm going to think of my hypothesis as trying to estimate the probability that Y is equal to one. Okay? And because Y has to be either zero or one, the probability that Y equals zero is going to be one minus that. All right? It turns out you can actually take these two equations and write them more compactly: P of Y given X, parameterized by theta, is H subscript theta of X to the power of Y, times one minus H of X to the power of one minus Y. Okay? I know this looks somewhat bizarre, but it actually makes the derivation much nicer. If Y is equal to one, then this equation is H of X to the power of one times something to the power of zero, and anything to the power of zero is just one; so if Y equals one, this is just saying P of Y equals one is equal to H subscript theta of X. And in the same way, if Y is equal to zero, then this is P of Y equals zero equals the first thing to the power of zero, which is just one, times the second thing to the power of one. So this is a compact way of writing both of those equations together in one line.

So let's do our parameter fitting. Again, you can ask: given this model of my data, how do I fit the parameters theta of my model? The likelihood of the parameters is, as before, just the probability of the data: the product over i of P of Y(i) given X(i), parameterized by theta.
Which is just plugging that compact expression in. Okay? I dropped the theta subscript on H just so I could write a little bit less. Oh, excuse me, and these should be X(i)'s and Y(i)'s. Okay?

So, as before, let's say we want to find the maximum likelihood estimate of the parameters theta; we want to find the setting of the parameters theta that maximizes the likelihood L of theta. It turns out, very often, when you work through the derivations, that it's much easier to maximize the log of the likelihood rather than the likelihood itself. So the log likelihood, lowercase l of theta, is just the log of capital L, which will therefore be a sum over the training examples. Okay? [A student points out a term that should be Y(i).] Oh, yes. Thanks. And so, to fit the parameters theta of our model, we'll find the value of theta that maximizes this log likelihood.

So, to maximize this function, it turns out we can actually apply the same gradient descent algorithm that we learned before. Remember, when we talked about least squares, the first algorithm we used to minimize the quadratic error function was gradient descent. We can use essentially the same algorithm to maximize the log likelihood. And you remember, that algorithm was just: repeatedly take the value of theta and replace it with the previous value of theta plus a learning rate alpha times the gradient of the objective, here the log likelihood, with respect to theta. Okay? One small change: previously we were trying to minimize the quadratic error term, and today we're trying to maximize rather than minimize, so rather than having a minus sign we have a plus sign.
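(Written out, as a reconstruction of the board work using the compact expression above, the quantity being maximized and the update just described are:)

$$
\ell(\theta) \;=\; \sum_{i=1}^{m}\Bigl[\,y^{(i)}\log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1-y^{(i)}\bigr)\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr],
\qquad
\theta := \theta + \alpha\,\nabla_\theta\,\ell(\theta).
$$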
So this is just gradient ascent, but for maximization rather than minimization; we actually call this gradient ascent, and it's really the same algorithm. So to figure out what this gradient is - in order to derive the gradient ascent update, what you need to do is compute the partial derivatives of your objective function with respect to each of your parameters theta J, right? It turns out that if you actually compute this partial derivative - so you take this lowercase l of theta, the log likelihood of theta, and take its partial derivative with respect to theta J - you find that it's equal to this. The derivation isn't terribly complicated, but in the interest of saving you from watching me write down a couple of blackboards full of math, I'll just write down the final answer. The way you get it is you plug in the definition of H subscript theta as a function of X(i), take derivatives, and work through the algebra; it simplifies down to this formula. Okay? And so what that gives you is that gradient ascent is the following rule: theta J gets updated as theta J plus alpha times the sum over i of Y(i) minus H theta of X(i), times X(i) subscript J. Okay?

Does this look familiar to anyone? Do you remember seeing this formula in the last lecture? Right. When I worked out batch gradient descent for least squares regression, I actually wrote down exactly the same thing, or maybe with a minus sign depending on how it's arranged; I had exactly the same learning rule last time for least squares regression. So is this the same learning algorithm, then? What's different?
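(Here is a minimal sketch of batch gradient ascent with that update rule, my own illustration; the learning rate, iteration count, and data are arbitrary.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.05, n_iters=5000):
    """Batch gradient ascent on the log likelihood of logistic regression.

    Update: theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij.
    X is (m, n+1) with a leading column of ones; y is (m,) of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)            # predictions in (0, 1)
        theta += alpha * X.T @ (y - h)    # gradient of the log likelihood
    return theta

# Tiny made-up 1-D example: label is 1 for larger x.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
theta = logistic_gradient_ascent(X, y)
print(theta, sigmoid(X @ theta).round(2))
```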
How come I was making all that noise earlier about least squares regression being a bad idea for classification problems, and then I did a bunch of math, skipped some steps, and am now, sort of, claiming at the end that they're really the same learning algorithm? [A student suggests that the difference is in the hypothesis itself.] Say that again. Oh, right. Okay, cool. No, exactly, right: this is not the same, and the reason is that in logistic regression this is different from before. The definition of H subscript theta of X(i) is not the same as the definition I was using in the previous lecture. In particular, this is no longer theta transpose X(i); it's not a linear function anymore. It's the logistic function applied to theta transpose X(i). Okay? So even though this looks cosmetically similar, even though this is similar on the surface to the batch gradient descent rule I derived last time for least squares regression, this is actually a totally different learning algorithm. Okay? And it turns out that it's actually no coincidence that you ended up with the same learning rule; we'll talk a bit more about this later when we discuss generalized linear models. This is one of the most elegant things about generalized linear models that we'll see later: even though we're using a different model, you end up with what looks like the same learning algorithm, and it's actually no coincidence. Cool.

One last comment, as part of, sort of, the learning process: over here I said I take the derivatives and I ended up with this line. I didn't want to make you sit through a long algebraic derivation, but later today or later this week, please do go home and look at the lecture notes, where I wrote out the entirety of this derivation in full, and make sure you can follow every single step of how we take partial derivatives of this log likelihood to get this formula over here. Okay?
By the way, for those of you who are interested in seriously mastering the machine learning material: when you go home and look at the lecture notes, it will actually be very easy for most of you to read through every line and go "yep, that makes sense, that makes sense," and, sort of, say "cool, I see how you get this line." But you want to make sure you really understand the material. My concrete suggestion to you would be to go home, read through the lecture notes, check every line, and then cover up the derivation and see if you can derive it yourself. In general, that's usually good advice for studying technical material like machine learning: if you work through a proof and you think you understood every line, the way to make sure you really understood it is to cover it up and see if you can rederive the entire thing yourself. I did this a lot when I was trying to study various pieces of machine learning theory and various proofs, and it's actually a great way to study, because you cover up the derivations and see if you can do them yourself without looking at the original. All right.

I probably won't get to Newton's method today. I just want to take one quick digression to talk about one more algorithm, which the discussion earlier was sort of alluding to: the perceptron algorithm. I'm not going to say a whole lot about the perceptron algorithm, but it's something we'll come back to later this quarter when we talk about learning theory. So in logistic regression we said that G of Z, and hence my hypothesis, outputs values that are numbers between zero and one. The question is: what if you want to force G of Z to output a value that is either zero or one? So the perceptron algorithm defines G of Z to be this.
So the picture, or the cartoon rather, is that instead of the sigmoid function, G of Z now looks like the step function you were asking about earlier: zero when Z is less than zero, and one when Z is greater than or equal to zero. As before, we use H subscript theta of X equals G of theta transpose X. Okay? So everything is exactly the same as before, except that G of Z is now the step function. It turns out there's a learning rule called the perceptron learning rule whose update looks exactly the same as the gradient ascent rule for logistic regression: theta J gets updated as theta J plus alpha times Y(i) minus H theta of X(i), times X(i) subscript J. So it looks just like the gradient ascent rule for logistic regression. But this is a very different flavor of algorithm than least squares regression and logistic regression, and, in particular, because it outputs only values that are either zero or one, it turns out to be very difficult to endow this algorithm with probabilistic semantics. And even though this learning rule looks, again, cosmetically very similar to what we have in logistic regression, it is actually a very different type of learning rule than the others we've seen in this class. Because this is such a simple learning algorithm - it just computes theta transpose X, thresholds it, and outputs zero or one - it's a simpler algorithm than logistic regression, I think. When we talk about learning theory later in this class, the simplicity of this algorithm will let us come back and use it as a building block. Okay? But that's all I want to say about this algorithm for now.
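(A minimal sketch of the perceptron as just described, my own illustration; the convention of updating one example at a time and the made-up data are mine.)

```python
import numpy as np

def step(z):
    """g(z) for the perceptron: hard threshold, outputting 0 or 1."""
    return (z >= 0).astype(float)

def perceptron(X, y, alpha=1.0, n_epochs=10):
    """Perceptron learning rule, applied one example at a time.

    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij,
    where h_theta(x) = step(theta^T x). X is (m, n+1) with x_0 = 1.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            h = step(x_i @ theta)             # prediction is exactly 0 or 1
            theta += alpha * (y_i - h) * x_i  # no change if the prediction is correct
    return theta

# Same tiny 1-D example as before: label 1 for larger x.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
theta = perceptron(X, y)
print(theta, step(X @ theta))   # predictions match the labels on this separable data
```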