This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, called REINFORCE and Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t? What if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that may be some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t equal to that matrix L_t times s_t. So L_t was this matrix of parameters that the dynamic programming algorithm for finite-horizon MDPs gives us in the LQR problem. And so we said, if only we knew what the state was, we'd choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate for what the state is given all the observations. In particular, I talked about the Kalman filter, with which we worked out our posterior distribution of what the state is given all the observations up to a certain time. So this is from last time.
So that, given the observations y_1 through y_t, our posterior distribution of the current state s_t was Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this thing, this s_{t|t}, which is going to be our best guess for what the state is currently. And then we choose actions using our estimate of what the state is, rather than using the true state, because we don't know the true state anymore in this POMDP.

So it turns out that this specific strategy actually allows you to choose optimal actions — allows you to choose actions as well as you possibly can, given that this is a POMDP and given that there are these noisy observations. It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.

Just to be concrete about the formalism of the POMDP — I should just write it here — a POMDP formally is a tuple like the MDP tuple, where the changes are that we add a set Y, which is the set of possible observations, and these O subscript s, which are the observation distributions. And so at each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O subscript s_t, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the states from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP.
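[The board work being recapped here isn't captured in the transcript; in the notation of the previous lecture, the strategy described is roughly the following, with the Kalman filter supplying the posterior and the LQR gain applied to its mean:]

$$ s_t \mid y_1, \dots, y_t \;\sim\; \mathcal{N}\big(s_{t|t},\, \Sigma_{t|t}\big), \qquad a_t = L_t\, s_{t|t}. $$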
In the more general case, that strategy — of designing a controller assuming full observability and then just estimating the state and plugging the two together — is often a very reasonable strategy for general POMDPs, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.

So what I want to do today is actually talk about a different class of reinforcement learning algorithms. These are called policy search algorithms. In particular, policy search algorithms can be applied equally well to MDPs — to fully observable Markov decision processes — or to these POMDPs, these partially observable MDPs. What I want to do now is actually just describe policy search algorithms applied to MDPs, applied to the fully observable case, and at the end I'll briefly describe how you can take policy search algorithms and apply them to POMDPs. In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard; but nonetheless, policy search algorithms turn out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.

So here's what we're going to do. In policy search, we're going to define some set, which I denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: this is the same as back when we were talking about supervised learning. The way we define the set capital Pi of policies and then search for a policy in this set capital Pi is analogous to supervised learning, where we defined a set script H of hypotheses and would search for a good hypothesis in this hypothesis class script H.
Policy search is sometimes also called direct policy search. To contrast this with the sorts of algorithms we've been talking about so far: in all the algorithms we've been talking about so far, we would try to find V-star — we would try to find the optimal value function — and then we'd use V-star, the optimal value function, to try to compute or approximate pi-star. So all the approaches we talked about previously are strategies for finding a good policy in which we first compute the value function and then go from that to the policy. In contrast, in policy search algorithms — in what are called direct policy search algorithms — the idea is that we're going to, quote, directly try to approximate a good policy, without going through the intermediate stage of trying to find the value function.

Let's see. Also, as I develop policy search, there's just one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and some labels y, and I sort of said, let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air. In the same way, as I define policy search algorithms, there'll be a step where I say, well, let's try to approximate what a good action is using a logistic function of the state. So again, I'll sort of pull a function out of the air. I'll say, let's just choose a function, and that'll be our choice of the policy class, and I'll say, let's take as input the state, and then we'll map it through a logistic function, and then hopefully we'll approximate what is a good action using a logistic function of the state.
So there's that sort of choice of policy class, which is again a little bit arbitrary, but it's arbitrary in the same way it was when we were talking about supervised learning.

So to develop our first policy search algorithm, I'm actually gonna need a new definition. For our first policy search algorithm, we'll actually need to work with stochastic policies. What I mean by a stochastic policy is a function pi that maps from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so we have the constraint that the sum over a of pi(s, a) equals one — in other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose you are executing some stochastic policy pi. What that means is — in this example, let's say I have three actions — suppose I'm in some state s. I would then compute pi(s, a1), pi(s, a2), and pi(s, a3) if I have a three-action MDP. These will be three numbers that sum up to one, and then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.

So as a concrete example — because the concept of why you want to use a stochastic policy is maybe a little bit hard to understand — let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm gonna use the inverted pendulum as my motivating example. It's that problem of balancing a pole.
We have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm gonna use phi to denote the angle of the pole, and I have two actions, where a1 is to accelerate right and a2 is to accelerate left. And choose a reward function that penalizes the pole falling over, or whatever.

Now let's come up with a stochastic policy for this problem. To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice. I'm gonna say that the probability of action a1, so pi(s, a1), I'm gonna write as 1/(1 + e^(-theta^T s)). I just chose the logistic function because it's a convenient function we've used a lot. So I'm gonna say that my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. If I'm executing that policy with parameters theta, that means the chance of my choosing to accelerate right is given by this number; and because my chances of executing actions a1 or a2 must sum to one, this also gives me pi(s, a2) = 1 - pi(s, a1). So this means that when I'm in some state s, I'm going to compute this number, 1/(1 + e^(-theta^T s)), and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.
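[To make this concrete, here is a minimal sketch of executing this two-action logistic policy. The lecture itself doesn't use code; Python, the function names, and the particular state-vector layout below are illustrative assumptions.]

```python
import numpy as np

def pi_a1(theta, s):
    """pi_theta(s, a1) = 1 / (1 + exp(-theta^T s)): probability of accelerating right."""
    return 1.0 / (1.0 + np.exp(-theta.dot(s)))

def sample_action(theta, s, rng=np.random):
    """Execute the stochastic policy: accelerate right with probability
    pi_theta(s, a1), accelerate left with the remaining probability."""
    return "accelerate_right" if rng.rand() < pi_a1(theta, s) else "accelerate_left"

# Illustrative state vector (1, x, x_dot, phi, phi_dot) with an intercept feature,
# and a theta that looks only at the pole angle phi.
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
s = np.array([1.0, 0.2, 0.0, 0.3, -0.1])   # pole leaning to the right (phi > 0)
print(sample_action(theta, s))              # accelerates right more often than left
```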
And again, just to give you a sense of why this might be a reasonable thing to do, let's say my state vector is the cart-pole state — the cart position, the cart velocity, the pole angle phi, and the angular velocity — and I added an extra one as an intercept term, just to give my logistic function an extra feature. If I choose the parameters theta in my policy so that the inner product theta^T s just picks out phi, then that means that at any state, the probability of my taking action a1 — the probability of my taking the accelerate-right action — is 1/(1 + e^(-theta^T s)), and since taking the inner product of theta and s just gives you phi, that equals 1/(1 + e^(-phi)).

And so if I choose my parameters theta this way, what that means is that, just depending on the angle phi of my inverted pendulum, the chance of my accelerating to the right is just this function of the angle of my inverted pendulum. And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing makes sense: if my pole's leaning over to the right, then I wanna accelerate to the right to catch it; and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it.

So this is one example for one specific choice of parameters theta. Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then, depending on the velocity of the cart and the angular velocity, you might want to change theta to take these other effects into account as well. Maybe if the pole's leaning far to the right but is actually on its way to swinging back, as specified by the angular velocity, then you might be less worried about having to accelerate hard to the right. And so these are the sorts of behavior you can get by varying the parameters theta.
And so our goal in policy search is to tune the parameters theta so that when we execute the policy pi subscript theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi_theta. We want to choose parameters theta to maximize that.

Are there questions about the problem setup, and policy search and policy classes, or anything?

Student: In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters?

Instructor (Andrew Ng): Oh, yeah. Right. So what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use a softmax parameterization. Similar to softmax regression, which we saw earlier in the class, you may have a set of parameters theta_1 through theta_D if you have D actions, and pi(s, a_i) equals e^(theta_i^T s) divided by the sum over j of e^(theta_j^T s). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.
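[A minimal sketch of that softmax parameterization, assuming D discrete actions with one parameter vector theta_i per action; the Python and the names here are illustrative, not from the lecture.]

```python
import numpy as np

def softmax_policy(thetas, s):
    """pi_theta(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s).
    `thetas` is a D x n array whose i-th row is theta_i."""
    logits = thetas.dot(s)
    logits -= logits.max()        # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_action(thetas, s, rng=np.random):
    """Draw an action index according to the softmax probabilities."""
    return rng.choice(len(thetas), p=softmax_policy(thetas, s))
```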
Anything else?

Student: [Inaudible] stochastic?

Instructor (Andrew Ng): Yes.

Student: So is it possible to [inaudible] a stochastic policy using numbers [inaudible]?

Instructor (Andrew Ng): I see. Given that the MDP has stochastic transition probabilities, is it possible to use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. But formally, it is possible. If you have a fixed policy, then you'd be able to do that. Anything else? Yeah — I guess even a [inaudible] class of policies can do that, but for the derivation later, I actually need to keep them separate.

Actually, could you just raise your hand if this makes sense? I know the concept of policy search is sometimes a little confusing. Okay, thanks.

So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it sort of captures its essence. Here's the idea.

In what I'm about to do, I'm going to assume that s_0 is some fixed initial state. (It turns out that if s_0 is drawn from some fixed initial-state distribution, then everything else [inaudible], but let's just say s_0 is some fixed initial state.) So my goal is to maximize this expected sum of rewards [inaudible], given that we're executing the policy; I'll drop that from the notation. So the random variables in this expectation are the sequence of states and actions s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.

So let me write out this expectation explicitly as a sum over all possible state-action sequences of the probability of that sequence times the payoff — that's what an expectation is: the probability of the random variables times that. Let me just expand out this probability. So the probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state — if this is a deterministic initial state, then all the probability mass is on one state; otherwise, there's some distribution over initial states.
Then times the probability that you chose action a_0 from that state s_0, and then times the probability that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability that you chose the next action there, and so on; the last terms are the transition probability into the final state s_T and then the probability of choosing action a_T there. In other words, the probability of the sequence is P(s_0) · pi_theta(s_0, a_0) · P_{s_0 a_0}(s_1) · pi_theta(s_1, a_1) · ... · pi_theta(s_T, a_T). So what I did was just take this probability of seeing this sequence of states and actions and expand it out explicitly like this.

It turns out later on I'm going to need to write this sum of rewards a lot, so I'm just gonna call it the payoff from now on. So whenever later in this lecture I write the word payoff, I just mean this sum. And our goal is to maximize the expected payoff — our goal is to maximize this sum.

Let me actually just skip ahead. I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm. This is how we're going to update the parameters: we're going to sample a state-action sequence — the way you do this is you just take your current stochastic policy and execute it in the MDP; so just go ahead and start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and do that for T time steps, and that's how you sample the state-action sequence. Then you compute the payoff, and then you perform this update.
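[The update itself was written on the board and isn't captured in the transcript. Consistent with the derivation that follows, what is essentially the REINFORCE update is theta := theta + alpha · (sum over t of ∇_theta log pi_theta(s_t, a_t)) · payoff. A minimal sketch, where the environment interface and helper functions are assumptions about one way you might implement it:]

```python
import numpy as np

def reinforce_step(theta, alpha, env_reset, env_step, sample_action, grad_log_pi, T):
    """One stochastic-gradient-ascent step of (essentially) REINFORCE.

    env_reset()              -> initial state s0
    env_step(s, a)           -> (next state, reward), sampled from the MDP
    sample_action(theta, s)  -> an action drawn from pi_theta(s, .)
    grad_log_pi(theta, s, a) -> gradient of log pi_theta(s, a) w.r.t. theta
                                (equal to grad pi_theta(s, a) / pi_theta(s, a))
    """
    s = env_reset()
    grad_sum = np.zeros_like(theta)
    payoff = 0.0
    for _ in range(T + 1):                  # t = 0, 1, ..., T
        a = sample_action(theta, s)         # a_t ~ pi_theta(s_t, .)
        grad_sum += grad_log_pi(theta, s, a)
        s, reward = env_step(s, a)          # s_{t+1} ~ P_{s_t a_t}
        payoff += reward                    # payoff = sum of rewards
    # On expectation this step moves theta uphill on the expected payoff.
    return theta + alpha * grad_sum * payoff
```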
So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every step it updates theta according to this thing on the right-hand side, and that thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw. Your state-action sequence is random, so on every step I'll take a step that's chosen randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how it changes the parameters theta. In particular, I want to know what is the expected value of the change to my parameters theta. Our goal is to maximize the expected payoff, so as long as the updates, on expectation, are on average taking us uphill on the expected payoff, then we're happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember when I talked about stochastic gradient descent for least-squares regression, I said that you have some parameters — maybe you're trying to minimize a quadratic function — and the parameters will wander around randomly until they get close to the optimum of the [inaudible] quadratic surface. It turns out that the REINFORCE algorithm will be very much like that. It will be a stochastic gradient ascent algorithm in which the step we take on every iteration is a little bit random — it's determined by the random state-action sequence — but on expectation, this turns out to be essentially a gradient ascent algorithm. And so it'll wander around randomly, but on average take you towards the optimum.

So let me go ahead and prove that now. What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then I'll hopefully show that by deriving a gradient ascent update rule, I end up with this thing on expectation.
So before I do the derivation, let me just remind you of the chain rule — the product rule — for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is given by taking the derivatives of these things one at a time, so it's f'(theta) g(theta) h(theta) plus f(theta) g'(theta) h(theta) plus f(theta) g(theta) h'(theta). First I differentiate f, getting f-prime, leaving the other two fixed; then I differentiate g, leaving the other two fixed; then I differentiate h, getting h-prime, leaving the other two fixed. So that's the product rule for derivatives.

Now refer back to the equation where earlier we wrote out the expected payoff — this sum over all state-action sequences of the probability times the payoff. What I'm going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function — take the derivative of this function with respect to theta and then try to go uphill on it. So, using the product rule, when I take the derivative of this function with respect to theta, what I get is a sum of terms: there are a lot of terms here that depend on theta, and so I'll end up having one term that corresponds to the derivative of the first pi_theta factor, keeping everything else fixed, one term from the derivative of the next one, keeping everything else fixed, and so on, and one term from the derivative of the last factor, keeping everything else fixed. So just apply the product rule to this.

Let's write that down. So I have that the derivative with respect to theta of the expected value of the payoff is — it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how that goes, but this is a four-step derivation. So there's the sum over the state-action sequences as we saw before, then all of those derivative terms in a bracket —
close the bracket, and then times the payoff. So that huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta one at a time. This was the term with the derivative of the first pi_theta(s_0, a_0); there's the first derivative term, there's the second one, then you have plus dot-dot-dot, and that's my last term. So that was step one of four.

And so, by algebra — let me just write this down and convince us all that it's true — this is the second of four steps, in which you just convince yourself that if I take the sum in the bracket and multiply it by that big product in front, then I get back the same sum of terms. For example, when I multiply that product against the first fraction, the pi_theta(s_0, a_0) in the product cancels against the pi_theta(s_0, a_0) in the denominator and gets replaced with the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But that term on top is just what I worked out previously — it was the joint probability of the state-action sequence — and now I have that, times the bracketed sum, times the payoff. And so, by the definition of expectation, this is just equal to the expectation of that bracketed sum times the payoff. And this thing inside the expectation is exactly the step that we were taking in the inner loop of our REINFORCE algorithm — roughly the REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff. That's how I started this whole derivation: I said, let's look at our expected payoff and take the derivative of that with respect to theta.
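[The four steps of board work aren't captured in the transcript; written out, the identity being derived here is]

$$ \nabla_\theta\, \mathrm{E}[\text{payoff}] \;=\; \mathrm{E}\!\left[\left(\sum_{t=0}^{T} \frac{\nabla_\theta\, \pi_\theta(s_t, a_t)}{\pi_\theta(s_t, a_t)}\right)\cdot \text{payoff}\right] \;=\; \mathrm{E}\!\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t)\right)\cdot \text{payoff}\right], $$

[where the initial-state and state-transition probabilities drop out of the gradient because they don't depend on theta, and recombining them with the undifferentiated factors gives back the joint probability of the trajectory — that's the algebra step just described.]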
What we've proved is that on expectation, the step direction I take in REINFORCE is exactly the gradient of the thing I'm trying to optimize. This shows that this algorithm is a stochastic gradient ascent algorithm. I wrote a lot — why don't you take a minute to look at the equations and check whether everything makes sense. I'll erase a couple of boards and then check if you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

Some comments. We talked about those value function approximation approaches, where you approximate V-star and then go from V-star to pi-star, and there are also these policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable. It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi. So the question really is: for your problem, does there exist a simple function — like a linear function or a logistic function — that maps from features of the state to the action and works pretty well? For the inverted pendulum problem, this is quite likely to be true: going through the different choices of parameters, you can express things like, if the pole's leaning towards the right, then accelerate towards the right to try to catch it. So for the inverted pendulum, this is probably true. The same goes for lots of what are called low-level control tasks — things like driving a car, the low-level reflexes of do you steer your car left to avoid another car, do you steer your car left to follow the road, or flying a helicopter — again, very short-time-scale types of decisions. I like to think of these as the kinds of decisions a trained operator makes, like a trained driver or a trained pilot. They would almost be reflexes, these sorts of very quick, instinctive things where you map very directly from the input data to an action.
These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well. In contrast, if you have problems that require long, multi-step reasoning — things like a game of chess, where you have to reason carefully about, if I do this, then they'll do that, then I'll do this, then they'll do that — those I think of as less instinctual, very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead. Let me say more about this later.

Just as a side comment, it turns out also that if you have a POMDP, a partially observable MDP — I don't want to say too much about this — it turns out that if you only have an approximation, let's call it s-hat, of the true state — and this could be s-hat equal to s_{t|t} from a Kalman filter — then you can still use these sorts of policy search algorithms, where you use pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that only looks at your estimate of the state to choose the action. By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I wanna talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well but is often extremely slow.
So it [inaudible] works, but one thing to watch out for is that because you're taking these gradient ascent steps that are very noisy — you're sampling a state-action sequence and then taking a gradient ascent step in essentially a random direction that only on expectation is correct — the gradient ascent direction for REINFORCE can sometimes be a bit noisy. And so it's not that uncommon to need, like, a million iterations of gradient ascent, or ten million, or 100 million iterations of gradient ascent for REINFORCE [inaudible], so that's just something to watch out for.

One consequence of that is that in the REINFORCE algorithm — I shouldn't really call it REINFORCE; in what's essentially the REINFORCE algorithm — there's this step where you need to sample a state-action sequence. In principle, you could do this on your own robot: if there were a robot you were trying to control, you could actually physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to [inaudible] your robot ten million times. You can easily run ten thousand or ten million simulations of your robot, but you might not want to have your robot physically repeat some action ten million times. So I personally have seen many more applications of REINFORCE used to learn with a simulator than to actually do this on a physical device.

The last thing I wanted to do is tell you about one other algorithm, one final policy search algorithm. [Inaudible] the laptop display, please. It's a policy search algorithm called Pegasus, which is actually what we used on our autonomous helicopter flight work for many years; there are some other things we do now. So here's the idea. There's a reminder slide on the RL formalism — there's nothing here that you don't know, but I just want to pictorially describe the RL formalism because I'll use that later.
I'm gonna draw the reinforcement learning picture as follows. You initialize the system — say a helicopter or whatever — in some state s_0, you choose an action a_0, and then the, say, helicopter dynamics take you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you ended up in, and that's your total payoff. So this is just a picture I wanna use to summarize the RL problem. Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by a green box. So our policy — and I'll switch back to deterministic policies for now — my deterministic policy will be some function mapping from the states to the actions.

As a concrete example, you can imagine that in the policy search setting, you may have a linear class of policies. So you may imagine that the action a will be, say, a linear function of the state, and your goal is to learn the parameters of the linear function. So imagine trying to do linear regression on policies, except you're trying to optimize the reinforcement learning objective. So just imagine that the action a is theta transpose s, and you do policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies. So there's the idea.

Quite often we come up with a model or a simulator for the MDP. As before, a model or a simulator is just a box that takes as input some state s_t and some action a_t, and then outputs some next state s_{t+1} that you might end up in in the MDP. This s_{t+1} will be a random state: it will be drawn from the random state transition probabilities of the MDP. This is important — very important — s_{t+1} will be a random function of s_t and a_t. In the simulator, this is [inaudible].
So for example, for autonomous helicopter flight, you might build a simulator using supervised learning - an algorithm like linear regression, or some [inaudible] of linear regression - so that you get a nonlinear model of the dynamics of what ST+1 is as a random function of ST and AT. Now, once you have a simulator, given any fixed policy you can quite straightforwardly evaluate that policy in the simulator. Concretely, our goal is to find a policy pi mapping from states to actions - the goal is to find the green box like that - that works well. So if you have any one fixed policy pi, you can evaluate it just using the simulator, via the picture shown at the bottom of the slide. Concretely, you take your initial state S0 and feed it into the policy pi; the policy outputs some action A0; you plug that into the simulator; the simulator outputs a random state S1; you feed S1 into the policy; and so on, and you get a sequence of states S0 through ST that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation, see how well it does, and measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once. In general, you would run the policy through the simulator some number of times and average over, say, M simulations to get a better estimate of the policy's expected payoff. So now, given any one fixed policy, this gives me a way to evaluate its expected payoff. One reasonably obvious thing you might then try to do is search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff.
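[Editor's note: a sketch of that evaluation loop, assuming a hypothetical simulator(s, a, rng) that returns a random next state, a reward function R(s, a), a horizon T, and M rollouts; none of these names come from the lecture itself.]

```python
import numpy as np

def estimate_payoff(theta, simulator, R, s_init, T, M, seed=0):
    """Monte Carlo estimate of a policy's expected payoff: run M rollouts of
    the linear policy a = theta^T s through the random simulator, sum the
    rewards along each rollout, and average."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(M):
        s = np.asarray(s_init, dtype=float)
        payoff = 0.0
        for _ in range(T):
            a = theta.T @ s              # deterministic linear policy
            payoff += R(s, a)            # accumulate reward along the trajectory
            s = simulator(s, a, rng)     # random next state drawn by the model
        total += payoff
    return total / M
```

Because the simulator is random, calling estimate_payoff twice on the same theta (with different random draws) will generally return slightly different numbers - which is exactly the difficulty discussed next.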
Does that make sense? So my policy has some parameters theta; say my policy is that my action A is equal to theta transpose S, if it's a linear policy. For any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator. One thing I might try to do is search for parameters theta that maximize my estimated payoff. It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate a policy, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a very slightly different answer. So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies. In other words, as I vary the parameters theta in my policy, I get different policies and so I move along the X axis, and I'm going to plot the total payoff on the vertical axis; the red line in this cartoon is the expected payoff of the different policies. My goal is to find the policy with the highest expected payoff. You could search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that just by chance looked a little bit better than it should have. If I evaluate a second policy, just by chance it might look a little bit worse. I evaluate a third policy, a fourth, and sometimes I might actually evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random: when I run the same policy twice in simulation, I might get back slightly different answers. So as I evaluate more and more policies, these are the pictures I get. My goal is to try to optimize the red line. I hope you appreciate that this is a hard problem, especially when all the optimization algorithm gets to see are these black dots; it doesn't have direct access to the red line.
And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a 10-D space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem. So it turns out there's one way to make this optimization problem much easier. Here's the idea. The method is called Pegasus, which is an acronym for something; I'll tell you later. So the simulator repeatedly makes calls to a random number generator to generate random numbers RT, which are used to simulate the stochastic dynamics. What I mean by that is that the simulator takes as input a state and an action, and it outputs the next state randomly, and if you peer into the simulator - if you open the box of the simulator and ask how it is generating these random next states ST+1 - pretty much the only way to write a simulator with random outputs is to make calls to a random number generator to get random numbers, these RTs, and then have some function that takes as input S0, A0, and the result of your random number generator, and computes the next state as a function of those inputs and of the random number it got from the random number generator. This is pretty much the only way anyone implements any random code, any code that generates random outputs: you make a call to a random number generator, and you compute some function of the random number and of your other inputs. The reason that when you evaluate different policies you get different answers is that every time you rerun the simulator, you get a different sequence of random numbers from the random number generator, and so you get a different answer every time, even if you evaluate the same policy twice. So here's the idea. Let's say we fix in advance the sequence of random numbers: choose R1, R2, up to RT minus one.
Fix the sequence of random numbers in advance, and always use that same sequence of random numbers to evaluate different policies. Go into your code and fix R1, R2, through RT minus one: choose them randomly once and then fix them forever. If you always use the same sequence of random numbers, then the system is no longer random, and if you evaluate the same policy twice, you get back exactly the same answer. These sequences of random numbers are called scenarios, and Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios. When you do that, this is the picture you get. As before, the red line is your expected payoff, and by fixing the random numbers, you've defined some estimate of the expected payoff. As you evaluate different policies, these are still only approximations to their true expected payoff, but at least now you have a deterministic function to optimize, and you can apply your favorite algorithm, be it gradient ascent or some sort of local search, to try to optimize the black curve. This gives you a much easier optimization problem, and you can optimize it to find, hopefully, a good policy. So this is the Pegasus policy search method. When I started to talk about reinforcement learning, I showed that video of a helicopter flying upside down. That was actually done using exactly this method, exactly this policy search algorithm. This seems to scale well even to fairly large problems, even to fairly high-dimensional state spaces. With Pegasus, the optimization problem is much easier than the stochastic version, but it's sometimes still not entirely trivial, so you typically apply this sort of method with maybe on the order of ten parameters, or tens of parameters - say 30 or 40 parameters - but not thousands of parameters, at least for these sorts of problems.
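[Editor's note: a minimal sketch of Pegasus-style evaluation, assuming the simulator has been refactored into a deterministic function simulator_det(s, a, r) that takes the random numbers as an explicit input; the function names, shapes, and scenario count are illustrative assumptions.]

```python
import numpy as np

def make_scenarios(M, T, noise_dim, seed=0):
    """Sample M 'scenarios' (sequences of random numbers) once, then keep them fixed."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((M, T, noise_dim))

def pegasus_payoff(theta, simulator_det, R, s_init, scenarios):
    """Deterministic payoff estimate: average the rollout payoff over a fixed
    set of scenarios. Evaluating the same theta twice returns the same number."""
    total = 0.0
    for scenario in scenarios:               # scenario = fixed r_0, ..., r_{T-1}
        s = np.asarray(s_init, dtype=float)
        payoff = 0.0
        for r_t in scenario:
            a = theta.T @ s                  # linear policy
            payoff += R(s, a)
            s = simulator_det(s, a, r_t)     # next state as a deterministic function of (s, a, r_t)
        total += payoff
    return total / len(scenarios)
```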
Student: So is that method different than just assuming that you know your simulator exactly - just throwing away all the random numbers entirely?

Instructor (Andrew Ng): So is this different from assuming that we have a deterministic simulator? The answer is no. Here's how you actually do this: for the sake of simplicity I talked about one sequence of random numbers. Imagine that the random numbers are simulating different wind gusts against your helicopter. What you want isn't really to evaluate against just one pattern of wind gusts; what you want to do is sample some set of different patterns of wind gusts and evaluate against all of them on average. So what you actually do is sample, say, 100 sequences of random numbers - some number I made up, like 100 - and every time you want to evaluate a policy, you evaluate it against all 100 sequences of random numbers and then average. This is in exactly the same way that, on this earlier picture, you wouldn't necessarily evaluate the policy just once; you evaluate it maybe 100 times in simulation and then average to get a better estimate of the expected reward. In the same way, you do that here, but with 100 fixed sequences of random numbers. Does that make sense? Any other questions?

Student: If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]?

Instructor (Andrew Ng): Yeah, I guess you're right. For a fixed policy, the quality of the approximation is equally good in both cases. The advantage of fixing the random numbers is that you end up with an optimization problem that's much easier. I have some search problem: on the horizontal axis there's a space of control policies, and my goal is to find a control policy that maximizes the payoff. The problem with the earlier setting was that when I evaluate policies I get these noisy estimates, and then it's just very hard to optimize the red curve if I have these points that are all over the place.
And if I evaluate the same policy twice, I don't even get back the same answer. By fixing the random numbers, the algorithm still doesn't get to see the red curve, but at least it's now optimizing a deterministic function. That makes the optimization problem much easier. Does that make sense?

Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves that are easy to optimize. And then you smush them together?

Instructor (Andrew Ng): Let's see. I have just one nice black curve that I'm trying to optimize.

Student: For each scenario.

Instructor (Andrew Ng): I see. So I'm going to average over M scenarios - say, average over 100 scenarios - so the black curve here is defined by averaging over a large set of scenarios. Does that make sense? If the averaging doesn't make sense, imagine that there's just one sequence of random numbers; that might be easier to think about. Fix one sequence of random numbers, and every time I evaluate another policy, I evaluate it against that same sequence of random numbers, and that gives me a nice deterministic function to optimize. Any other questions?

The advantage is really that - one way to think about it is - when I evaluate the same policy twice, at least I get back the same answer. This gives me a deterministic function mapping from the parameters of my policy to my estimate of the expected payoff. That's just a function that I can try to optimize using a search algorithm. So we used this algorithm for inverted hovering, and again, policy search algorithms tend to work well when you can find a reasonably simple policy mapping from the states to the actions. This is especially true for low-level control tasks, which I think of as almost reflexes.
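[Editor's note: the lecture only says to use "your favorite algorithm, be it gradient ascent or some sort of local search" on the now-deterministic objective. As one hedged illustration, here is a simple hill-climbing sketch over theta; the step size, iteration count, and the objective wiring are my own assumptions.]

```python
import numpy as np

def local_search(objective, theta0, n_iters=500, step=0.05, seed=0):
    """Simple hill climbing over policy parameters.
    'objective' should be the deterministic Pegasus estimate, e.g.
        objective = lambda th: pegasus_payoff(th, simulator_det, R, s_init, scenarios)
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    best = objective(theta)
    for _ in range(n_iters):
        candidate = theta + step * rng.standard_normal(theta.shape)  # random local perturbation
        value = objective(candidate)   # same fixed scenarios every call, so this is deterministic
        if value > best:               # keep the move only if the estimated payoff improves
            theta, best = candidate, value
    return theta, best
```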
Conversely, if you want to solve a problem like Tetris, where you might plan ahead a few steps about what's a nice configuration of the board, or something like a game of chess, or long path-planning problems - go here, then go there, then go there - then you might sometimes apply a value function method instead. But for tasks like helicopter flight, for low-level control tasks, for the reflexes of driving or controlling various robots, policy search algorithms - where we can sometimes more easily approximate the policy directly - work very well.

A few words on the state of RL today. RL algorithms are applied to a wide range of problems, and the key is really sequential decision making. The place where you think about applying a reinforcement learning algorithm is when you need to make a decision, then another decision, then another decision, and some of your actions may have long-term consequences. I think that is the heart of RL: sequential decision making, where you make multiple decisions and some of your actions may have long-term consequences. I've shown you a bunch of robotics examples. RL is also applied to things like medical decision making, where you may have a patient and you want to choose a sequence of treatments: you do this now for the patient, the patient may then be in some other state, you choose to do that later, and so on. It turns out there's a large community of people applying these sorts of tools to queues. Imagine you have a bank where people line up, and after they go to one cashier, some of them have to go to the manager to deal with something else. You have a system of multiple people standing in multiple queues, and the question is how to route people optimally to minimize the waiting time - and not just people, but objects in an assembly line and so on. It turns out there's a surprisingly large community working on optimizing queues. I mentioned game playing a little bit already.
There are things like financial decision making: if you have a large amount of stock, how do you time the selling off of your stock so as to not affect market prices adversely too much? There are many operations research problems, things like factory automation: can you design a factory to optimize throughput, or minimize cost, or whatever? These are all sorts of problems that people are applying reinforcement learning algorithms to.

Let me just close with a few robotics examples, because they're always fun, and we just have these videos. This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students here. They were working on getting a robot dog to climb over difficult, rugged terrain. Using a reinforcement learning algorithm, they applied an approach that's similar to a value function approximation approach - not quite, but similar. They allowed the robot dog to plan ahead multiple steps, carefully choose its footsteps, and traverse rugged terrain. This is really the state of the art in terms of what you can get a robotic dog to do.

Here's another fun one. It turns out that wheeled robots are very fuel-efficient; cars and trucks are almost the most fuel-efficient robots in the world. Cars and trucks are very fuel-efficient, but legged robots have the ability to traverse more rugged terrain. So this is a robot - actually a small-scale mockup of a larger vehicle built by Lockheed Martin - and the question is: can you come up with a vehicle that has wheels, and so has the fuel efficiency of wheeled robots, but also has legs so it can traverse obstacles?
Again, they used a reinforcement learning algorithm to design a controller for this robot to make it traverse obstacles, with somewhat complex gaits that would be very hard to design by hand. But by choosing a reward function - telling the robot there's a plus-one reward at the top of the goal, and a few other things - it learns these sorts of policies automatically.

A couple of last fun ones - I'll show you a couple of last helicopter videos. This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, who have been working on autonomous flight. I'll just let you watch. I'll just show you one more.

Student: [Inaudible] do this with a real helicopter [inaudible]?

Instructor (Andrew Ng): Not a full-size helicopter - only small radio-controlled helicopters.

Student: [Inaudible].

Instructor (Andrew Ng): Full-size helicopters are the wrong design for this. You shouldn't do this on a full-size helicopter; many full-size helicopters would fall apart if you tried to do this. Let's see, there's one more.

Student: Are there any human [inaudible]?

Instructor (Andrew Ng): Yes, there are very good human pilots that can. This is just one more maneuver. That was kind of fun. So this is the work of Pieter Abbeel and Adam Coates.

So that was it - that was all the technical material I wanted to present in this class. I guess you're all experts on machine learning now. Congratulations. I hope you've gotten the sense through this class that this is one of the technologies that's really having a huge impact on science, engineering, and industry. As I said in the first lecture, I think many people use machine learning algorithms dozens of times a day without even knowing it. Based on the projects you've done, I hope that most of you will be able to imagine yourselves going out after this class and applying these things to solve a variety of problems.
Hopefully, some of you will also imagine yourselves writing research papers after this class, be it on a novel way to do machine learning, or on some way of applying machine learning to a problem that you care about. In fact, looking at the project milestones, I'm actually sure that a large fraction of the projects in this class will be publishable, so I think that's great. I guess many of you will also go into industry, make new products, and make lots of money using learning algorithms. Remember me if that happens. One of the things I'm excited about is that, through research or industry, I'm actually completely sure that the people in this class will, in the next few months, apply machine learning algorithms to lots of problems: in industrial management and computer science, things like optimizing computer architectures, network security, robotics, computer vision, problems in computational biology, problems in aerospace, understanding natural language, and many more things like that. So right now I have no idea what all of you are going to do with the learning algorithms you've learned about, but every time I wrap up this class, I always feel very excited, and optimistic, and hopeful about the sorts of amazing things you'll be able to do.

One final thing - I'll just give out this handout. One final thing is that machine learning has grown out of a larger literature on AI, out of this desire to build systems that exhibit intelligent behavior. Machine learning is one of the tools of AI - maybe one that's had a disproportionately large impact - but there are many other ideas in AI that I hope you go and continue to learn about.
Fortunately, Stanford has one of the best and broadest sets of AI classes, and I hope that you take advantage of some of these classes and go and learn more about AI, and more about other fields that often apply learning algorithms to problems in vision, natural language processing, robotics, and so on. The handout I just gave out has a list of AI-related courses. Running down it very quickly: CS221 is an overview class that I teach. There are a lot of robotics classes as well - 223A, 225A, 225B - and there are so many applications of learning algorithms to robotics today. 222 and 227 are knowledge representation and reasoning classes. Of all the classes on this list, 228, which Daphne Koller teaches, is probably closest in spirit to 229; it's one of the classes I highly recommend to all of my PhD students as well. Many other classes also touch on machine learning. On the next page there are courses on computer vision, speech recognition, and natural language processing - various areas that aren't just machine learning, but often involve machine learning in many ways - and other aspects of AI, such as multi-agent systems taught by [inaudible]. EE364A is convex optimization, a class taught by Steve Boyd, and convex optimization came up many times in this class; if you want to become really good at it, EE364A is a great class. If you're interested in project courses, I also teach a project class next quarter where we spend the whole quarter working on research projects. So I hope you go and take some more of those classes.

In closing, let me just say that this class has been really fun to teach, and it's very satisfying to me personally when we set these insanely difficult homeworks and then see a solution, and I think, "Oh my god, they actually figured that one out." It's actually very satisfying when I see that. Or looking at the milestones, I often go, "Wow, that's really cool - I bet this would be publishable."
So I hope you take what you've learned, go forth, and do amazing things with learning algorithms. I know this is a heavy-workload class, so thank you all very much for the hard work you've put into this class and into learning this material, and thank you very much for having been students in this class.