This subtitle is provided from lecsub.jimdo.com under Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. So I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and so we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in the Skilling Auditorium, in Skilling 193, at 3:45. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual also.

Okay, welcome back. What I want to do today is continue our discussion on reinforcement learning in MDPs. It's quite a lot for me to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. So I'll talk just very briefly about discretization. I'll spend a lot of time talking about models, or simulators, of MDPs, and then talk about one algorithm called fitted value iteration, and two functions which build on that, and then hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap: in the previous lecture, I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation. I said that an MDP, or Markov decision process, was a 5-tuple (S, A, {P_sa}, γ, R), comprising those things, and the running example I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP that I was using, it had 11 states, so that's what S was. The actions were compass directions: north, south, east and west.
The state transition probabilities P_sa capture the chance of transitioning to each state when you take a given action in a given state, and so in our example that captured the stochastic dynamics of our robot wandering around the grid, and we said if you take the action north, you have a 0.8 chance of actually going north, a 0.1 chance of veering off to the left, and a 0.1 chance of veering off to the right; so that was the model of the robot's noisy dynamics with a [inaudible]. And the reward function was +1 or -1 at the absorbing states and -0.02 elsewhere. This is an example of an MDP, and that's what these five things were. Oh, and I used a discount factor γ, usually a number slightly less than one, so that's the 0.99.

And so our goal was to find the policy, the control policy, and that's π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined value functions V^π(s) to be equal to this. We said that the value of a policy π from a state s was given by the expected value of the sum of discounted rewards, conditioned on your executing the policy π and your starting off, your initial state, being the state s.

And so our strategy for finding the policy was sort of comprised of two steps. So the goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and I said last time that one strategy for finding the optimal policy is to first compute the optimal value function, which I denoted V*(s) and which is defined like that. It's the maximum value that any policy can obtain, and for example, the optimal value function for that MDP looks like this.
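In symbols, the quantities just referred to are roughly the following (standard MDP notation; s_0, s_1, s_2, ... is the state sequence you get by starting in s_0 = s and executing π):

```latex
V^{\pi}(s) = \mathbb{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \,\middle|\, s_0 = s,\ \pi \right],
\qquad
V^{*}(s) = \max_{\pi} V^{\pi}(s),
\qquad
\pi^{*}(s) = \arg\max_{a \in A} \sum_{s'} P_{sa}(s')\, V^{*}(s').
```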
So in other words, starting from any of these states, what's the expected value of the sum of discounted rewards you get; so this is V*. We also said that once you've found V*, you can compute the optimal policy using this.

And so, once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*(s), the optimal sum of discounted rewards you can get starting from state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max over all the actions you could take of your expected future sum of discounted rewards, your future payoff starting from the state s', which is where you might transition to after one step; that is, V*(s) = R(s) + γ max_a Σ_{s'} P_sa(s') V*(s').

And so this gave us a value iteration algorithm, which was essentially V.I. (I'm abbreviating value iteration as V.I.), so in the value iteration algorithm, in V.I., you just take Bellman's equations and you repeatedly do this. So initialize some guess of the value function, initializing it to zero as the starting guess, and then repeatedly perform this update, V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s'), for all states, and I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), from which you can compute the optimal policy π*.

Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So let's see: given a random initial policy, we'll solve for V^π; we'll solve for the value function for that specific policy. So this means, for every state, compute the expected sum of discounted rewards you get if you execute the policy π
from that state. And then the other step of policy iteration is: having found the value function for your policy, you then update the policy, pretending that you've already found the optimal value function V*, and then you repeatedly perform these two steps, where you solve for the value function for your current policy and then pretend that that's actually the optimal value function and solve for the policy given the value function; and you repeatedly update the value function, or update the policy using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and this will cause π to converge to π*, the optimal policy.

So that was based on our last lecture on MDPs, which introduced a lot of new notation and symbols, and that just summarizes all of it again. What I'm about to do now, what I'm about to do for the rest of today's lecture, is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to go on, please. Yeah.

Student: [Inaudible] how those two algorithms are very different?

Instructor (Andrew Ng): I see, right, so yeah, do you see that they're different? Okay, how it's different. Let's see. So, well, here's one difference. I didn't say this because we won't really use it today. So value iteration and policy iteration are different algorithms. In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy, and so you're given some fixed policy π, meaning some function mapping from the states to the actions. So I give you some policy, and whatever; that's just some policy, it's not a great policy. And in that step that I circled, we have to find V^π
of s, which means that for every state you need to compute your expected sum of discounted rewards, if you execute this specific policy, starting off the MDP in that state s.

So I showed this last time; I won't go into the details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There was a form of Bellman's equations for V^π, and it turned out that if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function for a fixed policy by solving a system of linear equations with 11 variables and 11 constraints. And so that's policy iteration; whereas in value iteration, going back to this board, in value iteration you sort of repeatedly perform this update where you update the value of a state as the [inaudible]. So I hope that makes sense, that these algorithms are different.

Student: [Inaudible] on the absorbing states, so is the assumption that we can never get out of those states?

Instructor (Andrew Ng): Yes. There's always things that you, where you solve for this [inaudible], for example, and make the numbers come up nicely, but I don't wanna spend too much time on them. But yeah, so the assumption is that once you enter an absorbing state, then the world ends, or there are no more rewards after that. And another way to think of the absorbing states, which is sort of mathematically equivalent, is that you can think of the absorbing states as transitioning with probability 1 to some 12th state, and then once you're in that 12th state, you always remain in that 12th state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the 12th state is this zero-cost absorbing state that you get stuck in forever.

Other questions? Yeah, please go.

Student: Where did the Bellman's equations [inaudible] to optimal value [inaudible]?
Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equations, this equation that I'm pointing to, I sort of tried to give a justification for it last time. I'll say it in one sentence: the expected total payoff I expect to get from the state s is equal to my immediate reward, which is the reward I get for starting in the state. Let's see. If I start in the state, I'm gonna get some first reward, and then I can transition to some other state, and then from that other state I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces. It says the value of a state is equal to the reward you get right away plus the rest; really, V*(s) is equal to R(s) + γ times the rewards you get in the future, and that second piece, the future rewards, turns out to be equal to that second term, the max over a of Σ_{s'} P_sa(s') V*(s').

I spent more time justifying this in the previous lecture, although, yeah, hopefully for the purposes of this lecture, if you're not sure where this came from, if you don't remember the justification of it, why don't you just maybe take my word for it that this equation holds true, since I use it a little bit later as well; and the lecture notes explain a little further the justification for why this equation holds true. But for now, yeah, just take my word for it that this holds true, 'cause we'll use it a little bit later today as well.

Student: [Inaudible] and would it be, in sort of, turned back into [inaudible]?

Instructor (Andrew Ng): Actually, [inaudible], right, the question is: in policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration? And the answer is sort of no. Let's see.
It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but yeah, it actually turns out that, oh, no, the algorithms are not equivalent. It's just that in policy iteration there is a step where you're solving for the value function for the policy, solving for V^π; usually you can do this, for instance, by solving a linear system of equations. Value iteration is a different algorithm, yes. I hope it makes sense that, at least cosmetically, it's different.

Student: [Inaudible] you have [inaudible] representing π implicitly, then you won't have to solve that [inaudible] equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see, solving for V^π like this works only if you have a fixed policy; so once you change the value function, if π changes as well, then it's sort of hard to solve this. Yeah, so later on we'll actually talk about some examples of when π is implicitly represented, but at least for now, I think, yeah, maybe there's a way to redefine something so it maps onto value iteration, but that's not usually done. These are viewed as different algorithms.

Okay, cool, so all good questions. Let me move on and talk about how to generalize these ideas to continuous states.

Everything we've done so far has been for discrete states, or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function V(s), could then be represented using an array of 11 numbers, 'cause if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.
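As a concrete illustration of the two algorithms just recapped on a finite-state MDP like this 11-state grid, here is a minimal sketch (my own illustration, not code from the lecture); it assumes the transition matrices P[a] and the reward vector R are given as NumPy arrays, and the policy-evaluation step solves the 11-equations-in-11-unknowns linear system mentioned above.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=1000):
    """Discrete value iteration.
    P[a] is an (n x n) transition matrix for action a; R is a length-n reward vector.
    Returns V, an array of n numbers approximating V*."""
    n = len(R)
    V = np.zeros(n)                                          # initialize V(s) := 0 for all s
    for _ in range(iters):
        # Bellman backup: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
        V = R + gamma * np.max([P[a] @ V for a in range(len(P))], axis=0)
    return V

def evaluate_policy(P, R, gamma, pi):
    """Policy evaluation: solve (I - gamma * P_pi) V = R exactly, i.e. the
    linear system with one equation per state (11 of them in the grid example)."""
    n = len(R)
    P_pi = np.array([P[pi[s]][s] for s in range(n)])          # row s is P_{s, pi(s)}(.)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def greedy_policy(P, V):
    """pi(s) := argmax_a sum_{s'} P_sa(s') V(s'), the policy-improvement step."""
    return np.argmax([P[a] @ V for a in range(len(P))], axis=0)
```

Policy iteration would alternate `evaluate_policy` and `greedy_policy` until the policy stops changing, whereas value iteration just repeats the backup in `value_iteration`.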
What I want to do for most of today is talk about continuous states. So for example, if you want to control any number of real systems, for example if you want to control a car, a car's position is given by x, y, θ: its position and orientation. And if you want to model the velocity as well, then also ẋ, ẏ, θ̇; so it depends on whether you want to model just the kinematics, so just position, or whether you want to model the dynamics, meaning the velocity as well.

Earlier I showed you a video of a helicopter that was flying, using a reinforcement learning algorithm. For the helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state will be given by the x, y, z position, and by φ, θ, ψ, which are sometimes used to denote the roll, pitch and yaw of the helicopter, so just orientation; and if you want to control a helicopter, you pretty much have to model velocity as well, which means both linear velocity as well as angular velocity, and so this would be a 12-dimensional state.

If you want an example that is kind of fun but unusual, and I'm just gonna use this as an example, and actually use this example a bit in today's lecture, it is the inverted pendulum problem, which is sort of a long-running classic in reinforcement learning, in which you imagine that you have a little cart that's on a rail. The rail ends at some point, and you imagine that you have a pole attached to the cart, and this is a free hinge, so the pole here can rotate freely, and your goal is to control the cart and to move it back and forth on this rail so as to keep the pole balanced.
Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here?

Student: Back in the corner.

Instructor (Andrew Ng): Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine this is the cart just moving back and forth to try to keep the pole balanced; so you can actually use a reinforcement learning algorithm to do that. This is actually one of the longstanding classic problems that people [inaudible] implement and play with using reinforcement learning algorithms. And so for this, the states would be x and θ, so x would be the position of the cart and θ would be the orientation of the pole, and also the linear velocity and the angular velocity of the pole, so (x, ẋ, θ, θ̇). I'll actually use this example a couple of times.

So, returning to continuous state spaces: how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or something like the inverted pendulum? One thing you can do, and this is maybe the most straightforward thing, is the following. If you have, say, a two-dimensional continuous state space, where s1 and s2 are my state variables (and in all the examples these are, I guess, between 4-dimensional and 12-dimensional; I'll just draw 2-D here), the most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. And I use s̄ to denote the discretized, or discrete, states, and so you can approximate this continuous state problem with a finite or discrete set of states, and then you can use policy iteration or value iteration to solve for V(s̄).
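As a rough sketch of what that discretization looks like in code (again my own illustration, with made-up bounds and bucket counts): chop each state variable into k buckets, and let a table indexed by the grid cell play the role of V(s̄) and π(s̄).

```python
import numpy as np

# Hypothetical bounds for a 2-D state (s1, s2) and number of buckets per dimension.
S_MIN = np.array([-1.0, -1.0])
S_MAX = np.array([ 1.0,  1.0])
K = 10                                       # k buckets per state variable -> k^2 cells

def discretize(s):
    """Map a continuous state s to the index of its grid cell s_bar."""
    frac = (np.asarray(s) - S_MIN) / (S_MAX - S_MIN)      # rescale to [0, 1]
    idx = np.clip((frac * K).astype(int), 0, K - 1)       # bucket index per dimension
    return tuple(idx)

# V_bar[s_bar] and pi_bar[s_bar] would be filled in by running ordinary
# (discrete) value iteration or policy iteration over the k^2 cells.
V_bar = np.zeros((K, K))
pi_bar = np.zeros((K, K), dtype=int)

s = np.array([0.23, -0.71])            # the robot's actual continuous state
a = pi_bar[discretize(s)]              # execute the action the discrete policy assigns to that cell
```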
And if your robot is then in some state, given by that dot, you would then figure out what discretized state it is in; in this case, it's in this discretized grid cell, which is called s̄, and then you execute, you choose, the action given by the policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem.

Sometimes you can sort of make this work, but there are a couple of reasons why this does not work very well. One reason is the following, and for this picture, let's even temporarily put aside reinforcement learning; let's just think about doing regression for now. So suppose you have some input variable x, and suppose I have some data, and I want to fit a function, y as a function of x. Discretization is saying that I'm going to take my horizontal x-axis and chop it up into a number of intervals; sometimes I call these intervals buckets as well. We chop my horizontal x-axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of these buckets. And just look at this: this is clearly not a very good representation, right? Whereas when we talk about regression, you just choose some features of x and run linear regression or something, and you get a much better fit to the function. So there's the sense that discretization, this sort of piecewise constant function, just isn't a very good way of representing many things, and there's also the sense that there's no smoothing, there's no generalization, across the different buckets. And in fact, back in regression, I would never have chosen to do regression using this sort of discretization; it just really doesn't make sense.
And so in the same way, with the state here instead of x, and the value function instead of some hypothesis function of x, if you're trying to approximate the value function, then you can get discretization to work for many problems, but maybe this isn't the best representation for a value function.

The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. And this is just the observation that if the state space is in R^n, and if you discretize each state variable into k buckets, so if you discretize each variable into k discrete values, then you get on the order of k to the power of n discrete states. In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe like 100 to the power of 12, which is just huge, and it's not feasible. And so discretization doesn't scale well at all to problems with high-dimensional state spaces.

And this observation actually applies more generally than to just robotics and continuous state problems. For example, another fairly well-known application of reinforcement learning has been to factory automation. If you imagine that you have 20 machines sitting in a factory, and the machines lie in an assembly line and they all do something to a part on the assembly line and then route the part onto a different machine, you may want to use reinforcement learning algorithms to choose the order in which the different machines operate on the different things that are flowing through your assembly line, and maybe different machines can do different things.
So if you have N machines and each machine can be in k states, then if you do this sort of discretization, the total number of states would be k to the N as well: if you have N machines and each machine can be in k states, then again you get this huge number of states.

Another well-known example would be a board game; say you want to use reinforcement learning to play chess. Then if you have N pieces in your board game, you have N pieces on the chessboard, and if each piece can be in k positions, then this is again sort of the curse-of-dimensionality thing, where the number of discrete states you end up with grows exponentially with the number of pieces in your board game.

So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are [inaudible] that you need to fiddle around with the discretization and do things like non-uniform grids. So, for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1, because I think the value function is much more sensitive to the value of state variable s2 than to s1. And so as you get into higher dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on.

But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it works fine.
With 4-dimensional problems, you can probably get it to work, but it'd be slightly challenging, and sometimes, by fooling around and being clever, you can often push discretization up to, let's say, about 6-dimensional problems, but with some difficulty; and problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just rough folk wisdom on the order of magnitude of problems you might think about using discretization for.

But what I want to spend most of today talking about are [inaudible] methods that often work much better than discretization, and in which we will approximate V* directly, without resorting to these sorts of discretizations.

Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup. For today's lecture, I'm going to focus on the problem of continuous states, and just to keep things very simple in this lecture, I won't deal with continuous actions; so I'm gonna assume discrete actions A. And it turns out, this is actually a critical fact, that for many problems the state space is much larger than the space of actions. That just seems to have worked out that way for many problems. So, for example, for driving a car the state space is 6-dimensional, so x, y, θ, ẋ, ẏ, θ̇, whereas for your actions you still have two action dimensions: you have forward and backward motion, and steering the car. So you have 6-D states and 2-D actions, and so you can discretize the actions much more easily than you can discretize the states. The other example: for a helicopter, you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize the continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action.
Whether you accelerate your cart to the left or the right is a one-D action. And so for the rest of today, I'm gonna assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems, but for many problems, large action spaces are just less of a difficulty than large state spaces.

So I'm going to assume that we have a model, or simulator, of the MDP, and this is really an assumption on how the state transition probabilities are represented. I'm gonna assume, and I'm going to use the terms model and simulator pretty much synonymously, so specifically what I'm going to assume is that we have a black box, a piece of code, such that I can input any state, input an action, and it will output s', a sample from the state transition distribution P_sa. So this is really my assumption on the representation I have for the state transition probabilities: I'll assume I have a black box that takes as input the state and action and outputs the next state.

And so there are a few fairly common ways to get models of the different MDPs you may work with. One is that you might get a model from a physics simulator. So, for example, if you're interested in controlling that inverted pendulum, your action is a, which is the magnitude of the force you exert on the cart to push it left or right, and your state is (x, ẋ, θ, θ̇). I'm just gonna write it in that order.
And so I'm gonna write down a bunch of equations just for completeness, but most of what I'm gonna write below here is a bit gratuitous; the point is just that if you flip open a textbook on physics, a textbook on mechanics, you can work out the equations of motion of a physical device like this. So you find that ṡ, where the dot denotes the derivative of the state with respect to time, is given by some expression in terms of ẋ, θ, θ̇ and the action a, with terms involving the length of the pole and cosines of θ and so on; here a, the force you exert on the cart, is the action, l is the length of the pole, m is the total mass of the system, and so on. So all these equations, I'm just writing them down for completeness, but by flipping open a physics textbook you can work out these equations of motion yourself, and this then gives you a model which says that s_{t+1}, your state one time step later, will be equal to your previous state plus Δt times ṡ; so in your simulator, or in my model, this is what happens to the cart every, say, tenth of a second, so Δt would be a tenth of a second, and then s_{t+1} = s_t + Δt · ṡ_t.

And so that'd be one way to come up with a model of your MDP. And in this specific example, I've actually written down a deterministic model, and by deterministic I mean that given an action and a state, the next state is not random. So this would be an example of a deterministic model, where I can compute the next state exactly as a function of the previous state and the previous action; or, put another way, it's a deterministic model because all the probability mass is on a single state given the previous state and action. You can also make this a stochastic model.
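Here is a minimal sketch of what such a physics-based simulator looks like as a piece of code; the dynamics function below is a made-up placeholder standing in for whatever equations of motion you work out from the mechanics textbook, not the actual cart-pole equations:

```python
import numpy as np

DT = 0.1   # time step Delta-t, e.g. a tenth of a second

def s_dot(s, a):
    """Placeholder for the textbook equations of motion: returns the time
    derivative of the state (x, x_dot, theta, theta_dot) given the force a."""
    x, x_dot, theta, theta_dot = s
    # The real right-hand side would involve the pole length, the masses,
    # cos(theta), sin(theta), etc.; this is made-up dynamics for illustration only.
    return np.array([x_dot, a, theta_dot, -theta])

def simulate_step(s, a, noise_scale=0.0):
    """Deterministic model: s_{t+1} = s_t + Delta-t * s_dot(s_t, a_t).
    Passing noise_scale > 0 turns it into a stochastic model."""
    s_next = s + DT * s_dot(s, a)
    if noise_scale > 0:
        s_next = s_next + noise_scale * np.random.randn(*s.shape)
    return s_next
```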
A second way that is often used to obtain a model is to learn one. And so, just concretely, what you do is: you would imagine that you have a physical inverted pendulum system, that you physically own an inverted pendulum robot. What you would do is you would then initialize your inverted pendulum robot to some state and then execute some policy; it could be some random policy, or some policy that you think is pretty good, or you could even try controlling it yourself with a joystick or something. So you set the system off in some state s_0; then you take some action a_0, and again the action could be chosen by some policy, or chosen by you using a joystick trying to control your inverted pendulum, or whatever. The system would transition to some new state s_1, and then you take some new action a_1, and so on. Let's say you do this for T time steps, and sometimes I call this one trajectory; and you repeat this m times, so this is the first trial, the first trajectory, and then you do this again: initialize it in some state, and so on.

So you do this a bunch of times, and then you would run a learning algorithm to estimate s_{t+1} as a function of s_t and a_t. And for the sake of concreteness, you should just think of this as the inverted pendulum problem, so s_{t+1} is a 4-dimensional vector, s_t will be a 4-dimensional vector and a_t will be a real number, and so you might run linear regression 4 times, to predict each of the state variables as a function of each of these 5 real numbers, and so on.

Just for example, if you say that you want to estimate your next state s_{t+1} as a linear function of your previous state and your action, so s_{t+1} ≈ A·s_t + B·a_t, where A here will be a 4-by-4 matrix and B would be a 4-dimensional vector, then you would choose the values of A and B that minimize this.
757 00:42:07,489 --> 00:42:08,089 758 00:42:08,089 --> 00:42:15,089 759 00:42:18,229 --> 00:42:25,229 760 00:42:28,689 --> 00:42:31,169 761 00:42:31,169 --> 00:42:32,269 762 00:42:32,269 --> 00:42:35,220 So if you want your model to be that 763 00:42:35,220 --> 00:42:38,849 ST+1 is some linear function of the previous stated action, 764 00:42:38,849 --> 00:42:39,960 then you might 765 00:42:39,960 --> 00:42:41,950 pose this optimization objective 766 00:42:41,950 --> 00:42:45,130 and choose A and B to minimize the sum of squares error 767 00:42:45,130 --> 00:42:52,130 in your predictive value for ST+1 as the linear function of ST 768 00:42:52,489 --> 00:42:56,889 and AT. I 769 00:42:56,889 --> 00:42:58,119 should say that 770 00:42:58,119 --> 00:43:02,839 this is one specific example where you're using a linear function of the previous 771 00:43:02,839 --> 00:43:06,399 stated action to predict the next state. Of course, you can also use other algorithms 772 00:43:06,399 --> 00:43:08,469 like low [inaudible] weight to linear regression 773 00:43:08,469 --> 00:43:12,079 or linear regression with nonlinear features or kernel linear 774 00:43:12,079 --> 00:43:13,119 regression or whatever 775 00:43:13,119 --> 00:43:16,380 to predict the next state as a nonlinear function of the current state as well, so this 776 00:43:16,380 --> 00:43:23,380 is just [inaudible] linear problems. 777 00:43:25,469 --> 00:43:27,820 And it turns out that low [inaudible] weight to linear 778 00:43:27,820 --> 00:43:28,770 regression is 779 00:43:28,770 --> 00:43:31,990 for many robots turns out to be an effective method for this learning 780 00:43:31,990 --> 00:43:38,990 problem as well. 781 00:43:47,279 --> 00:43:51,919 And so having learned to model, having learned the parameters A 782 00:43:51,919 --> 00:43:52,680 and B, 783 00:43:52,680 --> 00:43:58,979 you then have a model where you say that ST+1 is AST 784 00:43:58,979 --> 00:44:05,359 plus BAT, and so that would be an example of a deterministic 785 00:44:05,359 --> 00:44:06,219 model 786 00:44:06,219 --> 00:44:07,949 787 00:44:07,949 --> 00:44:10,749 or having learned the parameters A and B, 788 00:44:10,749 --> 00:44:13,259 you might say that ST+1 is 789 00:44:13,259 --> 00:44:15,229 equal to AST + 790 00:44:15,229 --> 00:44:17,440 BAT 791 00:44:17,440 --> 00:44:24,039 792 00:44:24,039 --> 00:44:25,979 + 793 00:44:25,979 --> 00:44:28,249 ?T. 794 00:44:28,249 --> 00:44:29,919 And so these would be 795 00:44:29,919 --> 00:44:32,699 very reasonable ways to come up with either 796 00:44:32,699 --> 00:44:37,069 a deterministic or a stochastic model for your 797 00:44:37,069 --> 00:44:39,329 inverted pendulum MDP. 798 00:44:39,329 --> 00:44:41,519 And so 799 00:44:41,519 --> 00:44:43,549 just to summarize, 800 00:44:43,549 --> 00:44:48,239 what we have now is a model, meaning a piece of code, 801 00:44:48,239 --> 00:44:53,459 where you can input a state and an action 802 00:44:53,459 --> 00:44:55,079 and get an ST+1. 803 00:44:55,079 --> 00:44:57,969 And so if you have a stochastic model, 804 00:44:57,969 --> 00:45:02,419 then to influence this model, you would actually sample ?T from this [inaudible] 805 00:45:02,419 --> 00:45:03,039 distribution 806 00:45:03,039 --> 00:45:10,039 in order to generate ST+1. 
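Here is a minimal sketch of that model-fitting step (my own illustration; it assumes the trajectories have already been stacked into arrays, and it solves the least-squares problem above in one shot by regressing s_{t+1} on the stacked regressor [s_t, a_t]):

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Fit s_{t+1} ~= A s_t + B a_t by least squares.
    states:      (N, 4) array of s_t from all trajectories
    actions:     (N, 1) array of a_t
    next_states: (N, 4) array of s_{t+1}
    Returns A (4x4), B (4x1), and Sigma, the covariance of the residuals."""
    X = np.hstack([states, actions])                        # regressor [s_t, a_t], shape (N, 5)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)     # W is (5, 4); next_states ~= X @ W
    A, B = W[:4].T, W[4:].T                                 # A is (4, 4), B is (4, 1)
    residuals = next_states - X @ W
    Sigma = np.cov(residuals.T)                             # noise covariance for a stochastic model
    return A, B, Sigma

def simulate(A, B, Sigma, s, a, stochastic=True):
    """One step of the learned model: s_{t+1} = A s_t + B a_t (+ epsilon_t)."""
    s_next = A @ s + B @ np.atleast_1d(a)
    if stochastic:
        s_next = s_next + np.random.multivariate_normal(np.zeros(len(s)), Sigma)
    return s_next
```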
So, as a preview, I guess, of the next lecture: it actually turns out that in the specific case of linear dynamical systems, in the specific case where the next state is a linear function of the current state and action, there are very powerful algorithms you can use. I'm actually not gonna talk about that today; I'll talk about that in the next lecture rather than right now. But it turns out that for many problems, like the inverted pendulum, you'd use locally weighted linear regression or some other nonlinear learning algorithm, because many systems aren't really linear, and so you can build a nonlinear model.

So what I wanna do now is talk about, given a model, given a simulator for your MDP, how to come up with an algorithm to approximate the optimal value function V*. Before I move on, let me check if there are questions on this. Okay, cool.

So here's the idea. Back when we talked about linear regression, we said that given some input x in supervised learning, given the input features x, we may choose some features of x and then approximate the target variable as a linear function of various features of x; and so we'll do exactly the same thing to approximate the optimal value function. In particular, we'll choose some features φ(s) of a state s. And you could actually choose φ(s) = s; that would be one reasonable choice, if you want to approximate the value function as a linear function of the state. But you can also choose other things, so for example, for the inverted pendulum example, you may choose φ(s) to be equal to a vector of features that might be, say, 1, or x, or ẋ², maybe some cross terms, maybe x times θ̇, maybe θ̇², and so on.
So you choose some vector of features and then approximate the value function: the value of the state is equal to θ transpose times the features, V(s) = θᵀφ(s). And I should apologize in advance, I'm overloading notation here, which is unfortunate: I use θ both to denote the angle of the pole of the inverted pendulum, the angle θ, and also to denote the vector of parameters in my learning algorithm. So sorry about the overloaded notation.

Just like we did in linear regression, my goal is to come up with a linear combination of the features that gives me a good approximation to the value function. This is completely analogous to linear regression, where we estimated the response y as a linear function of the features of the input; that's what we have in linear regression.

Let me just write down value iteration again, and then I'll write down an approximation to value iteration. For discrete states, the idea behind value iteration was that V(s) gets updated as R(s) plus γ times the max over actions of the sum over s' of P_sa(s') V(s'). That was value iteration, and in the case of continuous states, the sum over states would be replaced by an integral over states. Let me just write this as R(s) plus γ times the max over a of an expectation: that sum over s' is really an expectation, with respect to the random state s' drawn from the state transition probabilities P_sa, of V(s'). So this sum over all states s', weighted by the probability of going to s', is really an expectation over the random state s' drawn from P_sa.
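Restating the two updates on the board in a cleaner form, this is just the standard value iteration backup and its continuous-state analogue, written in LaTeX:

```latex
% Discrete-state value iteration backup
V(s) := R(s) + \gamma \max_{a} \sum_{s'} P_{sa}(s')\, V(s')

% Continuous-state analogue: the sum over s' becomes an expectation (an integral over states)
V(s) := R(s) + \gamma \max_{a}\; \mathbb{E}_{s' \sim P_{sa}}\!\big[ V(s') \big]
```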
And so what I'll do now is write down an algorithm called fitted value iteration that is an approximation to this, but specifically for continuous states. I just wrote down the first two steps, and then I'll continue on the next board. So the first step of the algorithm is that we'll sample: choose some set of states at random. So sample s_1, s_2 through s_m randomly, choose a set of states randomly, and initialize my parameter vector θ to be equal to zero. This is analogous to value iteration, where I might initialize the value function to be the function that is zero everywhere. Then here's the main loop of the algorithm. Got quite a lot to write, actually. Let's see. And so that's the algorithm. Let me just adjust the writing; give me a minute to finish and then I'll step through this. Actually, if some of my handwriting is illegible, let me know.

So let me step through this and briefly explain the rationale. The heart of the algorithm is this. In the original value iteration algorithm, we would take the value for each state, V(s_i), and we would overwrite it with this expression here; in the original discrete value iteration algorithm, we would set V(s_i) to be equal to that. Now, in the continuous state case, we have an infinite, continuous set of states, and so you can't discretely set the value of each of these. So what we'll do instead is choose the parameters θ so that V(s_i) is as close as possible to this thing on the right-hand side, and this is what y_i turns out to be. So concretely, what I'm going to do is construct estimates of this term, and then choose the parameters of my function approximator.
I'm going to choose my parameters θ so that V(s_i) is as close as possible to those estimates; that's what the y_i are. Specifically, I'll choose the parameters θ to minimize the sum of squared differences between y_i and θᵀφ(s_i). This thing here is just V(s_i), because I'm approximating V(s_i) as a linear function of φ(s_i), and so I choose the parameters θ to minimize the sum of squared differences. So this last step is basically the approximation version of value iteration. Everything else above was just coming up with an approximation to this term, to this thing here, which I was calling y_i.

And so, consequently, for every state s_i we want to estimate what the thing on the right-hand side is, but there's an expectation there, an expectation over a continuous, maybe very high-dimensional, set of states, so I can't compute this expectation exactly. What I'll do instead is use my simulator to sample a set of states from this distribution P_{s_i a}, the state transition distribution over where I get to if I take the action a in the state s_i, and then I'll average over that sample of states to compute this expectation.

And so, stepping through the algorithm: for each state and for each action, I'm going to sample a set of states s'_1 through s'_k from that state transition distribution, still using the model, and then set q(a) equal to that average, so this is my estimate for R(s_i) + γ times that expected value for that specific action a. Then I'll take the maximum over actions a, and this gives me y_i. And finally, I'll run linear regression, which is that last step of the algorithm, to get θᵀφ(s_i), that is V(s_i), to be close to the y_i's.
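Putting the loop together, here is a rough Python sketch of fitted value iteration as just described. The helpers simulate(s, a) (the learned model), phi(s), R(s), and the finite list of actions are placeholders standing in for whatever your particular problem supplies, and the values of k and the iteration count are arbitrary.

```python
import numpy as np

def fitted_value_iteration(sample_states, actions, simulate, R, phi,
                           gamma=0.99, k=10, n_iters=50):
    """Sketch of fitted value iteration for a continuous-state MDP.

    sample_states: the m randomly chosen states s_1, ..., s_m
    actions:       a finite (discretized) set of actions
    simulate(s,a): model/simulator returning a sampled next state s'
    R(s):          reward function;  phi(s): feature vector
    Approximates V(s) ~= theta^T phi(s).
    """
    m = len(sample_states)
    Phi = np.array([phi(s) for s in sample_states])   # (m, n_features) design matrix
    theta = np.zeros(Phi.shape[1])                    # initialize theta := 0

    for _ in range(n_iters):
        y = np.zeros(m)
        for i, s in enumerate(sample_states):
            q = []
            for a in actions:
                # Sample k next states to approximate E_{s' ~ P_{sa}}[V(s')];
                # with a deterministic simulator, k = 1 is enough.
                v_next = [theta @ phi(simulate(s, a)) for _ in range(k)]
                q.append(R(s) + gamma * np.mean(v_next))
            y[i] = max(q)                             # y_i = max_a q(a)
        # Supervised step: choose theta so that theta^T phi(s_i) is close to y_i
        theta, _, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```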
And so this algorithm is called fitted value iteration, and it actually often works quite well for continuous problems, for problems with anywhere from 6- to 10- to 20-dimensional state spaces, if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Some of you didn't have your hands up. Are there questions, yeah?

Student: Is there a reward function in this setup?

Instructor (Andrew Ng): Oh, yes. An MDP comprises S, A, the state transition probabilities, γ, and R, and so for continuous state spaces, S would be something like R⁴ for the inverted pendulum. The actions are discretized, the state transition probabilities are specified by the model or the simulator, γ is just a real number like 0.99, and the reward function is usually a function that's given to you. So the reward function is some function of your 4-dimensional state. For example, just as a simple reward function, you might choose the reward to be -1 if the pole has fallen; it depends how you define it, but say you choose the reward to be -1 if the inverted pendulum falls over, where you define that as its angle being greater than 30° or so, and zero otherwise. So that would be an example of a reward function you could choose for the inverted pendulum, but yes, we're assuming a reward function is given to you, so that you can compute R(s_i) for any state (see the short sketch after this exchange).

Are there other questions? Actually, let me try asking a question. Everything I did here assumed that we have a stochastic simulator. It turns out I can simplify this algorithm if I have a deterministic simulator, where a deterministic simulator means that, given a state and action, my next state is always exactly determined.
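Picking up the reward example from a moment ago before continuing: the simple inverted pendulum reward just described could look like the following, where the 30° threshold and the (x, ẋ, θ, θ̇) state layout are only the illustrative values from the discussion.

```python
import numpy as np

def R(s):
    """Example reward: -1 if the pole has fallen over (angle more than
    30 degrees from vertical), and 0 otherwise."""
    x, x_dot, theta, theta_dot = s
    return -1.0 if abs(theta) > np.deg2rad(30.0) else 0.0
```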
So let me ask you: if I have a deterministic simulator, how would I change this algorithm? How would I simplify it?

Student: Lower the number of samples that you're drawing.

Instructor (Andrew Ng): Right, Justin's got it right. If I have a deterministic simulator, all my samples would be exactly the same, and so I can set k equal to 1; I don't need to draw k different samples. I really only need one sample if I have a deterministic simulator, so you can simplify this by setting k = 1 if you have a deterministic simulator. Yeah?

Student: I guess I'm really confused. We sort of turned this into something that looks like linear regression, the θ transpose times something that we're used to, and I don't really know what question to ask, but when we did this before we had discrete states and everything, and we were concerned with finding the optimal policy, and it seems like we haven't said the word policy in a while, so it's kind of confusing.

Instructor (Andrew Ng): Okay, yeah, I'll come back to policies, but maybe I should say a couple of words now, so let me try to get at some of what you're asking. Our strategy for finding an optimal policy has been to first find V*, or some approximation of it, and from that derive an approximation of π*. So far everything I've been doing has been focused on how to find V*.

I just want to say one more word about features. It turns out that for linear regression it's often not terribly difficult to choose a reasonable set of features, whereas choosing features for approximating the value function is often somewhat harder, because the value of a state is how good it is to start off in that state: what is my expected sum of discounted rewards if I start in a certain state?
And so what the features of the state have to measure is really how good it is to start in a certain state. For the inverted pendulum, states where the pole is vertical and the cart is centered on the track are maybe better, and so you can come up with features that measure the orientation of the pole and how close you are to the center of the track and so on, and those will be reasonable features to use to approximate V*. Although, in general, it is true that choosing features for value function approximation is often slightly trickier than choosing good features for linear regression.

Okay, and then to Justin's question: given V*, how do you go back to actually find a policy? In the discrete case, we have that π*(s) is equal to the argmax over a of the expected value of V*(s'), where s' is drawn from P_sa. Again, I used to write this as a sum over states; I'll just write it as an expectation. And so once you find the optimal value function V*, you can then find the optimal policy π* by computing that argmax.

Now, if you're in a continuous state MDP, then you can't actually do this in advance for every single state, because there's an infinite number of states, so you can't perform this computation in advance for every single state. What you do instead is wait until your system is actually in some specific state s, like your car being at some particular position and orientation, or your inverted pendulum being posed at some specific angle θ.
It's only when your system, be it a factory or a board game or a robot, is in some specific state s that you then go ahead and compute this argmax; it's only when you're in some state s that you compute this argmax, and then you execute that action a. As a result of your action, your robot transitions to some new state, and it is then given that specific new state that you compute the argmax again, using the specific state you're in.

There are a few ways to do it. One way is actually the same as the inner loop of the fitted value iteration algorithm: because this is an expectation over a large number of states, you'd sample a set of states from the simulator and then approximate the expectation using an average over your samples, just as in the inner loop of fitted value iteration. So you could do that; that's sometimes done. But it can also be a pain to have to sample a set of states to approximate those expectations every time you want to take an action in your MDP.

There are a couple of special cases where this can be done more cheaply. One special case is if you have a deterministic simulator. If it's a deterministic simulator, so in other words your simulator is just some function f, which could be linear or nonlinear, then the next state s_{t+1} is just some function of your previous state and action. If that's the case, then this expectation simplifies: the argmax becomes the argmax over a of V*(f(s, a)), because this is really saying s' = f(s, a). I'm switching back and forth between notation, I hope that's okay: s to denote the current state and s' the next state, versus s_t and s_{t+1} for the current and next state. Both of these are fairly standard notation, so don't mind my switching back and forth between them.
But if it's a deterministic simulator, you can then just compute what the next state s' would be for each action you might take from the current state, and then take the argmax over actions a; basically, choose the action that gets you to the highest-value state. And so that's one case where you can compute the argmax, where you can compute that expectation, without needing to sample and average over a set of samples.

Another very common case is if you have a stochastic simulator, but your simulator happens to take on the very specific form s_{t+1} = f(s_t, a_t) + ε_t, where ε_t is Gaussian noise. This is a very common way to build simulators, where you model the next state as a function of the current state and action plus some noise; one specific example would be the sort of linear dynamical system we talked about, with the next state a linear function of the current state and action plus Gaussian noise. In this case, you can approximate the argmax over a as follows: take the expectation you're trying to approximate, the expected value of V* of s', and replace it with V* evaluated at the expected value of s'. This is an approximation; the expected value of a function is usually not equal to the function of the expectation, but it is often a reasonable approximation, and so that would be another way to approximate that expectation. And so you choose the actions according to the same formula as I wrote just now. And so this would be a way of approximating this argmax, essentially ignoring the noise in the simulator.
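Here is a sketch of the two ways of computing that argmax online: the general version that samples from the simulator to approximate the expectation, and the shortcut that ignores the zero-mean Gaussian noise and evaluates V* at the mean next state f(s, a). The names f, simulate, phi, theta, and the action list are assumptions carried over from the earlier sketches.

```python
import numpy as np

def act_by_sampling(s, actions, simulate, theta, phi, k=100):
    """pi(s) ~= argmax_a (1/k) sum_j theta^T phi(s'_j), with s'_j ~ P_sa
    sampled from a stochastic simulator."""
    def q(a):
        return np.mean([theta @ phi(simulate(s, a)) for _ in range(k)])
    return max(actions, key=q)

def act_ignoring_noise(s, actions, f, theta, phi):
    """If s_{t+1} = f(s, a) + eps_t with zero-mean Gaussian noise (or the simulator
    is deterministic), approximate E[V*(s')] by V*(f(s, a)) = theta^T phi(f(s, a))."""
    return max(actions, key=lambda a: theta @ phi(f(s, a)))
```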
And this often works pretty well, just because many simulators turn out to be of the form of some linear or nonlinear function plus zero-mean Gaussian noise, and so you can ignore the zero-mean Gaussian noise and compute this quickly. And just to complete this: that quantity, V* of f(s, a), you would write as θᵀφ(s') where s' = f(s, a), and this V* you would compute using the parameters θ that you just learned with the fitted value iteration algorithm.

Questions about this?

Student: For a real-time application, is it possible to sample like that, for example for [inaudible]?

Instructor (Andrew Ng): Yes, the question is whether, for real-time applications, it's possible to sample k states and use them to approximate the expectation. Computers today are actually amazingly fast; I'm often surprised by how much you can do in real time. Are we actually flying our helicopter using an algorithm different from this? I can't say for certain here, but my intuition is that you could actually do this with a helicopter. You would control a helicopter at somewhere between 10 Hz and 50 Hz, so you need to do this 10 to 50 times a second, and that's actually plenty of time to sample 1,000 states and compute this expectation. Helicopters are difficult because they're mission critical: if you do something wrong, you can do serious damage. And so, maybe not for good reasons, we've tended to avoid tossing coins while we're in the air; the idea of letting our actions come out of some random process is slightly scary, and we just tend not to do that.
I should say that's probably not a great reason, because when you average a large number of samples it works out fine, but as a maybe overly conservative design choice we tend not to run anything randomized onboard, which is probably being overly conservative. It's a choice we made because other things felt slightly safer. I think you can actually often do this, so long as the model can be evaluated fast enough that you can sample 100 or 1,000 state transitions and do that at 10 Hz. That said, this isn't always attainable, which is why we often use the other approximations that don't require drawing a large sample. Anything else? No? Okay, cool. So now you know one algorithm for reinforcement learning on continuous state spaces. Next time we'll pick up with some more ideas on even more powerful algorithms for solving MDPs with continuous state spaces. Thanks, let's close for today.