00:00:11,420 --> 00:00:12,819
This subtitle is provided from lecsub.jimdo.com under Creative Commons License.

2
00:00:12,819 --> 00:00:16,099
This presentation is delivered by the Stanford Center for Professional 

3
00:00:16,099 --> 00:00:23,099
Development. 

4
00:00:25,879 --> 00:00:28,419
Okay. Welcome back. What I 

5
00:00:28,419 --> 00:00:31,449
want to do today is talk about 

6
00:00:31,449 --> 00:00:34,570
one of my favorite algorithms for controlling NVPs that I think is 

7
00:00:34,570 --> 00:00:35,870


8
00:00:35,870 --> 00:00:39,320
one of the more elegant and efficient and powerful algorithms that 

9
00:00:39,320 --> 00:00:41,070
I know of. 

10
00:00:41,070 --> 00:00:43,020
So 

11
00:00:43,020 --> 00:00:46,930
what I'll do is I'll first start by talking about a couple 

12
00:00:46,930 --> 00:00:48,640
variations of NVPs 

13
00:00:48,640 --> 00:00:52,320
that are slightly different from the NVP definition you've seen so far. These 

14
00:00:52,320 --> 00:00:54,420
are pretty 

15
00:00:54,420 --> 00:00:58,870
common variations. One is state action rewards, 

16
00:00:58,870 --> 00:01:03,830
and the other is horizon NVPs. Using this semi-modified definition of an 

17
00:01:03,830 --> 00:01:06,540
NVP, I'll talk about linear dynamical systems. I'll spend a little bit 

18
00:01:06,540 --> 00:01:10,410
of time talking about models within dynamical systems, and then talk about LQR, or 

19
00:01:10,410 --> 00:01:13,240
linear quadratic regulation control, 

20
00:01:13,240 --> 00:01:18,600
which will lead us to some kind of [inaudible] equation, which is 

21
00:01:18,600 --> 00:01:23,100
something we will solve in order to do LQR 

22
00:01:23,100 --> 00:01:24,460
controls. 

23
00:01:24,460 --> 00:01:25,660


24
00:01:25,660 --> 00:01:27,400
So 

25
00:01:27,400 --> 00:01:28,650
just 

26
00:01:28,650 --> 00:01:33,020
to recap, and we've seen this definition many times now. 

27
00:01:33,020 --> 00:01:36,840
We've been defining an NVP 

28
00:01:36,840 --> 00:01:41,960
as 

29
00:01:41,960 --> 00:01:46,969
[inaudible] states actions, states improbabilities, [inaudible] reward function 

30
00:01:46,969 --> 00:01:49,869
where - gamma's 

31
00:01:49,869 --> 00:01:52,980
the 

32
00:01:52,980 --> 00:01:56,680
discount factors, a number between zero and one. 

33
00:01:56,680 --> 00:02:00,380
And R, the reward function, 

34
00:02:00,380 --> 00:02:03,909
was the function mapping from the states, the rewards - 

35
00:02:03,909 --> 00:02:07,000
was the function mapping from the states, the real numbers. 

36
00:02:07,000 --> 00:02:08,220
So we had 

37
00:02:08,220 --> 00:02:10,489
value 

38
00:02:10,489 --> 00:02:13,759
iteration, which would do this. 

39
00:02:13,759 --> 00:02:20,759


40
00:02:21,899 --> 00:02:26,429
So after a while, the value of the iteration will cause V to convert to V star. 

41
00:02:26,429 --> 00:02:27,789
Then 

42
00:02:27,789 --> 00:02:31,209
having found the optimal value function, if you 

43
00:02:31,209 --> 00:02:32,589
compute the optimal policy 

44
00:02:32,589 --> 00:02:36,990
by taking essentially [inaudible] of this equation above. Augments of A, 

45
00:02:36,990 --> 00:02:43,990
of that [inaudible]. So 

46
00:02:46,180 --> 00:02:48,180
in 

47
00:02:48,180 --> 00:02:50,719
value iteration, 

48
00:02:50,719 --> 00:02:54,599
as you iterate of this - you know, perform this update, 

49
00:02:54,599 --> 00:02:56,179
the function V will 

50
00:02:56,179 --> 00:03:00,370
[inaudible] convert to V star. So there won't be - 

51
00:03:00,370 --> 00:03:04,490
so without defining the number of iterations, you get closer and closer to V star. 

52
00:03:04,490 --> 00:03:08,400
This actually converge exponentially quickly to V 

53
00:03:08,400 --> 00:03:12,839
star. We will never exactly convert to V star and define the number of iterations. 

54
00:03:12,839 --> 00:03:14,799


55
00:03:14,799 --> 00:03:19,629
So what I want to do now is describe a couple of common variations of 

56
00:03:19,629 --> 00:03:20,139
NVPs 

57
00:03:20,139 --> 00:03:24,139
that we slightly different definitions of. Firs the reward 

58
00:03:24,139 --> 00:03:27,799
function and then second, we'll do something slightly different 

59
00:03:27,799 --> 00:03:29,389
from just 

60
00:03:29,389 --> 00:03:30,209


61
00:03:30,209 --> 00:03:31,699


62
00:03:31,699 --> 00:03:34,659
counting. Then remember in the last lecture, I said that 

63
00:03:34,659 --> 00:03:37,419
for infinite state of continuously in NVPs, 

64
00:03:37,419 --> 00:03:41,169
we couldn't apply the most straightforward version of value iteration 

65
00:03:41,169 --> 00:03:43,669
because if you have a continuous state 

66
00:03:43,669 --> 00:03:45,709
NVP, we need to use 

67
00:03:45,709 --> 00:03:48,289
some approximations of the optimal value function. 

68
00:03:48,289 --> 00:03:51,629
The [inaudible] later in this lecture, 

69
00:03:51,629 --> 00:03:55,510
I'll talk about a special case of NVPs, where you can actually represent the 

70
00:03:55,510 --> 00:03:57,579
value function exactly, 

71
00:03:57,579 --> 00:04:01,029
even if you have an infinite-state space or even if you have a continuous-state 

72
00:04:01,029 --> 00:04:02,000
space. 

73
00:04:02,000 --> 00:04:06,090
I'll actually do that, talk about these special constants of infinite-state 

74
00:04:06,090 --> 00:04:06,779
NVPs, 

75
00:04:06,779 --> 00:04:11,319
using this new variation of the reward function and the alternative to 

76
00:04:11,319 --> 00:04:15,469
just counting, so start to make the formulation a little easier. 

77
00:04:15,469 --> 00:04:16,609
So the first 

78
00:04:16,609 --> 00:04:20,850
variation I want to talk about is 

79
00:04:20,850 --> 00:04:25,380
selection 

80
00:04:25,380 --> 00:04:28,720
rewards. So I'm going to change the definition of the reward function. If this turns out, it 

81
00:04:28,720 --> 00:04:30,869
won't be 

82
00:04:30,869 --> 00:04:33,409
a huge deal. In particular, I change reward function 

83
00:04:33,409 --> 00:04:36,260
to be a function mapping from 

84
00:04:36,260 --> 00:04:38,449
a state action pair 

85
00:04:38,449 --> 00:04:41,130
to the real numbers. 

86
00:04:41,130 --> 00:04:44,509
What I mean by this is just the following. 

87
00:04:44,509 --> 00:04:46,599
You sell off in 

88
00:04:46,599 --> 00:04:48,849
some state in zero. You 

89
00:04:48,849 --> 00:04:52,830
take an action A zero as a result of your state of action choice. You transition 

90
00:04:52,830 --> 00:04:54,709
to some new state, S1. 

91
00:04:54,709 --> 00:04:59,519
You take some action, A1. You transition to some new state, S2. You take some 

92
00:04:59,519 --> 00:05:05,009
action, A2, and so on. So this is a [inaudible] state action sequence that you see. 

93
00:05:05,009 --> 00:05:08,490
So in an MPP where you have a state action reward, 

94
00:05:08,490 --> 00:05:12,879
your total payoff is now defined as 

95
00:05:12,879 --> 00:05:14,259
this, 

96
00:05:14,259 --> 00:05:19,349
where your reward function is now a function both of the current state and of the 

97
00:05:19,349 --> 00:05:20,020
action 

98
00:05:20,020 --> 00:05:22,220
you took in the current state. 

99
00:05:22,220 --> 00:05:27,039
So this is my 

100
00:05:27,039 --> 00:05:28,539
total payoff. 

101
00:05:28,539 --> 00:05:29,199


102
00:05:29,199 --> 00:05:32,240
Then as usual, my goal will be to find a policy - 

103
00:05:32,240 --> 00:05:35,039
to find the function mapping from the state's actions, 

104
00:05:35,039 --> 00:05:37,699
so that when I execute that policy, 

105
00:05:37,699 --> 00:05:40,320
I can maximize the expected value 

106
00:05:40,320 --> 00:05:42,030
of my total payoff. 

107
00:05:42,030 --> 00:05:46,600
So this definition, it actually turns out that given an NVP with 

108
00:05:46,600 --> 00:05:47,889
state action rewards, 

109
00:05:47,889 --> 00:05:49,339
you can actually - 

110
00:05:49,339 --> 00:05:52,949
so by [inaudible] with the definitions of the states, you can actually reduce this back to an NVP 

111
00:05:52,949 --> 00:05:56,639
with only rewards that function in the states. That may or may 

112
00:05:56,639 --> 00:05:58,419
not be [inaudible]. Don't worry if 

113
00:05:58,419 --> 00:06:00,800
it isn't. But 

114
00:06:00,800 --> 00:06:04,729
using state action rewards allows you to more directly model 

115
00:06:04,729 --> 00:06:05,569


116
00:06:05,569 --> 00:06:09,050
problems in which different actions, we have different costs. So 

117
00:06:09,050 --> 00:06:13,090
a running example is the robot. So [inaudible] a robot, 

118
00:06:13,090 --> 00:06:17,080
and it's more costly for the robot to move than for it to stay still. 

119
00:06:17,080 --> 00:06:20,930
If you give an action to stay still, and the action to stay still may 

120
00:06:20,930 --> 00:06:25,349
have a lower cost because you're not using a battery power [inaudible] recharge it for that 

121
00:06:25,349 --> 00:06:26,020
action. 

122
00:06:26,020 --> 00:06:29,820
Another example would be - 

123
00:06:29,820 --> 00:06:33,059
actually, another navigation example 

124
00:06:33,059 --> 00:06:37,429
would be if you have an outdoor vehicle. You need to drive 

125
00:06:37,429 --> 00:06:38,599
over 

126
00:06:38,599 --> 00:06:42,169
some sort of outdoor terrain, like very rough rocks 

127
00:06:42,169 --> 00:06:45,679
or driving over grass. It may be costly, more difficult, 

128
00:06:45,679 --> 00:06:48,960
than driving over, say, a paved road. So you may assign 

129
00:06:48,960 --> 00:06:51,349
an action that requires 

130
00:06:51,349 --> 00:06:56,150
driving over grass or driving over rocks to be 

131
00:06:56,150 --> 00:07:00,289
more costly than driving over paved 

132
00:07:00,289 --> 00:07:04,809
road. 

133
00:07:04,809 --> 00:07:07,830
So this really isn't a huge change to the definition of 

134
00:07:07,830 --> 00:07:14,830
an NVP. I won't really bother to justify this a lot, but [inaudible] 

135
00:07:15,090 --> 00:07:16,779
equations 

136
00:07:16,779 --> 00:07:19,610
is generalizing 

137
00:07:19,610 --> 00:07:24,279
the way that you probably expect it. V star of S is now equal to 

138
00:07:24,279 --> 00:07:26,650


139
00:07:26,650 --> 00:07:33,650
that. 

140
00:07:37,419 --> 00:07:40,649
So previously, when the reward function was just a function of the state, S, we could 

141
00:07:40,649 --> 00:07:43,729
take the max and push it in here. 

142
00:07:43,729 --> 00:07:44,530
But now that 

143
00:07:44,530 --> 00:07:46,069


144
00:07:46,069 --> 00:07:49,779
the rewards is a function of the action you're taking as well, the max comes outside. So 

145
00:07:49,779 --> 00:07:50,869
this says that 

146
00:07:50,869 --> 00:07:57,119
your expected total payoff, starting from the state, as [inaudible] policy, is equal to 

147
00:07:57,119 --> 00:08:01,519
first your immediate reward, RFSA, for executing some action, A, in state S. Then 

148
00:08:01,519 --> 00:08:02,219


149
00:08:02,219 --> 00:08:07,369
plus gamma times your future expected total payoff. 

150
00:08:07,369 --> 00:08:08,839
So 

151
00:08:08,839 --> 00:08:11,330
this is your expected total payoff if you 

152
00:08:11,330 --> 00:08:14,919
take the action, A, from the current state. So 

153
00:08:14,919 --> 00:08:18,879
while these [inaudible] optimal value functions. So your actually optimal expected total payoff 

154
00:08:18,879 --> 00:08:21,539
is the max of all actions of this 

155
00:08:21,539 --> 00:08:26,049
thing on 

156
00:08:26,049 --> 00:08:30,449
the right. Let's see. Value iteration, which I'm abbreviating VI, 

157
00:08:30,449 --> 00:08:33,720
is really the same algorithm. B of 

158
00:08:33,720 --> 00:08:38,060
S is updated as 

159
00:08:38,060 --> 00:08:44,630
max over A, RFSA, same thing. Just [inaudible] on the right-hand side of those equations 

160
00:08:44,630 --> 00:08:45,900


161
00:08:45,900 --> 00:08:49,880
be updating V of S using [inaudible] equations. Again, you get value iteration, 

162
00:08:49,880 --> 00:08:53,370
exactly the same way. 

163
00:08:53,370 --> 00:08:55,520
Then finally, 

164
00:08:55,520 --> 00:08:56,950


165
00:08:56,950 --> 00:09:00,160
having found the optimal value function, V star, 

166
00:09:00,160 --> 00:09:03,180
using the value iteration algorithm, 

167
00:09:03,180 --> 00:09:07,650
you can then compute the optimal policy, pi star of S as same as 

168
00:09:07,650 --> 00:09:12,250
before. The best action to take in the state, S, is the 

169
00:09:12,250 --> 00:09:19,250
action, A, that maximizes the thing on the righthand 

170
00:09:28,360 --> 00:09:32,480
side. So having used value iteration 

171
00:09:32,480 --> 00:09:35,330
to compute the optimal value function, you can then find 

172
00:09:35,330 --> 00:09:40,210
pi star using that. 

173
00:09:40,210 --> 00:09:41,890


174
00:09:41,890 --> 00:09:45,670
So this was the easier of the two 

175
00:09:45,670 --> 00:09:47,920
variations of NVPs [inaudible] 

176
00:09:47,920 --> 00:09:48,910
so far. 

177
00:09:48,910 --> 00:09:55,910
Any questions? Actually, are there questions about this? 

178
00:10:01,090 --> 00:10:02,810
So 

179
00:10:02,810 --> 00:10:06,660
the other variation, the other alternative definition, 

180
00:10:06,660 --> 00:10:07,600


181
00:10:07,600 --> 00:10:10,520
will be 

182
00:10:10,520 --> 00:10:16,220
finite horizon 

183
00:10:16,220 --> 00:10:19,010
NVPs. So 

184
00:10:19,010 --> 00:10:21,540
finite horizon NVP 

185
00:10:21,540 --> 00:10:24,780
comprises of the [inaudible] SA [inaudible] 

186
00:10:24,780 --> 00:10:28,110
transition, probably 

187
00:10:28,110 --> 00:10:30,650
with these, and the parameter T 

188
00:10:30,650 --> 00:10:32,180
and the reward function. 

189
00:10:32,180 --> 00:10:36,250
Here, 

190
00:10:36,250 --> 00:10:37,360
T 

191
00:10:37,360 --> 00:10:40,520
is a parameter called the horizon time. 

192
00:10:40,520 --> 00:10:43,050
Concretely, 

193
00:10:43,050 --> 00:10:45,920
what this really means is that we'll 

194
00:10:45,920 --> 00:10:48,330
be taking actions in the NVP 

195
00:10:48,330 --> 00:10:55,330
only for a total of capital T times this. So we won't use this counting anymore. [Audio cuts out] Instructor (Andrew 

196
00:10:56,180 --> 00:11:00,430


197
00:11:00,430 --> 00:11:04,350
Ng):[Inaudible] zero, take action A0. Get to some other state S1, take 

198
00:11:04,350 --> 00:11:06,670
action A1 and so on. 

199
00:11:06,670 --> 00:11:10,100
Eventually, you get to some state, STAT 

200
00:11:10,100 --> 00:11:12,180
after T times [inaudible]. So 

201
00:11:12,180 --> 00:11:16,760
my total payoff, now, 

202
00:11:16,760 --> 00:11:18,180
will be 

203
00:11:18,180 --> 00:11:19,350
given 

204
00:11:19,350 --> 00:11:21,900
by this 

205
00:11:21,900 --> 00:11:26,180
sum from time zero up to time T of my sum over rewards. 

206
00:11:26,180 --> 00:11:27,740


207
00:11:27,740 --> 00:11:31,870
Okay? My goal, as usual 

208
00:11:31,870 --> 00:11:34,030
- so this is my total payoff. 

209
00:11:34,030 --> 00:11:37,580
My goal, as usual, is to maximize the expected value of my total payoff. We want to come up 

210
00:11:37,580 --> 00:11:40,140
with a policy to maximize the 

211
00:11:40,140 --> 00:11:44,010
expected value of this total payoff. The key difference is that 

212
00:11:44,010 --> 00:11:47,890
the world only will exist [inaudible], and after that, 

213
00:11:47,890 --> 00:11:51,380
there's no more rewards to be corrected. 

214
00:11:51,380 --> 00:11:54,820
So this turns out to be [inaudible] of a difference because it turns out that the optimal 

215
00:11:54,820 --> 00:11:59,710
policy 

216
00:11:59,710 --> 00:12:00,690


217
00:12:00,690 --> 00:12:03,200
may be non-stationary. 

218
00:12:03,200 --> 00:12:06,300


219
00:12:06,300 --> 00:12:07,540
The 

220
00:12:07,540 --> 00:12:10,319
term, stationary, means that it doesn't depend 

221
00:12:10,319 --> 00:12:12,649
on time. Non-stationary 

222
00:12:12,649 --> 00:12:14,419
means that it may depend on time. 

223
00:12:14,419 --> 00:12:17,810
So non-stationary 

224
00:12:17,810 --> 00:12:21,820
roughly means that my optimal action to take will be different for different time 

225
00:12:21,820 --> 00:12:22,490
steps. 

226
00:12:22,490 --> 00:12:24,430
That's what nonstationary 

227
00:12:24,430 --> 00:12:26,790
means. Just as an example of that, 

228
00:12:26,790 --> 00:12:31,240
imagine that we have some robot. Let's say the 

229
00:12:31,240 --> 00:12:33,850
robot 

230
00:12:33,850 --> 00:12:35,290
is here. 

231
00:12:35,290 --> 00:12:37,620
Let's say that there's 

232
00:12:37,620 --> 00:12:40,590
a good [inaudible] over there with a plus one reward. 

233
00:12:40,590 --> 00:12:42,930


234
00:12:42,930 --> 00:12:46,360
Much further away, there's a plus ten reward. 

235
00:12:46,360 --> 00:12:50,530
So depending on how much time you have left on the clock, 

236
00:12:50,530 --> 00:12:52,980
it may be better to go after the plus one 

237
00:12:52,980 --> 00:12:55,330
or the plus ten reward. If it's still early 

238
00:12:55,330 --> 00:12:58,840
in the game, you still have a lot of time, it may be better to head toward 

239
00:12:58,840 --> 00:13:03,030
the plus-ten rewards junction and get a much larger 

240
00:13:03,030 --> 00:13:06,040
reward. If you only have a couple of time sets left, if the clock has nearly reached 

241
00:13:06,040 --> 00:13:08,069
the time, capital T, 

242
00:13:08,069 --> 00:13:11,790
then you may not have enough time to get to a plus ten reward. You've be better 

243
00:13:11,790 --> 00:13:16,050
off heading for the plus one reward that's much more close by. 

244
00:13:16,050 --> 00:13:19,660
So what this example illustrates is that when you're in that state, 

245
00:13:19,660 --> 00:13:23,740
the best action to take could be to go left or to go right, depending on what 

246
00:13:23,740 --> 00:13:26,499
time it is. So just as an example, illustrating 

247
00:13:26,499 --> 00:13:31,960
how the actually policy can be non-stationary. 

248
00:13:31,960 --> 00:13:33,060


249
00:13:33,060 --> 00:13:34,710


250
00:13:34,710 --> 00:13:36,410
In fact, 

251
00:13:36,410 --> 00:13:38,120
since we have non-stationary 

252
00:13:38,120 --> 00:13:43,120
policies anyway in the sequence, what I'm going to do next, I'm going to 

253
00:13:43,120 --> 00:13:44,159
allow 

254
00:13:44,159 --> 00:13:48,390
non-stationary transition probabilities as well. So I'll just write that 

255
00:13:48,390 --> 00:13:50,390
up there. 

256
00:13:50,390 --> 00:13:52,170
What I mean is that 

257
00:13:52,170 --> 00:13:56,640
so far, assuming that the state ST plus one, is joined from the state 

258
00:13:56,640 --> 00:13:59,480
transition probabilities [inaudible] 

259
00:13:59,480 --> 00:14:02,110
by the previous states and the previous action. 

260
00:14:02,110 --> 00:14:05,650
I've been assuming that these state transition probabilities are the same 

261
00:14:05,650 --> 00:14:10,170
for all times. So I want to say [inaudible] and take some action, 

262
00:14:10,170 --> 00:14:12,540
the distribution of an innate state doesn't matter. 

263
00:14:12,540 --> 00:14:15,910
It doesn't depend on time. 

264
00:14:15,910 --> 00:14:16,700
So 

265
00:14:16,700 --> 00:14:21,890
I'm going to allow a study more general definition as well, in which we have 

266
00:14:21,890 --> 00:14:24,600
non-stationary state transition probabilities 

267
00:14:24,600 --> 00:14:25,329
so that 

268
00:14:25,329 --> 00:14:27,960
the chance of where you end up [inaudible] 

269
00:14:27,960 --> 00:14:29,590
may also 

270
00:14:29,590 --> 00:14:33,060
depend on what time it is. 

271
00:14:33,060 --> 00:14:37,320
So as examples of this non-stationary state 

272
00:14:37,320 --> 00:14:39,000
transition probabilities, 

273
00:14:39,000 --> 00:14:44,400
one example would be if you model flying an aircraft over a long distance. 

274
00:14:44,400 --> 00:14:48,650
Then as the aircraft flies, you burn fuel and become lighter. So the 

275
00:14:48,650 --> 00:14:52,510
dynamics of the aircraft actually change over time. The mass of the aircraft can 

276
00:14:52,510 --> 00:14:55,160
change significantly over time as you burn fuel. 

277
00:14:55,160 --> 00:14:56,480
So depending on 

278
00:14:56,480 --> 00:15:01,130
what time it is, your mixed state could actually 

279
00:15:01,130 --> 00:15:03,320
depend on not only your current state and your action, 

280
00:15:03,320 --> 00:15:06,240
but also on how much fuel you burn, therefore, what time it is. 

281
00:15:06,240 --> 00:15:07,250


282
00:15:07,250 --> 00:15:10,510
Other examples, another aerospace one, is 

283
00:15:10,510 --> 00:15:14,620
if you have the weather forecast for the next 24 hours, say, 

284
00:15:14,620 --> 00:15:18,360
you know what the winds and precipitation are going to be like over the next 24 hours. Then again, 

285
00:15:18,360 --> 00:15:22,200
if you fly the aircraft from, say, here to New York, 

286
00:15:22,200 --> 00:15:25,019
it may cost different amounts to fly 

287
00:15:25,019 --> 00:15:28,829
different [inaudible] at different times. Maybe flying over 

288
00:15:28,829 --> 00:15:33,439
the Rockies may cost different amounts, depending on whether you do it now, when there's really great weather, or 

289
00:15:33,439 --> 00:15:34,319
if you do it 

290
00:15:34,319 --> 00:15:37,629
a few hours from now, when the weather may be forecast really bad. 

291
00:15:37,629 --> 00:15:38,720


292
00:15:38,720 --> 00:15:39,380


293
00:15:39,380 --> 00:15:40,970
For an 

294
00:15:40,970 --> 00:15:46,440
example you see everyday, same thing for traffic, right? 

295
00:15:46,440 --> 00:15:48,380
There's at least - 

296
00:15:48,380 --> 00:15:52,510
depending on where you live, certainly here in California, there are times of day where traffic is 

297
00:15:52,510 --> 00:15:57,050
really bad in lots of places. So the costs of driving certain roads may vary, depending on 

298
00:15:57,050 --> 00:15:59,070
what time of day it is. 

299
00:15:59,070 --> 00:16:03,589
Lots of other examples. Industrial automation, different machines 

300
00:16:03,589 --> 00:16:07,160
in the factory may be available to different degrees at different times of day. They 

301
00:16:07,160 --> 00:16:10,149
cost different amounts to hire different workers, depending on whether you pay 

302
00:16:10,149 --> 00:16:12,950
over time [inaudible] or whatever. 

303
00:16:12,950 --> 00:16:15,960
So the cost of doing different things in the factory can also 

304
00:16:15,960 --> 00:16:17,250
be a function of time. 

305
00:16:17,250 --> 00:16:18,429


306
00:16:18,429 --> 00:16:24,280
The state transition probabilities can also be a function of time. 

307
00:16:24,280 --> 00:16:25,750


308
00:16:25,750 --> 00:16:27,850
Lastly, 

309
00:16:27,850 --> 00:16:31,910
while we're doing this as well, to make this fully general, we might as 

310
00:16:31,910 --> 00:16:37,730
well have non-stationary [inaudible] as well, 

311
00:16:37,730 --> 00:16:40,639
where you might also index the reward 

312
00:16:40,639 --> 00:16:42,679
function of these times and prescripts, 

313
00:16:42,679 --> 00:16:45,090
where the cost of doing different things may depend on 

314
00:16:45,090 --> 00:16:47,830
the time as well. 

315
00:16:47,830 --> 00:16:50,530


316
00:16:50,530 --> 00:16:55,590
Actually, there's more examples of non-stationary NVPs, so let's - so now we 

317
00:16:55,590 --> 00:16:59,249
have a nonstationary policy. Let's talk about an algorithm to actually 

318
00:16:59,249 --> 00:17:02,180
try to find the optimal policy. 

319
00:17:02,180 --> 00:17:04,110


320
00:17:04,110 --> 00:17:10,089
So let me 

321
00:17:10,089 --> 00:17:13,580
define the following. 

322
00:17:13,580 --> 00:17:20,390
This is now a slightly modified definition of the optimal value function. I'll 

323
00:17:20,390 --> 00:17:27,390
just 

324
00:17:30,970 --> 00:17:37,970
write this down, I guess. 

325
00:17:39,080 --> 00:17:42,620
So I'm going to define the optimal value function, and this going to be 

326
00:17:42,620 --> 00:17:44,690
indexed by T, with a subscript T. The 

327
00:17:44,690 --> 00:17:46,480
optimal value of a state 

328
00:17:46,480 --> 00:17:47,840


329
00:17:47,840 --> 00:17:50,360
for time, T, we're going to define as your 

330
00:17:50,360 --> 00:17:52,300
optimal 

331
00:17:52,300 --> 00:17:57,320
sum of rewards for if you start the NVP at that state, S, 

332
00:17:57,320 --> 00:17:59,309
and if the clock starts off 

333
00:17:59,309 --> 00:18:02,200
at time, 

334
00:18:02,200 --> 00:18:06,400
lowercase T. So the optimal value of a state will depend on what time it is and how 

335
00:18:06,400 --> 00:18:09,919
much time you have lest to run this 

336
00:18:09,919 --> 00:18:11,440
NVP. Therefore, 

337
00:18:11,440 --> 00:18:16,230
the sum on the right sums only for time T, time T plus one, time T plus two up 

338
00:18:16,230 --> 00:18:17,840
to time, 

339
00:18:17,840 --> 00:18:19,520
capital T. 

340
00:18:19,520 --> 00:18:22,570


341
00:18:22,570 --> 00:18:26,620
I'll just state in English again, this is your expected optimal 

342
00:18:26,620 --> 00:18:28,059
total payoff 

343
00:18:28,059 --> 00:18:31,240
if you start your system in a state, S, 

344
00:18:31,240 --> 00:18:34,760
and if the clock is already at time, 

345
00:18:34,760 --> 00:18:37,010
lowercase T. 

346
00:18:37,010 --> 00:18:38,370


347
00:18:38,370 --> 00:18:42,620
So it turns out then there's a [inaudible], you can value that [inaudible]. 

348
00:18:42,620 --> 00:18:46,080
Let me just write out the value [inaudible] algorithm for this. 

349
00:18:46,080 --> 00:18:48,059
It turns out 

350
00:18:48,059 --> 00:18:49,270
you can - 

351
00:18:49,270 --> 00:18:54,150
well, let me just write this 

352
00:18:54,150 --> 00:18:56,920
out. I'll write this below. 

353
00:18:56,920 --> 00:19:00,260
It turns out you can compute the optimal value function for the NVP using the 

354
00:19:00,260 --> 00:19:03,020
following recursion, which is very similar to 

355
00:19:03,020 --> 00:19:06,200
what we have for value iteration. We're going 

356
00:19:06,200 --> 00:19:06,879
to set V 

357
00:19:06,879 --> 00:19:12,080
of S to be equal to [inaudible] over 

358
00:19:12,080 --> 00:19:12,889
A, 

359
00:19:12,889 --> 00:19:19,889
same as before, right? Okay? 

360
00:19:41,110 --> 00:19:42,419
So 

361
00:19:42,419 --> 00:19:46,650
if I start the clock at time T and from state S, 

362
00:19:46,650 --> 00:19:49,230
my expected total payoff is equal to 

363
00:19:49,230 --> 00:19:51,590
the maximum [inaudible] actions I may take 

364
00:19:51,590 --> 00:19:53,500
of my immediate reward. Taking that 

365
00:19:53,500 --> 00:19:56,100
action, A, in that state, 

366
00:19:56,100 --> 00:20:00,050
S. Them plus my expected future payoff. So if I take action, A, 

367
00:20:00,050 --> 00:20:01,690
I would transition with [inaudible] P, subscript SA, S prime 

368
00:20:01,690 --> 00:20:06,470
to some new state, S 

369
00:20:06,470 --> 00:20:09,470
prime. If I get to the state, S prime, 

370
00:20:09,470 --> 00:20:11,600
my total expected payoff from the 

371
00:20:11,600 --> 00:20:14,270
state S prime would be these [inaudible] 

372
00:20:14,270 --> 00:20:18,610
now subscript T plus one, that's prime. Subscript T plus one 

373
00:20:18,610 --> 00:20:23,400
reflects that after I've taken one action, my clock will have advanced from 

374
00:20:23,400 --> 00:20:26,400
time T to time T plus one. So this is now V 

375
00:20:26,400 --> 00:20:30,559
star subscript T plus one. 

376
00:20:30,559 --> 00:20:33,909
So this expresses V star of T 

377
00:20:33,909 --> 00:20:36,230
in terms of V star T plus one. 

378
00:20:36,230 --> 00:20:39,740
Then lastly, to start off this recursion, you 

379
00:20:39,740 --> 00:20:41,700
would have V star, 

380
00:20:41,700 --> 00:20:43,270
capital T 

381
00:20:43,270 --> 00:20:50,270
is equal to - it's just equal 

382
00:20:51,300 --> 00:20:53,220
to that. 

383
00:20:53,220 --> 00:20:56,900
If you're already at time, capital T, then you just get to take one action, and then 

384
00:20:56,900 --> 00:20:59,070
the clock runs out. So this is 

385
00:20:59,070 --> 00:21:03,610
V star capital T. Your value of starting in some state, S, with no time - 

386
00:21:03,610 --> 00:21:08,809
with just one time step, with no time left on the clock. 

387
00:21:08,809 --> 00:21:12,880
So in the case of finite horizon NVP, this actually gives up a very nice 

388
00:21:12,880 --> 00:21:13,830


389
00:21:13,830 --> 00:21:15,640
dynamic programming algorithm 

390
00:21:15,640 --> 00:21:16,559
in which you can 

391
00:21:16,559 --> 00:21:20,740
start off by computing V star of T. 

392
00:21:20,740 --> 00:21:24,030
Then you use this backward [inaudible] to compute 

393
00:21:24,030 --> 00:21:25,089
V star of 

394
00:21:25,089 --> 00:21:29,349
capital T minus one, capital T minus two and so on. We compute V star 

395
00:21:29,349 --> 00:21:32,230
of T and T minus one and so on. It recurs backwards 

396
00:21:32,230 --> 00:21:34,670
onto your computer, V star 

397
00:21:34,670 --> 00:21:38,450
for all of your time steps. Can you see this 

398
00:21:38,450 --> 00:21:39,500
board? 

399
00:21:39,500 --> 00:21:42,250
Cool. Then 

400
00:21:42,250 --> 00:21:44,770


401
00:21:44,770 --> 00:21:47,570
the final step is 

402
00:21:47,570 --> 00:21:52,810
- previously, we said that pi star of S - I'm going to come back and change this a bit - was the [inaudible] 

403
00:21:52,810 --> 00:21:55,580
A of R 

404
00:21:55,580 --> 00:21:57,150
plus A 

405
00:21:57,150 --> 00:22:04,150
plus [inaudible] PSA - this 

406
00:22:04,280 --> 00:22:07,230
is sort of what we had. In 

407
00:22:07,230 --> 00:22:09,789
the finite horizon case, the 

408
00:22:09,789 --> 00:22:13,250
ultimate action may depend on what time it is. So the ultimate action to take it, 

409
00:22:13,250 --> 00:22:16,820
time T, is [inaudible] actions, A. 

410
00:22:16,820 --> 00:22:19,249
This 

411
00:22:19,249 --> 00:22:22,350
is basically the 

412
00:22:22,350 --> 00:22:26,840
augmat of exactly that same thing on the right-hand side as we had 

413
00:22:26,840 --> 00:22:29,970
in our dynamic programming 

414
00:22:29,970 --> 00:22:33,060
algorithm. So you do this for every time step, and now you compute it, 

415
00:22:33,060 --> 00:22:36,950
your optimal policy for different time 

416
00:22:36,950 --> 00:22:40,410
steps. Again, this is a non-stationary policy 

417
00:22:40,410 --> 00:22:42,290
because pi star 

418
00:22:42,290 --> 00:22:46,920
of S my depend on what time it is. 

419
00:22:46,920 --> 00:22:50,420
So [inaudible] the difference between this 

420
00:22:50,420 --> 00:22:57,150
and the early version of the earlier version of value iteration is that - so what you do is you complete V star of T. 

421
00:22:57,150 --> 00:23:00,620
Then using the backwards recursion of that [inaudible] algorithm, you computer V star T 

422
00:23:00,620 --> 00:23:04,130
minus one. Then V star T minus two 

423
00:23:04,130 --> 00:23:06,710
and so on down to V star of zero. 

424
00:23:06,710 --> 00:23:09,660
Then from these, you compute pi star. 

425
00:23:09,660 --> 00:23:13,210


426
00:23:13,210 --> 00:23:17,010
So one - there's not a huge difference, but one minus 

427
00:23:17,010 --> 00:23:18,100
difference [inaudible] 

428
00:23:18,100 --> 00:23:21,919
the infinite horizon discounted case 

429
00:23:21,919 --> 00:23:26,350
is that by running this recursion once, you now have exactly 

430
00:23:26,350 --> 00:23:29,550
the right value function. So this just computes the value function, rather 

431
00:23:29,550 --> 00:23:30,560
than 

432
00:23:30,560 --> 00:23:37,560
merely converging [inaudible]. This just gives you the right value function with one 

433
00:23:37,710 --> 00:23:39,710


434
00:23:39,710 --> 00:23:42,710
pass. Cool. Any questions there? Yeah. 

435
00:23:42,710 --> 00:23:47,600
Interviewee: [Inaudible]. Instructor (Andrew 

436
00:23:47,600 --> 00:23:50,450


437
00:23:50,450 --> 00:23:54,080
Ng):This computation's much shorter than valuations. So 

438
00:23:54,080 --> 00:23:58,460
sort of yes and no. It depends on how large capital T is. Interviewee: [Inaudible] the normal NVP, could 

439
00:23:58,460 --> 00:24:05,460
[inaudible] and then use this case for that case? Instructor (Andrew Ng):I see. 

440
00:24:06,100 --> 00:24:09,430
So for the normal NVP, can you assume capital T and then assume this? So 

441
00:24:09,430 --> 00:24:13,049
it actually turns out that - that's a 

442
00:24:13,049 --> 00:24:14,740
great question. Let 

443
00:24:14,740 --> 00:24:18,640
me just answer this in a hand-wavy way. 

444
00:24:18,640 --> 00:24:23,140
So it actually turns out for a discounted infinite horizon NVP 

445
00:24:23,140 --> 00:24:25,670
where some discount factor gamma. 

446
00:24:25,670 --> 00:24:29,430
So what you can do is you can say, 

447
00:24:29,430 --> 00:24:31,160
after T times X, 

448
00:24:31,160 --> 00:24:34,990
gamma to the T would be really small. It would be like [inaudible] something. I 

449
00:24:34,990 --> 00:24:36,130
don't 

450
00:24:36,130 --> 00:24:39,750
really care what happens after that many times because the rewards are 

451
00:24:39,750 --> 00:24:42,630
multiplied by gamma to the T. After that, I don't really care. So you can 

452
00:24:42,630 --> 00:24:43,919
ask, 

453
00:24:43,919 --> 00:24:45,120
can I take 

454
00:24:45,120 --> 00:24:48,490
my infinite horizon discounted NVP 

455
00:24:48,490 --> 00:24:53,190
and approximate that with a finite horizon NVP where the number of times, steps T, 

456
00:24:53,190 --> 00:24:55,180
is chosen so that [inaudible] true. So it 

457
00:24:55,180 --> 00:24:57,700
turns out you can do that. 

458
00:24:57,700 --> 00:25:01,890
Then you end up with some value for T. You can solve for T so 

459
00:25:01,890 --> 00:25:03,910
that this holds true. 

460
00:25:03,910 --> 00:25:09,500
It turns out you can prove that if you took the original value iteration algorithm 

461
00:25:09,500 --> 00:25:14,780
and if you run the original value of the [inaudible] algorithm - the version for discounted NVPs. 

462
00:25:14,780 --> 00:25:16,830
If you run that for 

463
00:25:16,830 --> 00:25:19,590
this same number of time steps, 

464
00:25:19,590 --> 00:25:23,790
you will end up with an approximation to the value function that is 

465
00:25:23,790 --> 00:25:26,730
about this close, up to some small constant factors. 

466
00:25:26,730 --> 00:25:30,960
So to do that, you end up with roughly the same amounts of computation anyway. 

467
00:25:30,960 --> 00:25:32,550
Then 

468
00:25:32,550 --> 00:25:35,780
you actually end up with a non-stationary policy, which is more expensive to keep 

469
00:25:35,780 --> 00:25:36,650


470
00:25:36,650 --> 00:25:40,720
around. You need to keep around the different policy every time step, which is not 

471
00:25:40,720 --> 00:25:46,880
as nice as if you had the stationary policy, same policy for all times. So 

472
00:25:46,880 --> 00:25:50,600
there are other reasons, but sometimes you might take an 

473
00:25:50,600 --> 00:25:54,560
infinite horizon discounted problem and approximate it to a finite horizon problem. 

474
00:25:54,560 --> 00:25:56,190
But 

475
00:25:56,190 --> 00:25:58,760
this particular reason is 

476
00:25:58,760 --> 00:26:01,270
not the one. That makes 

477
00:26:01,270 --> 00:26:04,210
sense. More 

478
00:26:04,210 --> 00:26:11,210
questions? Interviewee: [Inaudible]? Instructor (Andrew 

479
00:26:12,580 --> 00:26:14,680
Ng):Is 

480
00:26:14,680 --> 00:26:16,870
there a gamma in this? 

481
00:26:16,870 --> 00:26:20,159
So if you want, you can actually 

482
00:26:20,159 --> 00:26:22,570
change the definition of an NVP 

483
00:26:22,570 --> 00:26:27,920
and use a finite horizon discounted NVP. If you want, you can do that. You 

484
00:26:27,920 --> 00:26:29,040
can actually come 

485
00:26:29,040 --> 00:26:32,090
in and put a gamma there, 

486
00:26:32,090 --> 00:26:35,140
and use this counting the finite horizon. 

487
00:26:35,140 --> 00:26:37,690
It turns out that usually, for most 

488
00:26:37,690 --> 00:26:41,470
problems that people deal with, you either use discounting 

489
00:26:41,470 --> 00:26:43,220
or you 

490
00:26:43,220 --> 00:26:47,059
use the finite horizon. It's been less common to do both, but you can certainly do as well. 

491
00:26:47,059 --> 00:26:51,990
One of the nice things about discounting, it makes such your value function is finite. 

492
00:26:51,990 --> 00:26:52,860


493
00:26:52,860 --> 00:26:57,549
Algorithmically and mathematically, one of the reasons to use discounting is 

494
00:26:57,549 --> 00:27:01,309
because you're multiplying your rewards exponentially. It's a geometrically 

495
00:27:01,309 --> 00:27:02,640
[inaudible] series. 

496
00:27:02,640 --> 00:27:05,049
It shows that the value function is always finite. 

497
00:27:05,049 --> 00:27:07,780
This is a really nice mathematical properties when you do discounting. 

498
00:27:07,780 --> 00:27:11,360
So when you have a finite horizon anyway, 

499
00:27:11,360 --> 00:27:15,179
then the value function's also guaranteed to be finite. So with that, you 

500
00:27:15,179 --> 00:27:22,179
don't have to use discounting. But if you want, you can actually discount 

501
00:27:24,130 --> 00:27:27,910
as well. Interviewee: [Inaudible]. Instructor (Andrew Ng):Yeah, yes, you're right. If you want, you can redefine the reward 

502
00:27:27,910 --> 00:27:34,910
function to go downward into the to the reward function, since we have nonstationary rewards as well. So 

503
00:27:39,690 --> 00:27:41,049
that was 

504
00:27:41,049 --> 00:27:44,230
finite horizon 

505
00:27:44,230 --> 00:27:45,409
NVPs. 

506
00:27:45,409 --> 00:27:48,960
What I want to do now is actually use 

507
00:27:48,960 --> 00:27:53,030
both of these ideas, your state action rewards and your finite horizon NVPs 

508
00:27:53,030 --> 00:27:57,580
to describe a special case of NVPs that 

509
00:27:57,580 --> 00:28:00,380
makes very strong assumptions about the problem. 

510
00:28:00,380 --> 00:28:00,880


511
00:28:00,880 --> 00:28:04,390
But these assumptions are reasonable for many systems. With these 

512
00:28:04,390 --> 00:28:06,090
assumptions, what we come up with, I 

513
00:28:06,090 --> 00:28:11,960
think, are very nice and very elegant algorithms for solving even very large 

514
00:28:11,960 --> 00:28:16,240
NVPs. 

515
00:28:16,240 --> 00:28:23,240


516
00:28:30,100 --> 00:28:32,820
So let's talk about linear 

517
00:28:32,820 --> 00:28:35,760
quadratic regulation. 

518
00:28:35,760 --> 00:28:42,760


519
00:28:46,740 --> 00:28:47,710


520
00:28:47,710 --> 00:28:49,570


521
00:28:49,570 --> 00:28:51,669


522
00:28:51,669 --> 00:28:55,860
We just talked about dynamic programming for finite horizon NVPs, so 

523
00:28:55,860 --> 00:28:58,149
just remember that algorithm. 

524
00:28:58,149 --> 00:29:01,850
When I come back to talk about an algorithm for solving LQR problems, 

525
00:29:01,850 --> 00:29:05,320
I'm actually going to use exactly that dynamic programming algorithm 

526
00:29:05,320 --> 00:29:10,320
that you just saw for finite horizon NVPs. I'll be using exactly 

527
00:29:10,320 --> 00:29:13,929
that algorithm again. So just remember that for now. 

528
00:29:13,929 --> 00:29:16,320
So let's talk about LQR. 

529
00:29:16,320 --> 00:29:21,750
So I want to take these ideas and apply them to NVPs with continuous 

530
00:29:21,750 --> 00:29:25,150
state spaces and maybe even continuous action 

531
00:29:25,150 --> 00:29:26,170
spaces. So 

532
00:29:26,170 --> 00:29:28,740
to specify and 

533
00:29:28,740 --> 00:29:33,679
NVPs, I need to give you this fivetuple 

534
00:29:33,679 --> 00:29:36,380
of state actions, [inaudible] in the reward. I'm going to use 

535
00:29:36,380 --> 00:29:40,730
the finite horizon, capital T, rather than 

536
00:29:40,730 --> 00:29:42,570
discounting. 

537
00:29:42,570 --> 00:29:47,809
So in LQR problems, I'm going to assume the following. I'm 

538
00:29:47,809 --> 00:29:51,309
going to assume that the 

539
00:29:51,309 --> 00:29:54,400
[inaudible] space is [inaudible] RN. 

540
00:29:54,400 --> 00:29:58,289
And I'm going to assume, also, a continuous set of 

541
00:29:58,289 --> 00:30:03,830
actions lie in RT. 

542
00:30:03,830 --> 00:30:08,320
To specify the state transition probabilities, PSA, 

543
00:30:08,320 --> 00:30:11,930
I need to tell you what the distribution of the mixed state is, given the current state and 

544
00:30:11,930 --> 00:30:13,510
the current action. 

545
00:30:13,510 --> 00:30:17,450
So we actually saw a little bit of this in the last lecture. I want to assume 

546
00:30:17,450 --> 00:30:19,519
the next state, ST plus one, 

547
00:30:19,519 --> 00:30:23,100
is going to be a linear function 

548
00:30:23,100 --> 00:30:25,010
of the previous 

549
00:30:25,010 --> 00:30:27,200
state, AST plus BAT 

550
00:30:27,200 --> 00:30:29,309
plus WT - 

551
00:30:29,309 --> 00:30:34,650
oh, excuse me. I meant to subscript that. 

552
00:30:34,650 --> 00:30:40,590


553
00:30:40,590 --> 00:30:47,590


554
00:30:54,590 --> 00:30:59,190
Where WT is Gaussian [inaudible] would mean zero and some covariance 

555
00:30:59,190 --> 00:31:02,260
given by sigma 

556
00:31:02,260 --> 00:31:07,370
W. Subscripts at A and B here with subscripts T to show 

557
00:31:07,370 --> 00:31:09,000
that 

558
00:31:09,000 --> 00:31:11,809
these matrixes could change over time. So this would be 

559
00:31:11,809 --> 00:31:14,940
non-stationary dynamics. 

560
00:31:14,940 --> 00:31:20,200
As a point of notation, unfortunately compiling 

561
00:31:20,200 --> 00:31:22,240
ideas from multiple literatures, 

562
00:31:22,240 --> 00:31:26,220
so it's sort of unfortunately that capital A denotes both a set of actions 

563
00:31:26,220 --> 00:31:27,860
as well as 

564
00:31:27,860 --> 00:31:29,609
a matrix. 

565
00:31:29,609 --> 00:31:33,889
When you see A later on, A will usually be used to denote a matrix, 

566
00:31:33,889 --> 00:31:38,870
rather than a set of actions. So [inaudible] overload notation again, but 

567
00:31:38,870 --> 00:31:41,820
unfortunately the notational conventions when you have 

568
00:31:41,820 --> 00:31:46,660
research ideas in multiple research communities, often they share the same symbol. 

569
00:31:46,660 --> 00:31:47,789


570
00:31:47,789 --> 00:31:49,830
So just to be concrete, 

571
00:31:49,830 --> 00:31:53,020
AT is a matrix 

572
00:31:53,020 --> 00:31:55,640
that's N 

573
00:31:55,640 --> 00:32:01,270
by N. [Inaudible] matrixes that are N by D. 

574
00:32:01,270 --> 00:32:02,130


575
00:32:02,130 --> 00:32:05,559
Just to be completely clear, right, the matrixes A and B, 

576
00:32:05,559 --> 00:32:09,230
I'm going to assume, are fixed and known in advance. So I'm going to 

577
00:32:09,230 --> 00:32:11,700
give you the matrixes, 

578
00:32:11,700 --> 00:32:15,750
A and B, and I'm going to give you sigma W. Your job is to find a good policy 

579
00:32:15,750 --> 00:32:17,060
for 

580
00:32:17,060 --> 00:32:18,840
this NVP. So in other words, 

581
00:32:18,840 --> 00:32:20,870
this is my specification 

582
00:32:20,870 --> 00:32:25,309
of the state transition probabilities. 

583
00:32:25,309 --> 00:32:28,020
Looking ahead, 

584
00:32:28,020 --> 00:32:31,190
we see this later, it turns out this 

585
00:32:31,190 --> 00:32:38,190
noise term is not very important. So 

586
00:32:39,630 --> 00:32:40,850
it turns out that 

587
00:32:40,850 --> 00:32:44,250
the treatment of the noise term is not very important. 

588
00:32:44,250 --> 00:32:45,830
We'll see this later. We can 

589
00:32:45,830 --> 00:32:47,179
pretty much 

590
00:32:47,179 --> 00:32:49,300
ignore the noise 

591
00:32:49,300 --> 00:32:52,680
term, and we'll still do fine. This is just a warning 

592
00:32:52,680 --> 00:32:56,910
in the sequel, what I do later, I might be slightly sloppy 

593
00:32:56,910 --> 00:32:57,970
in my treatment 

594
00:32:57,970 --> 00:33:02,970
of the noise term. In this very special case, it would be unimportant. 

595
00:33:02,970 --> 00:33:04,170
The 

596
00:33:04,170 --> 00:33:05,549
last 

597
00:33:05,549 --> 00:33:09,150
thing I have to specify is some horizon time, 

598
00:33:09,150 --> 00:33:12,580
and then I also have some 

599
00:33:12,580 --> 00:33:13,700
reward 

600
00:33:13,700 --> 00:33:16,550
function. 

601
00:33:16,550 --> 00:33:20,020
For LQR control, I'm going to assume that a reward function 

602
00:33:20,020 --> 00:33:27,020
can be written as 

603
00:33:30,090 --> 00:33:31,850
this, where 

604
00:33:31,850 --> 00:33:35,260
UT is a matrix that's N by N. VT is 

605
00:33:35,260 --> 00:33:36,439
a 

606
00:33:36,439 --> 00:33:37,950


607
00:33:37,950 --> 00:33:41,020
matrix that's D by D. I'll 

608
00:33:41,020 --> 00:33:44,440
assume that UT 

609
00:33:44,440 --> 00:33:47,040
and VT are both positive semi-definite. 

610
00:33:47,040 --> 00:33:51,289
Are both PSD. So 

611
00:33:51,289 --> 00:33:57,360
the fact that UT and VT are both positive semi-definite matrixes, 

612
00:33:57,360 --> 00:33:59,770
that implies that 

613
00:33:59,770 --> 00:34:02,259
ST transpose, UT, ST [inaudible] 

614
00:34:02,259 --> 00:34:07,000
zero. Similarly, ST transpose 

615
00:34:07,000 --> 00:34:09,569
are VT, AT, [inaudible] zero. 

616
00:34:09,569 --> 00:34:16,519
So this implies that 

617
00:34:16,519 --> 00:34:20,190
your rewards are always negative. This is a 

618
00:34:20,190 --> 00:34:23,529
somewhat depressing NVP in which there are only costs and no positive rewards, right, 

619
00:34:23,529 --> 00:34:24,369
because of 

620
00:34:24,369 --> 00:34:27,169
the minus sign 

621
00:34:27,169 --> 00:34:28,569
there. 

622
00:34:28,569 --> 00:34:35,569
So 

623
00:34:41,399 --> 00:34:43,319


624
00:34:43,319 --> 00:34:46,659
as 

625
00:34:46,659 --> 00:34:49,630
a complete example for how you might want to apply this, 

626
00:34:49,630 --> 00:34:53,510
you've seen my helicopter videos, right? So one thing is, 

627
00:34:53,510 --> 00:34:55,729
for example, 

628
00:34:55,729 --> 00:34:57,989
suppose you have a helicopter, 

629
00:34:57,989 --> 00:35:00,890
and you want the state ST to be 

630
00:35:00,890 --> 00:35:04,229
as close to zero as possible. 

631
00:35:04,229 --> 00:35:05,440


632
00:35:05,440 --> 00:35:08,059
Then 

633
00:35:08,059 --> 00:35:10,210
you might choose UT 

634
00:35:10,210 --> 00:35:12,619
to be equal to the identity matrix. 

635
00:35:12,619 --> 00:35:16,709
This will make R of STAT be 

636
00:35:16,709 --> 00:35:20,060
equal to ST transpose 

637
00:35:20,060 --> 00:35:23,819
ST. But that's just 

638
00:35:23,819 --> 00:35:27,239


639
00:35:27,239 --> 00:35:30,719
- I'll just 

640
00:35:30,719 --> 00:35:33,809
write that down. [Inaudible] oh, excuse me - minus. 

641
00:35:33,809 --> 00:35:38,429
The squared negative of the squared [inaudible] vector. So this would be penalizing the 

642
00:35:38,429 --> 00:35:39,680
system 

643
00:35:39,680 --> 00:35:40,769
quadratically 

644
00:35:40,769 --> 00:35:45,219
for having a state that's half of zero, assuming that zero's the origin state. So if it 

645
00:35:45,219 --> 00:35:46,889
goes to make a helicopter 

646
00:35:46,889 --> 00:35:52,039
hover around the state zero, then you might choose this sort of reward function. It 

647
00:35:52,039 --> 00:35:54,439
turns out 

648
00:35:54,439 --> 00:35:55,279


649
00:35:55,279 --> 00:35:58,679
it's also very common for action to choose a cost 

650
00:35:58,679 --> 00:36:03,209
for the action. So suppose I choose VT to be equal to an identity matrix. I get minus 

651
00:36:03,209 --> 00:36:05,359
AT transpose 

652
00:36:05,359 --> 00:36:09,509
AT 

653
00:36:09,509 --> 00:36:12,259
here. Then minus [inaudible] actions as well. 

654
00:36:12,259 --> 00:36:16,079
Including a quadratic cost for actions, it's also a fairly common 

655
00:36:16,079 --> 00:36:17,099
thing to do. 

656
00:36:17,099 --> 00:36:19,560
In practice, this tends to be effective 

657
00:36:19,560 --> 00:36:22,630
of discouraging your system from 

658
00:36:22,630 --> 00:36:26,209
jerking the controls around. This discourages making very huge control commands. Having 

659
00:36:26,209 --> 00:36:27,779
a term like this 

660
00:36:27,779 --> 00:36:31,859
reward function often makes many systems behave better. 

661
00:36:31,859 --> 00:36:35,789
Of course, [inaudible] choose different values, we have different values on the diagonal to 

662
00:36:35,789 --> 00:36:37,069
give different 

663
00:36:37,069 --> 00:36:39,960
state variables, different weight and 

664
00:36:39,960 --> 00:36:44,809
so on. So lots of possible choices for U and B. This is one example. 

665
00:36:44,809 --> 00:36:49,849


666
00:36:49,849 --> 00:36:55,989
So 

667
00:36:55,989 --> 00:36:59,420
for the next few steps, 

668
00:36:59,420 --> 00:37:03,619
I'm going to write out things, I'm going to derive things, 

669
00:37:03,619 --> 00:37:08,509
for the general case of non-stationary dynamics. I'm going - as I write 

670
00:37:08,509 --> 00:37:11,529
out more math and more equations for LQR, 

671
00:37:11,529 --> 00:37:13,809
I'm going to try write it out 

672
00:37:13,809 --> 00:37:18,549
for the fairly general case of time varied dynamics and time varied reward 

673
00:37:18,549 --> 00:37:21,639
functions. So I'm [inaudible] function. 

674
00:37:21,639 --> 00:37:26,309
But for purposes of understanding this material, 

675
00:37:26,309 --> 00:37:29,499
you might want to think of just ignoring many of the subscripts, in terms 

676
00:37:29,499 --> 00:37:31,579
of T. So 

677
00:37:31,579 --> 00:37:33,849


678
00:37:33,849 --> 00:37:36,049
for the sake of [inaudible] material, 

679
00:37:36,049 --> 00:37:40,059
you might want to mentally assume that there are some fixed matrix, A, so 

680
00:37:40,059 --> 00:37:46,079
that A is equal to A1, A2, equals A3 and so on. 

681
00:37:46,079 --> 00:37:50,599
Similarly, there's some matrix B. 

682
00:37:50,599 --> 00:37:51,509
Okay? 

683
00:37:51,509 --> 00:37:55,859
So write it out for the fully general, non-stationary case, but 

684
00:37:55,859 --> 00:37:59,420
you might just want to ignore many of the time subscripts and imagine the 

685
00:37:59,420 --> 00:38:02,099
stationary case for now. 

686
00:38:02,099 --> 00:38:04,079


687
00:38:04,079 --> 00:38:05,999


688
00:38:05,999 --> 00:38:07,419


689
00:38:07,419 --> 00:38:09,939
Quite a bit later, 

690
00:38:09,939 --> 00:38:13,569
we're going to talk about an extension of this called differential dynamic programming that will 

691
00:38:13,569 --> 00:38:16,670
actually use a non-stationary 

692
00:38:16,670 --> 00:38:20,789
[inaudible] to a very powerful effect for a specific algorithm. But 

693
00:38:20,789 --> 00:38:25,579
for most of what we're about to do, just pretend that NVP is stationary. 

694
00:38:25,579 --> 00:38:32,579
Okay. 

695
00:38:49,160 --> 00:38:50,919
So before I 

696
00:38:50,919 --> 00:38:54,759
talk about models, let me jus say a couple of words about how you would go about 

697
00:38:54,759 --> 00:38:56,910
coming up with the linear models. The 

698
00:38:56,910 --> 00:39:00,630
key assumption in this model is that the dynamics are linear. There's also the assumption the 

699
00:39:00,630 --> 00:39:03,429
reward function is quadratic, but 

700
00:39:03,429 --> 00:39:06,849
let's talk about the assumption that the dynamics are linear. 

701
00:39:06,849 --> 00:39:13,789
ST plus one equals AST plus VAT. Maybe time varying, maybe 

702
00:39:13,789 --> 00:39:16,879
stationary. I'm just writing stationary for now. 

703
00:39:16,879 --> 00:39:19,529
So how do you get models like this? 

704
00:39:19,529 --> 00:39:23,169
We actually saw one example of this already in the previous lecture. If 

705
00:39:23,169 --> 00:39:26,729
you have an inverted pendulum system, and you want to model the 

706
00:39:26,729 --> 00:39:28,099
inverted pendulum 

707
00:39:28,099 --> 00:39:28,900


708
00:39:28,900 --> 00:39:34,039
using a linear model like this, maybe [inaudible]. I'm not going to write that down. 

709
00:39:34,039 --> 00:39:37,849
One thing you could do is run your inverted pendulum, start it off in some state as 

710
00:39:37,849 --> 00:39:38,719
zero, 

711
00:39:38,719 --> 00:39:40,799
take some action, A0, have it 

712
00:39:40,799 --> 00:39:45,760
get to some state, S1. Take action A1 and so on, get to some state ST. 

713
00:39:45,760 --> 00:39:48,409
Our index is one 

714
00:39:48,409 --> 00:39:52,139
to denote that this is my first trial. Then you can repeat this a bunch of times. 

715
00:39:52,139 --> 00:39:55,170
You can repeat this N times. I'm 

716
00:39:55,170 --> 00:39:57,670
just executing actions on 

717
00:39:57,670 --> 00:39:59,519
your physical robot. 

718
00:39:59,519 --> 00:40:02,049
It could be a robot, it could be a 

719
00:40:02,049 --> 00:40:05,739
chemical plant. It could be whatever. Trying out different actions in 

720
00:40:05,739 --> 00:40:10,009
your system and watch 

721
00:40:10,009 --> 00:40:12,770


722
00:40:12,770 --> 00:40:14,989
what states it gets to. So 

723
00:40:14,989 --> 00:40:20,619
for the linear model to your data, and choose the parameters A 

724
00:40:20,619 --> 00:40:25,659
and B, 

725
00:40:25,659 --> 00:40:27,939
that minimize 

726
00:40:27,939 --> 00:40:34,939


727
00:40:38,049 --> 00:40:39,709
the quadratic 

728
00:40:39,709 --> 00:40:41,369
error term. 

729
00:40:41,369 --> 00:40:44,030


730
00:40:44,030 --> 00:40:46,949


731
00:40:46,949 --> 00:40:51,640
So this says how well does AST plus BAT 

732
00:40:51,640 --> 00:40:55,359
predict ST plus one. So you minimize the quadratic penalty term. This would be 

733
00:40:55,359 --> 00:40:57,719
one reasonable way to estimate 

734
00:40:57,719 --> 00:41:01,390
the parameters of a linear dynamical system for a 

735
00:41:01,390 --> 00:41:08,390
physical robot or a physical chemical part of whatever they may have. Another 

736
00:41:08,839 --> 00:41:10,569
way to 

737
00:41:10,569 --> 00:41:13,129
come up with a linear 

738
00:41:13,129 --> 00:41:14,209


739
00:41:14,209 --> 00:41:19,199
model 

740
00:41:19,199 --> 00:41:24,180
consistently, if I want to control, is to take a nonlinear model and to 

741
00:41:24,180 --> 00:41:29,920
linearize it. Let me show you what I mean by that. So you can linearize a 

742
00:41:29,920 --> 00:41:35,819
nonlinear model. 

743
00:41:35,819 --> 00:41:37,439
So 

744
00:41:37,439 --> 00:41:38,570


745
00:41:38,570 --> 00:41:41,750
let's say you have some nonlinear model 

746
00:41:41,750 --> 00:41:45,289
that expresses ST plus one as some function of 

747
00:41:45,289 --> 00:41:46,219
ST 

748
00:41:46,219 --> 00:41:47,499
and 

749
00:41:47,499 --> 00:41:49,529
AT. 

750
00:41:49,529 --> 00:41:53,840
In the example in the previous lecture, I said for the inverted pendulum 

751
00:41:53,840 --> 00:41:56,579


752
00:41:56,579 --> 00:41:57,489
[inaudible]. 

753
00:41:57,489 --> 00:42:01,440
By referring to the laws of physics. It was actually by downloading off the shelf 

754
00:42:01,440 --> 00:42:04,380
software for doing physics simulations. So if you haven't 

755
00:42:04,380 --> 00:42:08,139
seen [inaudible] before, you can go online. You can easily find 

756
00:42:08,139 --> 00:42:11,270
many open-source packages for simulating the 

757
00:42:11,270 --> 00:42:14,350
physics of simple devices like these. 

758
00:42:14,350 --> 00:42:18,430
Download the software, type in the specifications of your robot, and it will 

759
00:42:18,430 --> 00:42:21,980
simulate the physics that you use. There's lots of open-source software patches like that. You can just 

760
00:42:21,980 --> 00:42:23,829
download them. 

761
00:42:23,829 --> 00:42:27,060
But something like that, you can now build a physics simulator 

762
00:42:27,060 --> 00:42:31,730
that predicts the state as a function of the previous state and the previous action. So 

763
00:42:31,730 --> 00:42:34,029
you actually come up with some function that 

764
00:42:34,029 --> 00:42:39,550
says that 

765
00:42:39,550 --> 00:42:41,529


766
00:42:41,529 --> 00:42:45,169
- the state [inaudible] next time. The [inaudible] vector 

767
00:42:45,169 --> 00:42:52,169
will be some function of the current state 

768
00:42:53,350 --> 00:42:54,630
and the current 

769
00:42:54,630 --> 00:42:58,019
action, where the action in this case is just a real number that says how 

770
00:42:58,019 --> 00:43:02,129
hard you accelerated to the left or right. 

771
00:43:02,129 --> 00:43:06,359
Then you can take this nonlinear model. I actually wrote down a sample of a 

772
00:43:06,359 --> 00:43:08,119
model in the last lecture, but 

773
00:43:08,119 --> 00:43:11,390
in general, F would be some nonlinear function. [Inaudible] of 

774
00:43:11,390 --> 00:43:12,769
a linear function. 

775
00:43:12,769 --> 00:43:16,919
So what I mean by linearize is the following. So here's just a cartoon. I'll 

776
00:43:16,919 --> 00:43:19,819
write down the math in a second. Let's say the horizontal acces is 

777
00:43:19,819 --> 00:43:24,029
the input state, ST, 

778
00:43:24,029 --> 00:43:27,679
and the output state, ST plus one, as I said. 

779
00:43:27,679 --> 00:43:31,779
Here's 

780
00:43:31,779 --> 00:43:34,709
the function at F. 

781
00:43:34,709 --> 00:43:38,689
So the next state, ST plus one, will be some function of the previous state, 

782
00:43:38,689 --> 00:43:42,029
ST and the action 

783
00:43:42,029 --> 00:43:46,009
AT. So to linearize this model, what you would do is you would choose a point. 

784
00:43:46,009 --> 00:43:49,019
We'll call this bar 

785
00:43:49,019 --> 00:43:54,140
T. Then you would take the derivative of this 

786
00:43:54,140 --> 00:43:56,469
function. For the 

787
00:43:56,469 --> 00:44:00,279
[inaudible] straight line to that function. 

788
00:44:00,279 --> 00:44:04,449
So this allows you to express the next state, ST plus one. You 

789
00:44:04,449 --> 00:44:07,229
can approximate the next state, ST plus one, as 

790
00:44:07,229 --> 00:44:10,089
this linear function of the 

791
00:44:10,089 --> 00:44:13,999
previous state, ST. So 

792
00:44:13,999 --> 00:44:17,809
to make this cartoon really right, the horizontal access here is really a state 

793
00:44:17,809 --> 00:44:20,109
action pair. 

794
00:44:20,109 --> 00:44:24,650
You're linearizing around. So this is just a cartoon. The horizontal 

795
00:44:24,650 --> 00:44:29,699
access represents the input state and the input action. 

796
00:44:29,699 --> 00:44:36,699
So 

797
00:44:42,049 --> 00:44:45,549
just to write this out in 

798
00:44:45,549 --> 00:44:49,529
math, I'll write out the simple case first and the fully general one 

799
00:44:49,529 --> 00:44:51,089
in a second. Suppose 

800
00:44:51,089 --> 00:44:54,709
the horizontal access was only this state. So let's pretend interactions they [inaudible] now. 

801
00:44:54,709 --> 00:44:56,099
ST plus 

802
00:44:56,099 --> 00:45:00,309
one is just some function of ST, than that linear function I drew would be 

803
00:45:00,309 --> 00:45:02,219
ST plus one. 

804
00:45:02,219 --> 00:45:04,450
We're approximating 

805
00:45:04,450 --> 00:45:08,349
as F prime evaluated at some point as bar T 

806
00:45:08,349 --> 00:45:12,189
times ST times S bar T. 

807
00:45:12,189 --> 00:45:15,579
Plus 

808
00:45:15,579 --> 00:45:17,829
S bar 

809
00:45:17,829 --> 00:45:21,879
T. So with this, you'd express ST plus one as a linear 

810
00:45:21,879 --> 00:45:22,719
function 

811
00:45:22,719 --> 00:45:29,359
of ST. Just note that S bar T is a constant. 

812
00:45:29,359 --> 00:45:31,179


813
00:45:31,179 --> 00:45:32,520
It's not 

814
00:45:32,520 --> 00:45:36,520
a variable. Does that make sense? S bar T is a constant. F prime of S bar T is gradient of the function F at the point 

815
00:45:36,520 --> 00:45:38,019
S bar T. This is 

816
00:45:38,019 --> 00:45:41,779
really just the equation of that linear function. So you 

817
00:45:41,779 --> 00:45:46,309
can then convert this to 

818
00:45:46,309 --> 00:45:51,809


819
00:45:51,809 --> 00:45:55,399
A and B matrixes. Jumping back one board, I'm going 

820
00:45:55,399 --> 00:45:57,040
to point out one other thing. 

821
00:45:57,040 --> 00:45:57,829
Let's 

822
00:45:57,829 --> 00:45:59,929
say I look at this straight line, 

823
00:45:59,929 --> 00:46:01,159
and I ask 

824
00:46:01,159 --> 00:46:03,489
how well does this straight line 

825
00:46:03,489 --> 00:46:07,769
approximate my function F, my original simulator, my original function F. 

826
00:46:07,769 --> 00:46:10,630
Then you sort of notice that 

827
00:46:10,630 --> 00:46:12,440
in this neighborhood, 

828
00:46:12,440 --> 00:46:14,830
in the neighborhood of S bar, there's a 

829
00:46:14,830 --> 00:46:17,400
pretty good approximation. It's fairly close. 

830
00:46:17,400 --> 00:46:19,389
But then as you move further away, 

831
00:46:19,389 --> 00:46:24,419
moving far off to the left here, it becomes a pretty terrible approximation. 

832
00:46:24,419 --> 00:46:25,539
So 

833
00:46:25,539 --> 00:46:29,139
when you linearize 

834
00:46:29,139 --> 00:46:33,099
a nonlinear model to apply LQR, 

835
00:46:33,099 --> 00:46:36,789
one of the parameters you have to choose would be the point around which to 

836
00:46:36,789 --> 00:46:38,960
linearize your nonlinear model. 

837
00:46:38,960 --> 00:46:44,169
So if you expect your inverted pendulum system 

838
00:46:44,169 --> 00:46:48,509
to spend most of its time in the vicinity of this state, 

839
00:46:48,509 --> 00:46:49,409
then it'd 

840
00:46:49,409 --> 00:46:53,459
be reasonable to linearize around this state 

841
00:46:53,459 --> 00:46:54,830
because that means that 

842
00:46:54,830 --> 00:46:57,319
the linear approximation would be a good approximation, 

843
00:46:57,319 --> 00:47:02,489
usually, for the states that you expect [inaudible] to spend most of this time. 

844
00:47:02,489 --> 00:47:06,179
If conversely, you expect the system to spend most of its time 

845
00:47:06,179 --> 00:47:07,739
at states far to the left, then this would be 

846
00:47:07,739 --> 00:47:10,039
a terrible location to linearize. 

847
00:47:10,039 --> 00:47:14,269
So one rule of thumb is to choose the position to linearize according to where 

848
00:47:14,269 --> 00:47:17,419
you expect the system to spend most of its time 

849
00:47:17,419 --> 00:47:20,369
so that the linear approximation will tend to be 

850
00:47:20,369 --> 00:47:22,929
an accurate approximation in the 

851
00:47:22,929 --> 00:47:27,269
vicinity of the states [inaudible]. 

852
00:47:27,269 --> 00:47:28,770
Just to be fair, it is about 

853
00:47:28,770 --> 00:47:30,109
choosing the point, S bar, 

854
00:47:30,109 --> 00:47:31,639
A bar, 

855
00:47:31,639 --> 00:47:34,169
that we'll use 

856
00:47:34,169 --> 00:47:36,949
to come up with a linear function 

857
00:47:36,949 --> 00:47:40,599
that we'll pretend it's a good approximation to my original 

858
00:47:40,599 --> 00:47:47,079
nonlinear function, 

859
00:47:47,079 --> 00:47:50,269


860
00:47:50,269 --> 00:47:55,269


861
00:47:55,269 --> 00:48:00,589
F. So for an example like the inverted pendulum problem, this problem, if 

862
00:48:00,589 --> 00:48:03,740
you expect to do pretty well in this problem, 

863
00:48:03,740 --> 00:48:08,089
then you would expect the state 

864
00:48:08,089 --> 00:48:12,680
to often be near the zero state. 

865
00:48:12,680 --> 00:48:17,890
If S equals zero corresponds to X being the center of the railway track that the inverted 

866
00:48:17,890 --> 00:48:20,249
pendulum lives on. You 

867
00:48:20,249 --> 00:48:23,000
expect to do fairly well. You expect the pole to 

868
00:48:23,000 --> 00:48:24,689
mostly be upright 

869
00:48:24,689 --> 00:48:28,559
[inaudible] upright at zero degrees or 90 degrees, I guess. So you choose 

870
00:48:28,559 --> 00:48:32,910
whatever state corresponds to having the pole 

871
00:48:32,910 --> 00:48:33,959
upright. The zero 

872
00:48:33,959 --> 00:48:37,299
velocity [inaudible], near zero velocity in the middle of the track. 

873
00:48:37,299 --> 00:48:40,249
So you usually choose that as a state 

874
00:48:40,249 --> 00:48:42,579
to linearize 

875
00:48:42,579 --> 00:48:45,680
your inverted pendulum dynamics around. That's 

876
00:48:45,680 --> 00:48:52,680
a region where you might want your approximation to be good. 

877
00:48:55,859 --> 00:48:59,759
So I wrote this down. To come back to this formula, I wrote this down for the special 

878
00:48:59,759 --> 00:49:02,229
case of a one D state variable. If there 

879
00:49:02,229 --> 00:49:08,039
are no actions. The general formula for the linearization approximation is ST 

880
00:49:08,039 --> 00:49:08,949


881
00:49:08,949 --> 00:49:11,869
plus one were approximate as F 

882
00:49:11,869 --> 00:49:14,559
of S bar T. 

883
00:49:14,559 --> 00:49:17,119
A bar T 

884
00:49:17,119 --> 00:49:24,119
plus 

885
00:49:33,469 --> 00:49:40,469


886
00:49:40,700 --> 00:49:47,700


887
00:49:50,459 --> 00:49:54,229
- Okay? 

888
00:49:54,229 --> 00:49:55,240
Where 

889
00:49:55,240 --> 00:49:59,669
these upside down triangles are an unusual 

890
00:49:59,669 --> 00:50:02,739
symbol for taking the 

891
00:50:02,739 --> 00:50:07,289
derivative of F with respect to [inaudible] vector value, second argument. So 

892
00:50:07,289 --> 00:50:12,169
by choosing an appropriate state as bar T, A bar T to linearize 

893
00:50:12,169 --> 00:50:12,950
around, 

894
00:50:12,950 --> 00:50:14,009
you'd 

895
00:50:14,009 --> 00:50:17,920
now express ST plus one 

896
00:50:17,920 --> 00:50:22,039
as a linear function of the current state and 

897
00:50:22,039 --> 00:50:23,279
the current 

898
00:50:23,279 --> 00:50:24,279
action, AT. Again, 

899
00:50:24,279 --> 00:50:29,420
these things, S bar T, is a constant that you choose ahead of time. 

900
00:50:29,420 --> 00:50:32,409
Same for A bar 

901
00:50:32,409 --> 00:50:33,989


902
00:50:33,989 --> 00:50:34,959


903
00:50:34,959 --> 00:50:40,109
T. Lastly, having linearized this thing, you can then convert it 

904
00:50:40,109 --> 00:50:43,260
to matrixes 

905
00:50:43,260 --> 00:50:46,509
like 

906
00:50:46,509 --> 00:50:48,849
that. 

907
00:50:48,849 --> 00:50:55,460
So the ST plus one is now a linear function of ST and 

908
00:50:55,460 --> 00:50:56,789
AT. 

909
00:50:56,789 --> 00:51:03,789
Questions about this? 

910
00:51:07,119 --> 00:51:10,759


911
00:51:10,759 --> 00:51:16,019
So just one tiny detail, and it's really not a huge deal, is that this thing 

912
00:51:16,019 --> 00:51:20,179
below is technically an [inaudible] function. 

913
00:51:20,179 --> 00:51:23,140
There might actually be an extra constant there, but 

914
00:51:23,140 --> 00:51:27,199
this is not a difficult generalization of a linear dynamical system definition. 

915
00:51:27,199 --> 00:51:28,130


916
00:51:28,130 --> 00:51:32,320
One way to deal with that constant is actually to do something like take your 

917
00:51:32,320 --> 00:51:36,249
definition for this state, let's say XX dot theta theta dot. 

918
00:51:36,249 --> 00:51:38,859
You can then augment your state vector to 

919
00:51:38,859 --> 00:51:40,959
have an extra interceptor, one. 

920
00:51:40,959 --> 00:51:46,069
With the interceptor one and working out the A matrix, you can then 

921
00:51:46,069 --> 00:51:49,170
take care of the extra constant, C, as well. So you can 

922
00:51:49,170 --> 00:51:50,679
deal with this thing being - 

923
00:51:50,679 --> 00:51:54,809
technically it's an affine function because of this extra offset, rather than a linear 

924
00:51:54,809 --> 00:51:55,569
function. 

925
00:51:55,569 --> 00:52:01,809
But this is just a little bit of bookkeeping [inaudible] for yourself and shouldn't 

926
00:52:01,809 --> 00:52:02,829
be 

927
00:52:02,829 --> 00:52:04,689
a huge 

928
00:52:04,689 --> 00:52:06,459


929
00:52:06,459 --> 00:52:13,459
deal. 

930
00:52:17,289 --> 00:52:18,059


931
00:52:18,059 --> 00:52:19,449
So to summarize, you see I 

932
00:52:19,449 --> 00:52:20,729
have this up, 

933
00:52:20,729 --> 00:52:24,699
you can learn a model, you can take a nonlinear model. Your nonlinear 

934
00:52:24,699 --> 00:52:28,789
model can be a physics model or a nonlinear model you learned and linearize it. 

935
00:52:28,789 --> 00:52:29,559


936
00:52:29,559 --> 00:52:32,299
Now I'll post an LQR problem 

937
00:52:32,299 --> 00:52:35,339
in which we have 

938
00:52:35,339 --> 00:52:41,689
specification of the NVP in which the states are in RN, the actions are in RD, 

939
00:52:41,689 --> 00:52:46,669
and the state has zero probabilities given by 

940
00:52:46,669 --> 00:52:50,239
the [inaudible] linear equation. SD plus one equals 

941
00:52:50,239 --> 00:52:53,849
ATST plus BTAT. Our rewards are going to be 

942
00:52:53,849 --> 00:52:56,989
these quadratic functions. 

943
00:52:56,989 --> 00:53:01,900
So the specification of the NVP means that we know the A matrixes, 

944
00:53:01,900 --> 00:53:06,089
the B matrixes, the U matrixes 

945
00:53:06,089 --> 00:53:08,930
and the V matrixes. Our goal is to come up with a policy 

946
00:53:08,930 --> 00:53:13,789
to maximize our finite 

947
00:53:13,789 --> 00:53:19,049
horizon sum of rewards. 

948
00:53:19,049 --> 00:53:20,029


949
00:53:20,029 --> 00:53:23,149
So our goal is to come up with a policy, 

950
00:53:23,149 --> 00:53:25,799
first, to maximize 

951
00:53:25,799 --> 00:53:29,059
the expected value of this 

952
00:53:29,059 --> 00:53:34,279
finite horizon sum of 

953
00:53:34,279 --> 00:53:39,009


954
00:53:39,009 --> 00:53:46,009
rewards. Okay. 

955
00:53:48,219 --> 00:53:49,390
So our 

956
00:53:49,390 --> 00:53:53,549
approach to solving this problem 

957
00:53:53,549 --> 00:53:56,899
will be exactly that 

958
00:53:56,899 --> 00:54:00,019
finite horizon dynamic programming algorithm that we worked out a 

959
00:54:00,019 --> 00:54:01,769
little earlier in this 

960
00:54:01,769 --> 00:54:03,579
lecture. In particular, 

961
00:54:03,579 --> 00:54:06,569
my strategy for finding the optimal policy 

962
00:54:06,569 --> 00:54:08,879
will be to 

963
00:54:08,879 --> 00:54:12,359
first find V star of 

964
00:54:12,359 --> 00:54:14,219
T, the capital T, 

965
00:54:14,219 --> 00:54:18,559
and then I'll apply by a recursion to find V star of T minus one, V star of T minus 

966
00:54:18,559 --> 00:54:22,989
two and so on. 

967
00:54:22,989 --> 00:54:27,619
In the dynamic programming algorithm we worked out, V star subscript T of the 

968
00:54:27,619 --> 00:54:29,209
state ST, this is the maximum 

969
00:54:29,209 --> 00:54:31,279
[inaudible] 

970
00:54:31,279 --> 00:54:35,440
actions you might take at that time of R 

971
00:54:35,440 --> 00:54:39,299
of STAT. 

972
00:54:39,299 --> 00:54:42,800
Again, just for the sake of understanding this material, you can probably 

973
00:54:42,800 --> 00:54:47,170
pretend the rewards and the dynamics are actually stationary. I'll write out all 

974
00:54:47,170 --> 00:54:53,249
these superscripts all the time [inaudible] if you're reading this for the first time. 

975
00:54:53,249 --> 00:54:57,340
The reward is equal to max of 

976
00:54:57,340 --> 00:55:04,340
AT of minus - 

977
00:55:11,829 --> 00:55:16,099
right? I hope this isn't confusing. The superscript Ts denote transposes. 

978
00:55:16,099 --> 00:55:19,849
The lowercase Ts denote the time index capital T. 

979
00:55:19,849 --> 00:55:23,769
So that's just a definition of my next quadratic awards. So this 

980
00:55:23,769 --> 00:55:27,859
is clearly maximized as minus 

981
00:55:27,859 --> 00:55:29,100


982
00:55:29,100 --> 00:55:31,269
ST transpose 

983
00:55:31,269 --> 00:55:37,889
UTST 

984
00:55:37,889 --> 00:55:39,949
because 

985
00:55:39,949 --> 00:55:43,909
that last term is - this is greater than or equal to zero. That gives 

986
00:55:43,909 --> 00:55:47,759
me my assumption that VT is [inaudible] semi-definite. So the best action to take in 

987
00:55:47,759 --> 00:55:49,399
the last time step 

988
00:55:49,399 --> 00:55:52,639
is just the action 

989
00:55:52,639 --> 00:55:53,769
zero. So 

990
00:55:53,769 --> 00:56:00,769
pi star subscript T of ST is equal to the 

991
00:56:02,320 --> 00:56:04,369
[inaudible] of actions of 

992
00:56:04,369 --> 00:56:10,969
that same thing. It's just 

993
00:56:10,969 --> 00:56:12,469
zero. It's by 

994
00:56:12,469 --> 00:56:15,609
choosing the zero action, 

995
00:56:15,609 --> 00:56:18,909
AT transpose VTAT becomes zero, and that's 

996
00:56:18,909 --> 00:56:22,829
how this reward is 

997
00:56:22,829 --> 00:56:25,999


998
00:56:25,999 --> 00:56:31,150


999
00:56:31,150 --> 00:56:38,150
maximized. 

1000
00:56:50,259 --> 00:56:57,169
Any questions, or is something illegible? Okay. 

1001
00:56:57,169 --> 00:57:00,390


1002
00:57:00,390 --> 00:57:04,330
So now let's do the dynamic programming step where 

1003
00:57:04,330 --> 00:57:09,679
my goal is given 

1004
00:57:09,679 --> 00:57:11,739
VT plus one, 

1005
00:57:11,739 --> 00:57:15,799
I want to compute VT. Given V star T plus one, I want to compute 

1006
00:57:15,799 --> 00:57:20,410
V star of T. So this is the dynamic programming step. 

1007
00:57:20,410 --> 00:57:21,629


1008
00:57:21,629 --> 00:57:25,239
So the 

1009
00:57:25,239 --> 00:57:29,499
DP steps I wrote down previously was this. So for the finite state case, I 

1010
00:57:29,499 --> 00:57:36,499
wrote down the following. 

1011
00:57:49,700 --> 00:57:52,279


1012
00:57:52,279 --> 00:57:59,279


1013
00:58:32,289 --> 00:58:35,769
So this is exactly the equation I wrote down 

1014
00:58:35,769 --> 00:58:37,499
previously, 

1015
00:58:37,499 --> 00:58:41,489
and this is what I wrote down for finite states, where you have these discreet state transition 

1016
00:58:41,489 --> 00:58:43,679
probabilities, and we can sum over 

1017
00:58:43,679 --> 00:58:48,039
this discreet set of states. Now we're going to continue as an infinite state 

1018
00:58:48,039 --> 00:58:51,850
again, so this sum over state should actually become an integral. I'm going to 

1019
00:58:51,850 --> 00:58:55,940
actually skip the integral step. We'll just go ahead and write this last term here as an 

1020
00:58:55,940 --> 00:58:56,829
expectation. 

1021
00:58:56,829 --> 00:59:03,829
So this is going to be max over actions AT 

1022
00:59:04,309 --> 00:59:06,879
plus - and then this becomes and expectation 

1023
00:59:06,879 --> 00:59:10,089
over the random mixed state, ST plus one, 

1024
00:59:10,089 --> 00:59:13,500
[inaudible] from state transition probabilities given by P of STAT of V star T 

1025
00:59:13,500 --> 00:59:16,429
plus one, ST 

1026
00:59:16,429 --> 00:59:19,080
plus one. So this 

1027
00:59:19,080 --> 00:59:21,309
is 

1028
00:59:21,309 --> 00:59:23,719
the same equation written down 

1029
00:59:23,719 --> 00:59:25,699
as an expectation. 

1030
00:59:25,699 --> 00:59:26,879


1031
00:59:26,879 --> 00:59:31,519
So what I need to do is given a representation of V star T plus one, I 

1032
00:59:31,519 --> 00:59:32,789
need to find V 

1033
00:59:32,789 --> 00:59:36,430
star of T. 

1034
00:59:36,430 --> 00:59:40,819
So it turns out that LQR has the following useful property. It turns out that each of 

1035
00:59:40,819 --> 00:59:42,839
these value functions 

1036
00:59:42,839 --> 00:59:46,129
can be represented as a quadratic function. 

1037
00:59:46,129 --> 00:59:48,659
So concretely, 

1038
00:59:48,659 --> 00:59:53,259
let's suppose that V 

1039
00:59:53,259 --> 01:00:00,259
star T plus one - suppose that this can be expressed as a 

1040
01:00:00,429 --> 01:00:07,429
quadratic 

1041
01:00:09,369 --> 01:00:10,840
function, written like so, 

1042
01:00:10,840 --> 01:00:11,859
where 

1043
01:00:11,859 --> 01:00:14,880
the matrix phi T plus one 

1044
01:00:14,880 --> 01:00:17,049
is an N by N matrix, and 

1045
01:00:17,049 --> 01:00:21,339
psi T plus one 

1046
01:00:21,339 --> 01:00:23,599
is just a real number. 

1047
01:00:23,599 --> 01:00:29,160
So in other words, suppose V star T plus one is just a quadratic function 

1048
01:00:29,160 --> 01:00:29,969


1049
01:00:29,969 --> 01:00:33,809
of the state ST 

1050
01:00:33,809 --> 01:00:36,679
plus one. 

1051
01:00:36,679 --> 01:00:40,170
We can then show that 

1052
01:00:40,170 --> 01:00:44,169
when you do one dynamic programming step - when you 

1053
01:00:44,169 --> 01:00:46,089


1054
01:00:46,089 --> 01:00:49,569
plug this definition of V star T plus one into your dynamic 

1055
01:00:49,569 --> 01:00:51,879
programming step in the equation I had just now, 

1056
01:00:51,879 --> 01:00:54,559
you can show that you would get 

1057
01:00:54,559 --> 01:00:57,899
that V star T as well, will 

1058
01:00:57,899 --> 01:01:01,749
also be a quadratic function of the 

1059
01:01:01,749 --> 01:01:06,279


1060
01:01:06,279 --> 01:01:12,639
same form. [Inaudible] here, right? The sum-appropriate matrix, phi T 

1061
01:01:12,639 --> 01:01:19,609
and sum appropriate real number, psi 

1062
01:01:19,609 --> 01:01:21,839
of 

1063
01:01:21,839 --> 01:01:27,819
T. So what you can do is stall off the recursion with - well, does that make 

1064
01:01:27,819 --> 01:01:32,099
sense? 

1065
01:01:32,099 --> 01:01:35,009
So what you can do is 

1066
01:01:35,009 --> 01:01:36,769
stall off the recursion 

1067
01:01:36,769 --> 01:01:41,139
as follows. So previously, we worked out that V star capital T, 

1068
01:01:41,139 --> 01:01:44,219
we said that this is minus 

1069
01:01:44,219 --> 01:01:47,539
ST 

1070
01:01:47,539 --> 01:01:50,009
transpose UTST. So we 

1071
01:01:50,009 --> 01:01:50,760
have that phi 

1072
01:01:50,760 --> 01:01:53,160
of capital T is 

1073
01:01:53,160 --> 01:01:56,329
equal to minus UT, 

1074
01:01:56,329 --> 01:01:59,309
and psi of capital T is equal to zero. 

1075
01:01:59,309 --> 01:02:00,700
Now V star T of ST 

1076
01:02:00,700 --> 01:02:07,700
is equal to ST transpose phi of T, ST plus psi of T. 

1077
01:02:08,419 --> 01:02:09,050
So 

1078
01:02:09,050 --> 01:02:13,450
you can start out the recursion this way with phi of T equals minus UT and psi of T 

1079
01:02:13,450 --> 01:02:14,939
equals zero. 

1080
01:02:14,939 --> 01:02:16,849
Then work out 

1081
01:02:16,849 --> 01:02:20,219
what the recursion is. I won't 

1082
01:02:20,219 --> 01:02:22,819


1083
01:02:22,819 --> 01:02:27,119
actually do the full [inaudible]. This may be algebra, and 

1084
01:02:27,119 --> 01:02:28,809


1085
01:02:28,809 --> 01:02:32,939
you've actually done this sort of Gaussian expectation 

1086
01:02:32,939 --> 01:02:39,559
math a lot in your homework by now. 

1087
01:02:39,559 --> 01:02:44,009
So I won't do the full derivation. I'll just outline the 

1088
01:02:44,009 --> 01:02:45,039


1089
01:02:45,039 --> 01:02:46,869
one-ish G step. So 

1090
01:02:46,869 --> 01:02:51,029
in dynamic programming step, V star ST is 

1091
01:02:51,029 --> 01:02:53,239
equal to 

1092
01:02:53,239 --> 01:02:55,679
max over actions AT of 

1093
01:02:55,679 --> 01:03:02,679
the median reward. 

1094
01:03:03,619 --> 01:03:05,809
So this was R of 

1095
01:03:05,809 --> 01:03:09,479
SA from my equation in the dynamic programming step. 

1096
01:03:09,479 --> 01:03:12,099
Then plus an expected value 

1097
01:03:12,099 --> 01:03:18,279
over the random mixed state, ST plus one, drawn from the 

1098
01:03:18,279 --> 01:03:22,079
Gaussian distribution would mean 

1099
01:03:22,079 --> 01:03:23,890
ATST plus 

1100
01:03:23,890 --> 01:03:28,169
BTAT and covariant sigma 

1101
01:03:28,169 --> 01:03:31,889
W. So what this is, this is really my 

1102
01:03:31,889 --> 01:03:35,029
specification for P of STAT. This is my 

1103
01:03:35,029 --> 01:03:38,749
state transition distribution in the LQR setting. This is my 

1104
01:03:38,749 --> 01:03:40,809
state transition distribution 

1105
01:03:40,809 --> 01:03:43,369
[inaudible] take action AT 

1106
01:03:43,369 --> 01:03:44,670
in 

1107
01:03:44,670 --> 01:03:46,179
the state 

1108
01:03:46,179 --> 01:03:50,319
ST. Then my next state is - distributed Gaussian would mean ATST plus BTAT and 

1109
01:03:50,319 --> 01:03:50,979
covariant 

1110
01:03:50,979 --> 01:03:54,430
sigma W. Then 

1111
01:03:54,430 --> 01:04:01,430
of the this 

1112
01:04:10,669 --> 01:04:13,719
state. This, of course, is just A star 

1113
01:04:13,719 --> 01:04:17,109
T plus one 

1114
01:04:17,109 --> 01:04:23,549
of ST plus one. I hope 

1115
01:04:23,549 --> 01:04:27,459
this makes sense. This is just taking that equation I had previously in the 

1116
01:04:27,459 --> 01:04:30,309
dynamic programming step. So the V star of T, ST 

1117
01:04:30,309 --> 01:04:33,279
equals max over actions 

1118
01:04:33,279 --> 01:04:35,679
of the immediate rewards 

1119
01:04:35,679 --> 01:04:38,739
plus an expected value over the mixed state of 

1120
01:04:38,739 --> 01:04:40,639
V star of the mixed state with 

1121
01:04:40,639 --> 01:04:43,249
the clock advanced by one. 

1122
01:04:43,249 --> 01:04:46,179
So I've just plugged in all the definitions as a reward of the 

1123
01:04:46,179 --> 01:04:47,889
state [inaudible] distribution 

1124
01:04:47,889 --> 01:04:54,889
and of the value function. 

1125
01:04:55,909 --> 01:05:01,729
Actually, could you raise your hand if this makes sense? Cool. So if you write 

1126
01:05:01,729 --> 01:05:03,429
this out 

1127
01:05:03,429 --> 01:05:07,630
and you expand the expectation - I know you've done this many times, so I 

1128
01:05:07,630 --> 01:05:08,979
won't do it - 

1129
01:05:08,979 --> 01:05:11,470
this whole thing on the right-hand side 

1130
01:05:11,470 --> 01:05:15,139
simplifies to a big quadratic function of the action, AT. 

1131
01:05:15,139 --> 01:05:17,839
So this whole 

1132
01:05:17,839 --> 01:05:22,839
thing simplifies to a big 

1133
01:05:22,839 --> 01:05:26,399
quadratic function 

1134
01:05:26,399 --> 01:05:32,449


1135
01:05:32,449 --> 01:05:35,909
of the action 

1136
01:05:35,909 --> 01:05:37,859
AT. 

1137
01:05:37,859 --> 01:05:42,489
We want to maximize this with respect to the actions AT. So to 

1138
01:05:42,489 --> 01:05:44,339
maximize a big quadratic function, you 

1139
01:05:44,339 --> 01:05:48,150
just take the derivatives of the functions with respect to the action AT, set the derivative equal 

1140
01:05:48,150 --> 01:05:52,039
to zero, and then you've maximized the right-hand side, with 

1141
01:05:52,039 --> 01:05:55,559
respect to the action, 

1142
01:05:55,559 --> 01:05:59,709
AT. It turns out - I'm just going to write this expression down for completeness. You 

1143
01:05:59,709 --> 01:06:02,209
can derive it yourself at any time. It turns 

1144
01:06:02,209 --> 01:06:05,239
out if you actually maximize that thing on the right- hand side as a 

1145
01:06:05,239 --> 01:06:07,299
function of the actions, AT, 

1146
01:06:07,299 --> 01:06:12,469
you find that [inaudible] AT is going to be 

1147
01:06:12,469 --> 01:06:14,429
that 

1148
01:06:14,429 --> 01:06:21,429
times 

1149
01:06:23,319 --> 01:06:25,930
ST. 

1150
01:06:25,930 --> 01:06:29,389
Don't 

1151
01:06:29,389 --> 01:06:34,200
worry about this expression. You can get it from [inaudible] and derive it yourself. 

1152
01:06:34,200 --> 01:06:37,839
But the key thing to note is that the optimal action, AT, is going to be some big 

1153
01:06:37,839 --> 01:06:38,949
matrix. 

1154
01:06:38,949 --> 01:06:40,409
We're going to call this thing 

1155
01:06:40,409 --> 01:06:43,299
LT 

1156
01:06:43,299 --> 01:06:47,259
times 

1157
01:06:47,259 --> 01:06:49,009


1158
01:06:49,009 --> 01:06:51,459
ST. In other words, the optimal action to take in this 

1159
01:06:51,459 --> 01:06:53,329
given state is going to be 

1160
01:06:53,329 --> 01:06:55,249
some linear function 

1161
01:06:55,249 --> 01:06:59,289
of the state, ST. So 

1162
01:06:59,289 --> 01:07:02,819
having done dynamic programming, you remember also when we worked out the 

1163
01:07:02,819 --> 01:07:04,979
dynamic programming algorithm for finite 

1164
01:07:04,979 --> 01:07:06,279
horizon NVPs, 

1165
01:07:06,279 --> 01:07:07,999
we said that 

1166
01:07:07,999 --> 01:07:12,109
the way you compute the optimal policy, pi star of T 

1167
01:07:12,109 --> 01:07:13,699
of ST. 

1168
01:07:13,699 --> 01:07:17,819
This is always the [inaudible] of the same thing. [Inaudible] 

1169
01:07:17,819 --> 01:07:22,799
of actions AT of the same thing. 

1170
01:07:22,799 --> 01:07:25,139
STAT plus 

1171
01:07:25,139 --> 01:07:32,139
your expected value of [inaudible] PSTAT, P-star, T 

1172
01:07:35,859 --> 01:07:39,179
plus one, ST plus one. This thing on the right-hand side is always the same thing as 

1173
01:07:39,179 --> 01:07:41,839
the thing we maximized 

1174
01:07:41,839 --> 01:07:42,969


1175
01:07:42,969 --> 01:07:46,479
[inaudible]. So what this means is that when I said this a value of A to the maximize 

1176
01:07:46,479 --> 01:07:50,379
of this. So what this means is that the optimal action to take from the state of ST 

1177
01:07:50,379 --> 01:07:53,079
is actually equal to 

1178
01:07:53,079 --> 01:07:55,869
LT 

1179
01:07:55,869 --> 01:08:01,689
times ST. 

1180
01:08:01,689 --> 01:08:03,859
What 

1181
01:08:03,859 --> 01:08:05,919
was shown is that 

1182
01:08:05,919 --> 01:08:08,129
when you're in some state, ST, the 

1183
01:08:08,129 --> 01:08:13,599
optimal action for that state is going to be some matrix, LT, which can compute, 

1184
01:08:13,599 --> 01:08:14,229
times 

1185
01:08:14,229 --> 01:08:16,859
the state, 

1186
01:08:16,859 --> 01:08:18,079
ST. 

1187
01:08:18,079 --> 01:08:23,549
In other words, the optimal action is actually a linear function of the state. 

1188
01:08:23,549 --> 01:08:26,839
I'm just going to point out, this is not a function of approximation here, right. 

1189
01:08:26,839 --> 01:08:31,739
What we did not do, we did not say, 

1190
01:08:31,739 --> 01:08:35,389
let's find the optimal linear policy. We didn't say, let's look at the optimal 

1191
01:08:35,389 --> 01:08:38,359
policy, and then we'll fit this straight line to the optimal policy. This is not 

1192
01:08:38,359 --> 01:08:42,089
about approximating the optimal policy with a straight line. 

1193
01:08:42,089 --> 01:08:46,549
This derivation is saying that the optimal policy is a straight line. The 

1194
01:08:46,549 --> 01:08:53,549
optimal action is a linear function of the current 

1195
01:08:54,239 --> 01:08:57,359
state. 

1196
01:08:57,359 --> 01:09:01,319
Moreover, when you've worked out this is a value for AT 

1197
01:09:01,319 --> 01:09:02,310


1198
01:09:02,310 --> 01:09:05,370
that maximizes this thing on 

1199
01:09:05,370 --> 01:09:10,009
the right-hand side. So you take this and plug it back in to do the dynamic programming recursion. 

1200
01:09:10,009 --> 01:09:17,009
What you find is that - 

1201
01:09:19,719 --> 01:09:24,189
so you take AT and plug it back in to do the maximization. It will 

1202
01:09:24,189 --> 01:09:26,319
actually get 

1203
01:09:26,319 --> 01:09:28,279
you this formula, 

1204
01:09:28,279 --> 01:09:30,560
so V star 

1205
01:09:30,560 --> 01:09:35,049
TST. So you find that 

1206
01:09:35,049 --> 01:09:38,770
it will indeed be a quadratic function like this 

1207
01:09:38,770 --> 01:09:41,799
of the following form where 

1208
01:09:41,799 --> 01:09:45,540
- and I just write out the equations for the sake of 

1209
01:09:45,540 --> 01:09:52,540
completeness. Don't worry too much about their forms. You can 

1210
01:09:57,369 --> 01:10:03,630
derive this yourself. 

1211
01:10:03,630 --> 01:10:10,630


1212
01:10:15,929 --> 01:10:22,929


1213
01:10:25,039 --> 01:10:28,789


1214
01:10:28,789 --> 01:10:35,789


1215
01:10:48,130 --> 01:10:52,190
So just to summarize, don't worry too much about the forms of these 

1216
01:10:52,190 --> 01:10:54,429
equations. 

1217
01:10:54,429 --> 01:10:57,899
What we've done is written down the recursion to the expressor 

1218
01:10:57,899 --> 01:10:59,820
phi T and psi T 

1219
01:10:59,820 --> 01:11:02,329
as a function of phi T plus one 

1220
01:11:02,329 --> 01:11:04,129
and psi T plus one. 

1221
01:11:04,129 --> 01:11:05,090
So 

1222
01:11:05,090 --> 01:11:08,340
this allows you to compute the optimal value function 

1223
01:11:08,340 --> 01:11:13,579
for when the clock is at time lowercase T, as a function of the optimal value 

1224
01:11:13,579 --> 01:11:17,699
function for when the clock is at time T plus one. 

1225
01:11:17,699 --> 01:11:20,019


1226
01:11:20,019 --> 01:11:25,480
So 

1227
01:11:25,480 --> 01:11:28,260
to summarize, 

1228
01:11:28,260 --> 01:11:32,659
GSELQG here's a finite horizon of 

1229
01:11:32,659 --> 01:11:35,920
- actually, just to give this equation a name as well. This 

1230
01:11:35,920 --> 01:11:40,370
recursion, in terms of the phi Ts, this is called the 

1231
01:11:40,370 --> 01:11:44,909
discrete 

1232
01:11:44,909 --> 01:11:51,909
time Bacardi equation. [Inaudible] 

1233
01:11:54,099 --> 01:11:57,609
recursion that gives you phi 

1234
01:11:57,609 --> 01:12:02,420
T in terms of phi T plus one. 

1235
01:12:02,420 --> 01:12:05,130
So 

1236
01:12:05,130 --> 01:12:10,969
to summarize, 

1237
01:12:10,969 --> 01:12:12,269


1238
01:12:12,269 --> 01:12:17,499
our algorithm for finding the exact solution to finite horizon LQR 

1239
01:12:17,499 --> 01:12:22,949
problems is as follows. We initialize phi T 

1240
01:12:22,949 --> 01:12:24,340
to 

1241
01:12:24,340 --> 01:12:26,840
be equal to minus 

1242
01:12:26,840 --> 01:12:29,020


1243
01:12:29,020 --> 01:12:30,449


1244
01:12:30,449 --> 01:12:32,699
UT 

1245
01:12:32,699 --> 01:12:36,559
and psi T to be equal to zero. 

1246
01:12:36,559 --> 01:12:38,709
Then 

1247
01:12:38,709 --> 01:12:42,679
recursively, 

1248
01:12:42,679 --> 01:12:45,080
calculate 

1249
01:12:45,080 --> 01:12:47,179
phi T 

1250
01:12:47,179 --> 01:12:49,260
and psi 

1251
01:12:49,260 --> 01:12:50,309
T 

1252
01:12:50,309 --> 01:12:53,820
as a function of phi T plus one 

1253
01:12:53,820 --> 01:12:56,360
and psi T plus 

1254
01:12:56,360 --> 01:13:00,429
one with the discrete time 

1255
01:13:00,429 --> 01:13:01,769
- 

1256
01:13:01,769 --> 01:13:06,030
actually, excuse me. So 

1257
01:13:06,030 --> 01:13:08,590
recursively calculate phi T and psi 

1258
01:13:08,590 --> 01:13:11,349
T as a function of phi T plus one and psi T plus one, as I showed, 

1259
01:13:11,349 --> 01:13:14,730
using the discrete time Bacardi equation. 

1260
01:13:14,730 --> 01:13:17,940
So you do this for 

1261
01:13:17,940 --> 01:13:19,199
T equals 

1262
01:13:19,199 --> 01:13:22,429
T minus one, T minus two 

1263
01:13:22,429 --> 01:13:28,579
and so on, down to time zero. Then you 

1264
01:13:28,579 --> 01:13:32,519
compute 

1265
01:13:32,519 --> 01:13:37,429
LT as a function of - actually, is it phi T or phi T 

1266
01:13:37,429 --> 01:13:39,599
plus one? Phi T 

1267
01:13:39,599 --> 01:13:41,929
plus one, I think. 

1268
01:13:41,929 --> 01:13:46,260
As a function of phi T plus one and psi T plus one. 

1269
01:13:46,260 --> 01:13:50,199
This is actually a function of only phi T plus one. You don't really need psi T 

1270
01:13:50,199 --> 01:13:53,110
plus one. 

1271
01:13:53,110 --> 01:13:59,819
Now you have your optimal policy. 

1272
01:13:59,819 --> 01:14:03,099
So having computed the LTs, 

1273
01:14:03,099 --> 01:14:05,939
you now have the 

1274
01:14:05,939 --> 01:14:08,149
optimal action to take in the state 

1275
01:14:08,149 --> 01:14:15,149
ST, just given by 

1276
01:14:24,530 --> 01:14:29,419
this 

1277
01:14:29,419 --> 01:14:34,979
linear equation. 

1278
01:14:34,979 --> 01:14:38,499
How much time do I have left? Okay. Let me just say one last thing about this before I 

1279
01:14:38,499 --> 01:14:44,479
close. 

1280
01:14:44,479 --> 01:14:48,449
Maybe I'll do it next week. I think I'll do it next session 

1281
01:14:48,449 --> 01:14:52,179
instead. So it actually turns out there's one cool property about this that's kind of that is 

1282
01:14:52,179 --> 01:14:54,750
kind of subtle, but you'll find it out in the next lecture. 

1283
01:14:54,750 --> 01:15:01,750
Are there question about this before we close for today, then? So 

1284
01:15:04,829 --> 01:15:07,309
the very cool thing about 

1285
01:15:07,309 --> 01:15:12,579
the solution of discrete time LQR problems - finite horizon LQR 

1286
01:15:12,579 --> 01:15:17,280
problems is that this is a problem in an infinite state, with a continuous state. 

1287
01:15:17,280 --> 01:15:21,059
But nonetheless, under the assumptions we made, you can prove that the 

1288
01:15:21,059 --> 01:15:24,749
value function is a quadratic function of the state. 

1289
01:15:24,749 --> 01:15:29,089
Therefore, just by computing these matrixes phi T and the real numbers 

1290
01:15:29,089 --> 01:15:33,010
psi T, you can actually exactly represent the value function, even for 

1291
01:15:33,010 --> 01:15:37,489
these infinitely large state spaces, even for continuous state spaces. 

1292
01:15:37,489 --> 01:15:42,599
So the computation of these algorithms scales only like the cube, 

1293
01:15:42,599 --> 01:15:46,499
scales only as a polynomial in terms of the number of state variables 

1294
01:15:46,499 --> 01:15:49,959
whereas in [inaudible] dimensionality problems, with [inaudible], we 

1295
01:15:49,959 --> 01:15:53,099
had algorithms of a scale exponentially dimensional problem. 

1296
01:15:53,099 --> 01:15:55,420
Whereas LQR scales only are like the cube 

1297
01:15:55,420 --> 01:16:00,699
of the dimension of the problem. So this easily 

1298
01:16:00,699 --> 01:16:03,960
applies to problems with even very large state spaces. So we actually often apply 

1299
01:16:03,960 --> 01:16:06,049
variations of this algorithm 

1300
01:16:06,049 --> 01:16:09,159
to some subset, to some particular subset for the things we do on our 

1301
01:16:09,159 --> 01:16:13,010
helicopter, which has high dimensional state spaces, with twelve or higher 

1302
01:16:13,010 --> 01:16:13,969
dimensions. 

1303
01:16:13,969 --> 01:16:15,989
This has worked very well for that. 

1304
01:16:15,989 --> 01:16:17,229
So 

1305
01:16:17,229 --> 01:16:21,280
it turns out there are even more things you can do with this, and I'll continue with 

1306
01:16:21,280 --> 01:16:23,519
that in the next lecture. So let's close for today, then.