This presentation is delivered by the Stanford Center for Professional Development.

Instructor (Andrew Ng): Okay. So welcome back, and let's continue our discussion of reinforcement learning algorithms. For today, the first thing I want to do is talk a little bit about debugging reinforcement learning algorithms, and then I'll continue the technical discussion from last week on LQR, on linear-quadratic regulation. In particular, I want to tell you about an algorithm called differential dynamic programming, which I think is actually a very effective optimal control slash reinforcement learning algorithm for many problems. Then we'll talk about Kalman filters and linear-quadratic Gaussian control, LQG.

Let's start with debugging RL algorithms. And can I switch to the laptop display, please?

What I'm about to do here is a specific example that I had done earlier in the quarter, but that I promised to do again. So remember, roughly halfway through the quarter I gave a lecture on debugging learning algorithms, right? This idea that very often you run a learning algorithm and maybe it does roughly what you want, and maybe it doesn't do as well as you're hoping. Then what do you do next? And I talked about this idea that some of the really, really good people in machine learning, the people that really understand learning algorithms and are really good at getting these things to work: very often what they're really good at is figuring out why a learning algorithm is working or is not working. And that prevents them from spending six months doing something that someone else could just look at and say, gee, there was no point collecting more training data, because your learning algorithm had high bias rather than high variance; so those six months you spent collecting more training data, I could have told you six months ago it was a waste of time. Right? These are the sorts of things that the people who are really good at machine learning, who really get machine learning, are very good at.

Well, here are just a few of my slides.
I won't actually talk through these; they're exactly the same slides you saw last time. Actually, I'll just skip ahead, I guess. So last time you saw this discussion on, right, diagnostics for whether you have a bias problem or a variance problem, or, in other cases, whether your optimization algorithm is failing to converge or whether it's the wrong optimization objective, and so on. And we'll talk about that again, but the one example that I, sort of, promised to show again was actually a reinforcement learning example, and at that time we hadn't talked about reinforcement learning yet, so I promised to do exactly the same example again, all right? So let's go through the example.

The motivating example was robotic control. Let's see. Let's say you want to design a controller for this helicopter. This is a very typical way by which you might apply a machine learning algorithm, or several machine learning algorithms, to a control problem, right? The control problem is: you want to build a controller that makes the helicopter hover in place. So the first thing you might do is build a simulator of the helicopter. And this just means a model of the state transition probabilities P_sa of the helicopter, and you can do this in many different ways. Maybe you can try reading a helicopter textbook and building a simulator based on what's known about the aerodynamics of helicopters; it actually turns out this is very hard to do. Another thing you could do is collect data and maybe fit a linear model, or maybe a non-linear model, to what the next state is as a function of the current state and current action. Okay? So there are different ways of estimating the state transition probabilities. So now you have a simulator, and I'm showing a screenshot of the simulator we have at Stanford on the upper right there.

The second thing you might do is choose a reward function. So you might choose this sort of quadratic cost function, right?
So the reward for being at a state x is going to be minus the norm of the difference between the current state and some desired state; that's one simple example of a reward function. And this sort of quadratic reward function is what we were using in the last lecture in LQR control, linear-quadratic regulation control. And finally, right, you run a reinforcement learning algorithm in simulation, meaning that you use your model of the dynamics to try to maximize that finite horizon sum of rewards, and when you do that you get a policy out, which I'm gonna call pi subscript RL, to denote the policy output by the reinforcement learning algorithm. Okay?

Let's say you do this and the resulting controller gets much worse performance than a human pilot that you hire to fly the helicopter for you. So how do you go about figuring out what to do next? Well, there are several things you might do, right? You might try to improve the simulator: maybe you fit a new model of the helicopter dynamics because you think the dynamics are actually non-linear, or maybe you want to collect more training data, so you can get better estimates of the transition probabilities of the helicopter, or maybe you want to fiddle with the features you used to model the dynamics of your helicopter, right? Other things you might do: you might modify the reward function R if you think it's not just a quadratic function, maybe it's something else, I don't know. Or maybe you're unsatisfied with the reinforcement learning algorithm. Maybe you think the algorithm isn't quite doing the right thing, or maybe you think you need to discretize the states more finely in order to apply your reinforcement learning algorithm, or maybe you need to fiddle with the features you used in value function approximation. Okay? So these are three categories of things you might do, and, again, quite often if you chose the wrong one to work on, you can easily spend - actually, for this one I don't want to say six months - you can easily spend a year or two working on the wrong thing.
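To make that setup concrete, here is a minimal sketch, assuming a hypothetical simulator step function f(s, a), the quadratic reward just described, and a made-up horizon; none of these names come from the lecture itself:

```python
import numpy as np

def reward(s, s_desired):
    # Quadratic reward: minus the squared norm of the deviation from the desired state.
    return -float(np.sum((np.asarray(s) - np.asarray(s_desired)) ** 2))

def evaluate_policy(policy, f, s0, s_desired, horizon):
    """Roll out a policy in the simulator f and return the finite-horizon sum of rewards."""
    s, total = np.asarray(s0, float), 0.0
    for _ in range(horizon):
        a = policy(s)            # pi(s): the controller being evaluated
        total += reward(s, s_desired)
        s = f(s, a)              # simulator / model of the dynamics
    return total
```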
Hey, Dan, do me a favor: I'm, sort of, out of chalk; could you wander around and help me get some? Thanks.

So these three steps are summarized in the yellow box at the upper right of this slide. What can you do? So this is the sort of reasoning we actually go through on the helicopter project, often, to decide what to work on. So let me just step through this example fairly slowly. This is the sort of reasoning you might go through. Suppose these three assumptions hold true, right? Suppose that the helicopter simulator is accurate, so let's suppose that you built an accurate model of the dynamics. And suppose, sort of, point two on the slide, that the reinforcement learning algorithm correctly controls the helicopter in simulation, so it maximizes the expected payoff, right? And suppose that maximizing the expected payoff corresponds to good autonomous flight. If all three of these assumptions hold true, then that means you would expect the learned controller pi subscript RL to fly well on the actual helicopter. Okay?

So this is, I'm, sort of, showing you the kind of reasoning that I go through when I'm trying to come up with a set of diagnostics for this problem. And these are some of the diagnostics we actually use routinely on various robotic control problems.

So pi subscript RL, right? We said it doesn't fly well on the actual helicopter. So the first diagnostic you want to run is just to check whether it flies well in simulation, all right? If it flies well in simulation, but not in real life, then the problem is in the simulator, right? Because the simulator predicts that your controller, pi subscript RL, flies well, but it doesn't actually fly well in real life. So if this holds true, then that suggests the problem is in the simulator. Question?

Student: [Inaudible] the helicopter pilots try to fly on the simulator? Do you have the real helicopter pilots try to fly on the simulator?

Instructor (Andrew Ng): Do I have the real helicopter pilots try to fly the simulator?
Student: Real helicopter pilots.

Instructor (Andrew Ng): Oh, I see. Do we ask the helicopter pilots to fly in simulation? So, yeah, it turns out that's one of the later diagnostics; you could do that. On our project we don't do that very often. We informally ask the pilot, who's one of the best pilots in the world, Garett Oku, to look at the simulator sometimes. We don't very often ask him to fly in the simulator. Hope that answers your question. But let me actually go on and show you some of the other diagnostics that you might use. Okay?

Second, let me use pi subscript human to denote the human control policy, right? So pi subscript human is the policy that, you know, however the human flies it. So one thing to do is look at the value of pi RL compared to the value of pi subscript human. Okay? What this means, really, is: look at what the helicopter does when it's flying under the control of pi subscript RL, and look at what the helicopter does when it's flying under the human pilot's control, and then, sort of, compute the sum of rewards for the human pilot's performance and compute the sum of rewards for the learned controller's performance. You can do this on the real helicopter or in simulation, actually; but you see, does the human obtain a higher or a lower sum of rewards on average than the controller you just learned? Okay? And the way you do this is you actually go and fly the helicopter and just measure the sum of rewards on the actual sequence of states the helicopter flew through, right?

So if this condition holds true, where my mouse pointer is - excuse me - if it holds true that the value of pi subscript RL is less than the value of pi subscript human; for those of you watching online, I don't know if you can see this, but this is V^{pi_RL}(s_0) less than V^{pi_human}(s_0).
But if this holds true, then that suggests that the problem is in the reinforcement learning algorithm, because the human has found a policy that obtains a higher reward than the reinforcement learning algorithm does. So this shows that your reinforcement learning algorithm is not maximizing the sum of rewards, right?

And lastly, the last test is this: if the inequality holds in the opposite direction, so the reinforcement learning algorithm obtains a higher sum of rewards on average than the human does, but the reinforcement learning algorithm still flies worse than the human, then this suggests that the problem is in the cost function, in the reward function, because the reinforcement learning algorithm is obtaining a higher sum of rewards than the human, but it still flies worse than the human. So that suggests that maximizing the sum of rewards does not correspond to very good autonomous flight. So if this holds true, then the problem is in your reward function, or, in other words, the problem is in your optimization objective rather than in the algorithm that's trying to maximize the optimization objective, and so you might change your optimization objective; in other words, you might change the reward function. Okay?

So, of course, this is just one example of how you might debug a reinforcement learning algorithm. And this particular set of diagnostics happens to apply only because we're fortunate enough to have access to an amazingly good human pilot who can fly the helicopter for us, right? If you didn't have a very good pilot, then maybe some of these diagnostics wouldn't apply and you'd have to come up with other ones. But I wanted to go through this again as an example of the sort of diagnostics you use to debug a reinforcement learning algorithm.
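For concreteness, here is a minimal sketch of that decision logic; the function and argument names are illustrative, not from the lecture:

```python
def diagnose(flies_well_in_sim, v_rl, v_human):
    """Decision logic for the three diagnostics, given:
    flies_well_in_sim : bool, whether pi_RL flies well in the simulator
    v_rl, v_human     : average sums of rewards V(pi_RL)(s0) and V(pi_human)(s0)
    (We already know pi_RL flies poorly on the real helicopter.)
    """
    if flies_well_in_sim:
        return "Problem is likely in the simulator (it does not match the real helicopter)."
    if v_rl < v_human:
        return "Problem is likely in the RL algorithm: it is not maximizing the sum of rewards."
    return ("Problem is likely in the reward function / optimization objective: "
            "maximizing it does not correspond to good flight.")
```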
And the point of this example isn't so much that I want you to remember the specifics of these particular diagnostics; the point is really that for your own problem, be it supervised learning, unsupervised learning, reinforcement learning, whatever, you very often have to come up with your own diagnostics, your own debugging tools, to figure out why an algorithm is working and why an algorithm isn't working. And this is one example of what we actually do on the helicopter. Okay? Questions about this? Yeah, Justin?

Student: I'm just curious how you collect, like, training data? Like, in our homework the pendulum fell over lots of times, but how do you do that with an expensive helicopter?

Instructor (Andrew Ng): Yeah, I see, right. So on the helicopter, the way we collect data to learn the transition probabilities is - we've done lots of things, but to a first approximation the most basic method is to ask a human pilot to just fly the helicopter around for you, and he's not gonna crash it. So you can collect lots of data as it's being controlled by a human pilot. And, as I say, it turns out in data collection there are, sort of, a few standard ways to collect data. A helicopter can do lots of things, and you don't want to collect data representing only one small part of the flight regime. So concretely, what we often do is ask the pilot to carry out frequency sweeps, and what that means, very informally, is: you imagine them holding a control stick, right, in their hand. Frequency sweeps are a process where you start off making very slow oscillations, and then you start taking your control stick and moving it back and forth faster and faster, so you sort of sweep out a range of frequencies, ranging from very slow [inaudible] oscillations until you go faster and faster and you're, sort of, rocking the control stick back and forth. So this is - oh, good. Thank you. So that's, sort of, one standard way that we use to collect data on various robots. It may or may not apply to different robots or to different systems you work on.
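As a rough illustration only, a frequency sweep on one control axis can be generated as a chirp signal; the sample rate, duration, frequency range, and amplitude below are made-up values, not anything used on the actual helicopter:

```python
import numpy as np

def frequency_sweep(duration_s=60.0, f_start_hz=0.1, f_end_hz=2.0,
                    amplitude=0.3, sample_rate_hz=50.0):
    """Linear chirp: stick deflection that oscillates slowly at first, then faster and faster."""
    t = np.arange(0.0, duration_s, 1.0 / sample_rate_hz)
    # Instantaneous frequency ramps linearly from f_start to f_end.
    freq = f_start_hz + (f_end_hz - f_start_hz) * t / duration_s
    phase = 2.0 * np.pi * np.cumsum(freq) / sample_rate_hz
    return t, amplitude * np.sin(phase)
```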
And, as I say, in reality we do a lot of things. Sometimes we have the controllers collect data autonomously too, but those are other, more complex algorithms. Anything else?

Student: On the first point, it could be that the hardware is perhaps not mapping the actions the same way as the simulator, so [inaudible] this could be very common, that you could have [inaudible]? Is there any specific method people use to take care of that, so [inaudible]?

Instructor (Andrew Ng): Yeah, right. So you're saying, like, point one may not hold because the simulator isn't accurate; it may be that the hardware is doing something strange with the controls, that it's passing the controls through some strange transformation before they actually reach the helicopter, I don't know. Yeah, I've definitely seen that happen on some other robots before. So maybe diagnostic one here is better phrased as deciding whether the simulator matches the actual hardware, I don't know. Yeah, that's another class of problems to watch out for. If you suspect that's the case, I can't think of a good, sort of, diagnostic right now to confirm it, but, again, that would be a good thing to try to come up with a diagnostic for, to see if there might be something wrong with the hardware. And I think these are by no means the definitive diagnostics or anything like that; it's just, sort of, an example. But if you come up with other diagnostics to check that the hardware is working properly, that would be a great thing to do, too. Okay. Last couple of questions before we move on?

Student: You said the reward function was?

Instructor (Andrew Ng): Oh, in this example I was just using a quadratic cost function. On the helicopter we often use things that are much more complicated.

Student: [Inaudible] You have no way of knowing what is the, like, desired state?

Instructor (Andrew Ng): Yeah. So you can, sort of, figure it out: ask the human pilot to hover in place and guess what his desired state was. Again, these aren't exactly known, yeah.
Student: [Inaudible] physics-based learning problems. Does it actually work best to use a physics-based model, you know, just, sort of, from the physics, or do you just, sort of, do learning on [inaudible]?

Instructor (Andrew Ng): Yeah, do physics models work well, right? So the answer is, it varies a lot from problem to problem. It turns out that the aerodynamics of helicopters, I think, aren't understood well enough that you can look at the specs of a helicopter and build a very good physics simulator. So on the helicopter we actually learn the dynamics, and I don't know how to build a physics model for it. For other problems, like if you actually have an inverted pendulum problem or something, there are many other problems for which the dynamics are much better understood and for which physics simulators work perfectly fine. But it depends a lot: on some problems physics simulators work great, and on some they probably aren't great. Okay? Cool.

So, I guess, retract the chalkboard, please. All right. So how much time do I have? Twenty minutes. Okay. So let's go back to our discussion of LQR control, linear-quadratic regulation control, and then I want to take that a little bit further and tell you about, sort of, one variation on LQR called differential dynamic programming.

Just to recap, to remind you of what LQR, or linear-quadratic regulation control, is: in the last lecture I defined a finite horizon problem where your goal is to maximize, right, the finite horizon sum of rewards, and there's no discounting anymore. And then we came up with this dynamic programming algorithm, right? Okay. We came up with this dynamic programming algorithm in the last lecture, where you compute V*_T, the value function for the last time step; in other words, what's the value if you start in the state s and you just get to take one action and then the clock runs out?
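Written out, the finite-horizon dynamic programming recursion being recapped here is the standard one (R is the reward function and P_sa the state transition distribution, in the course's notation):

$$V^*_T(s) = \max_a R(s,a)$$
$$V^*_t(s) = \max_a \Big[ R(s,a) + \mathbb{E}_{s' \sim P_{sa}}\big[ V^*_{t+1}(s') \big] \Big], \qquad t = T-1, \dots, 0$$
$$\pi^*_t(s) = \arg\max_a \Big[ R(s,a) + \mathbb{E}_{s' \sim P_{sa}}\big[ V^*_{t+1}(s') \big] \Big]$$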
The aerodynamic programming algorithm that we 379 00:20:27,140 --> 00:20:29,630 repeatedly compute V star 380 00:20:29,630 --> 00:20:32,890 lowercase t in terms of V star t plus one, 381 00:20:32,890 --> 00:20:35,360 so we compute V star 382 00:20:35,360 --> 00:20:42,360 capital T and then recurs backwards, right? 383 00:20:42,730 --> 00:20:45,980 And so on, until we 384 00:20:45,980 --> 00:20:47,130 385 00:20:47,130 --> 00:20:49,010 get down to V star zero 386 00:20:49,010 --> 00:20:51,180 and then pi star was given by, 387 00:20:51,180 --> 00:20:54,179 as usual, the augmax of the thing we had 388 00:20:54,179 --> 00:20:57,500 in the definition of the value function. 389 00:20:57,500 --> 00:21:00,980 So 390 00:21:00,980 --> 00:21:07,180 last time, the specific example we saw of - or one specific example, that sort of 391 00:21:07,180 --> 00:21:12,160 define a horizon problem that we solved by DP was 392 00:21:12,160 --> 00:21:14,130 the LQR problem 393 00:21:14,130 --> 00:21:20,370 where we worked directly with continuous state and actions. 394 00:21:20,370 --> 00:21:27,370 And 395 00:21:33,380 --> 00:21:38,039 in the LQR problem we had these linear dynamics where the state ST plus one 396 00:21:38,039 --> 00:21:42,040 is a linear function of the previous state and action 397 00:21:42,040 --> 00:21:45,110 and then plus 398 00:21:45,110 --> 00:21:46,740 399 00:21:46,740 --> 00:21:50,930 this Gaussian noise WT, which is covariance sigma W. 400 00:21:50,930 --> 00:21:51,810 401 00:21:51,810 --> 00:21:55,250 And I said briefly last time, that 402 00:21:55,250 --> 00:21:58,960 one specific way to come up with these linear dynamics, right? Oh, excuse me. 403 00:21:58,960 --> 00:22:00,929 404 00:22:00,929 --> 00:22:02,350 One specific way 405 00:22:02,350 --> 00:22:03,460 to 406 00:22:03,460 --> 00:22:05,000 take a system and 407 00:22:05,000 --> 00:22:08,080 come up with a linear model for it is if you have 408 00:22:08,080 --> 00:22:15,080 some simulator, say, right? 409 00:22:19,660 --> 00:22:22,110 So in this cartoon, the 410 00:22:22,110 --> 00:22:25,980 vertical axis represents ST plus one and the horizontal axis represents STAT. So 411 00:22:25,980 --> 00:22:27,269 say 412 00:22:27,269 --> 00:22:30,360 you have a simulator, 413 00:22:30,360 --> 00:22:31,410 414 00:22:31,410 --> 00:22:35,179 right? And let's say it determines [inaudible] simulator, so we have a [inaudible] simulator 415 00:22:35,179 --> 00:22:38,559 that tells you what the next state is ST plus one is a function of 416 00:22:38,559 --> 00:22:40,100 previous state 417 00:22:40,100 --> 00:22:47,100 and action. And then you can choose 418 00:22:50,029 --> 00:22:51,860 a point 419 00:22:51,860 --> 00:22:55,980 around which to linearize this simulator, by which I mean 420 00:22:55,980 --> 00:22:59,060 that you choose a point, an approximate 421 00:22:59,060 --> 00:23:03,350 your - approximate the function F using a linear function, this tangent, to the 422 00:23:03,350 --> 00:23:06,450 function F at that point. So if you do 423 00:23:06,450 --> 00:23:10,610 that you have ST plus one equals - 424 00:23:10,610 --> 00:23:17,610 shoot. 
Sorry about writing down so many lines, but this will be the linearization approximation to the function f, where I've taken the linearization around the specific point (s̄_t, ā_t):

s_{t+1} ≈ f(s̄_t, ā_t) + ∇_s f(s̄_t, ā_t) · (s_t − s̄_t) + ∇_a f(s̄_t, ā_t) · (a_t − ā_t).

Okay? And so you can take this and just boil it down to a linear equation like s_{t+1} ≈ A_t s_t + B_t a_t, where the next state s_{t+1} is now some linear function of the previous state s_t and action a_t, and these matrices A_t and B_t will depend on your choice of location around which to linearize the function. Okay? I said last time that you'd, sort of, expect this linearization approximation to be particularly good in the vicinity of (s̄_t, ā_t), because this linear function is a pretty good approximation to f in this little neighborhood there. And - yes?

Student: [Inaudible] is there an assumption that you are looking at something [inaudible] like the helicopter; are you assuming pilot behavior is the same as [inaudible] behavior, or -

Instructor (Andrew Ng): Yeah, right. So let me not call this an assumption. Let me just say that when I use this algorithm, when I choose to linearize this way, then my approximation will be especially good in the vicinity here, and it may be less good elsewhere. And when I actually talk about DDP I'll make use of this property, which is why I'm going over it now. Okay? But, yeah, there is an intuition that you want to linearize in the vicinity of the states where you expect your system to spend the most time. Right. So, okay. So this is how you might come up with a linear model. And for LQR we also have this sort of quadratic reward function, R(s_t, a_t) = −(s_tᵀ U_t s_t + a_tᵀ V_t a_t), right? Where the matrices U_t and V_t are positive semi-definite, so the rewards are always negative; that's this minus sign.
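As a rough sketch of the linearization step just described (not from the lecture: the function names and step size are illustrative, and finite differences stand in for whatever gradients the simulator can provide):

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Linearize a deterministic simulator s' = f(s, a) around (s_bar, a_bar):
    s_{t+1} ~= f(s_bar, a_bar) + A (s - s_bar) + B (a - a_bar).
    Returns (A, B, c) with c = f(s_bar, a_bar) - A s_bar - B a_bar,
    so that s_{t+1} ~= A s + B a + c.
    """
    s_bar, a_bar = np.asarray(s_bar, float), np.asarray(a_bar, float)
    f0 = np.asarray(f(s_bar, a_bar), float)
    n, d = s_bar.size, a_bar.size
    A, B = np.zeros((f0.size, n)), np.zeros((f0.size, d))
    for i in range(n):                      # partial derivatives w.r.t. the state
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (np.asarray(f(s_bar + ds, a_bar)) - f0) / eps
    for j in range(d):                      # partial derivatives w.r.t. the action
        da = np.zeros(d); da[j] = eps
        B[:, j] = (np.asarray(f(s_bar, a_bar + da)) - f0) / eps
    c = f0 - A @ s_bar - B @ a_bar          # constant offset at the expansion point
    return A, B, c
```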
And then if you take exactly the dynamic programming algorithm that I've written down just now - let's see - it turns out that the value function at every time step will be a quadratic function of the state, and can be written as V*_t(s_t) = s_tᵀ Φ_t s_t + Ψ_t. And so you initialize the dynamic programming algorithm accordingly. I'll just write this down; there's actually just one property I want to point out later, and this equation is, well, somewhat big and hairy, but don't worry about most of its details. Shoot, there's one more equation I want to fit in. Well, okay. All right.

So it turns out the value function is a quadratic function, where V*_T is this, and so you initialize the dynamic programming step with this: this Φ_T and this Ψ_T give you V*_T, and then it recurses backwards. These two equations will give you Φ_t and Ψ_t as a function of Φ_{t+1} and Ψ_{t+1}. Okay? So it recurses backwards for this algorithm. And the last thing - let me get this on the same board; sorry about the disorganized use of the boards, I just wanted this in the same place. And, finally, the actual policy pi*_t(s_t) is given by some linear function of s_t, so the optimal action is a_t = L_t s_t, where L_t here is a matrix computed from A_t, B_t, V_t and Φ_{t+1}. Okay? And so when you do this you now have the actual policy, right? So, just concretely: you run the dynamic programming algorithm to compute Φ_t and Ψ_t for all values of t, and then you plug those in to compute the matrices L_t, and now you know the optimal action to take in any state. Okay?

So there's one very interesting property. These equations are a huge mess, but you can re-derive them yourself; don't worry about the details. But there's one specific property of this dynamic programming algorithm that's very interesting that I want to point out, which is the following.
Notice that to compute the optimal policy I need to compute these matrices L_t, and notice that L_t depends on A, it depends on B, it depends on V, and it depends on Φ, but it doesn't depend on Ψ, right? Notice further that when I carry out my dynamic programming algorithm, my recursive definition for Ψ_t - well, it depends on - oh, excuse me, it should be Ψ_{t+1}, right. Okay. In order to carry out my dynamic programming update for Ψ_t I need to know what Ψ_{t+1} is. So Ψ_t depends on these things. But in order to carry out the dynamic programming update for Φ_t, Φ_t doesn't actually depend on Ψ_{t+1}, right? So, in other words, in order to compute the Φ_t's I don't need the Ψ's. So if all I want is the Φ's, I can actually omit this step of the dynamic programming algorithm and not bother to carry out the dynamic programming updates for the Ψ's. Excuse me, I misspoke: you can forget about the Ψ's, just do the dynamic programming updates for the Φ matrices, and then, having done the DP updates for the Φ_t's, plug those into this formula to compute L_t. Okay? And so, to be slightly more efficient - efficiency isn't really the issue, but if you want, you can actually forget about the Ψ_t's entirely; you don't need to compute them at all.

Now, the other interesting property of this is that the matrix Σ_w appears only in my DP update for the Ψ_t's. It doesn't actually appear in my updates for the Φ_t's.
So you remember, well, my model was that s_{t+1} = A_t s_t + B_t a_t + w_t, where these noise terms w_t have covariance Σ_w, and so the only place the covariance of the noise terms appears is in those Ψ_t's. But I just said that you can do this entire dynamic programming algorithm without the Ψ_t's. So what this means is that you can actually find the optimal policy without knowing what the covariance of the noise terms is. Okay? So this is a very special property of LQR systems, and once you change anything, once you go away from a linear dynamical system, or once you change almost any aspect of this, say to discrete states or discrete actions or whatever, this property will no longer hold true. It is a very special property of LQR systems that the optimal policy does not actually depend on the magnitude of these noise terms; the only important property is that the noise terms have zero mean. So there's this intuition that to compute the optimal policy you can just ignore the noise terms. It's as if, as long as you know the expected value of your next state s_{t+1} - on average, s_{t+1} is A_t s_t + B_t a_t - then you can ignore the noise in your next state s_{t+1}, and the optimal policy doesn't change. Okay? We'll actually come back to this in a bit; later on, when we talk about Kalman filters, we'll actually use this property of LQR systems.

Just to point out: note that the value function does depend on the noise covariance, right? The value function here does depend on Ψ_t. So the larger the noise in your system, the worse your value function. So the value does depend on the noise, but it's the optimal policy that doesn't depend on the noise. We'll use this property later. Okay. So let's see how we're doing on time. Let's see. Right. Okay.
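To make that property concrete, here is a sketch of the backward pass. The exact equations on the board aren't in the transcript, so this is the standard textbook recursion written in the equivalent cost-minimization convention (P_t = −Φ_t and K_t = −L_t relative to the reward form above); note that Σ_w enters only the constant term, never the gains:

```python
import numpy as np

def lqr_backward_pass(A, B, U, V, Sigma_w, T):
    """Finite-horizon LQR backward recursion, in cost-minimization form.

    A, B : lists of dynamics matrices A_t, B_t for t = 0..T-1
    U    : list of state-cost matrices U_t for t = 0..T (U[T] is the terminal cost)
    V    : list of action-cost matrices V_t for t = 0..T-1
    Returns (gains, P0, psi0); the optimal action is a_t = -gains[t] @ s_t.
    """
    P_next = U[T]          # quadratic term of the terminal value function
    psi_next = 0.0         # constant term of the terminal value function
    gains = [None] * T
    for t in reversed(range(T)):
        # The gain never sees Sigma_w: the optimal policy only needs the
        # noise to be zero-mean, not its covariance.
        K = np.linalg.solve(V[t] + B[t].T @ P_next @ B[t],
                            B[t].T @ P_next @ A[t])
        P = U[t] + A[t].T @ P_next @ (A[t] - B[t] @ K)
        # Sigma_w enters only the constant term: more noise makes the value
        # worse, but does not change the policy.
        psi = psi_next + np.trace(Sigma_w @ P_next)
        gains[t], P_next, psi_next = K, P, psi
    return gains, P_next, psi_next
```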
So let's put this aside for now. What I want to do now is tell you about one specific way of applying LQR that's called differential dynamic programming. As in most of our examples, think of trying to control a system, like a helicopter or a car or even a chemical plant, with some continuous state. For the sake of thinking through this example, just imagine trying to control a helicopter. And let's say you have some simulator that tells you what the next state is as a function of the previous state and action, right? And for this, let's say the model in your simulator is non-linear but deterministic. Okay? I said just now that the noise terms don't matter very much, so let's just work with a deterministic simulator for now, but let's say f is non-linear. And let's say there's some specific trajectory that you want the helicopter to follow, all right? So I want to talk about how to apply LQR to a helicopter, or a car, or a chemical plant, where your state variables may be the amounts of different chemicals and the mixes of chemicals you have in different batches, really; but it's easiest to think about a helicopter. Let's say there's some trajectory you want the helicopter to follow.

So here's what differential dynamic programming does. The first step is to come up with what I'm gonna call some nominal trajectory, right? And so we're gonna call this s̄_0, ā_0, s̄_1, ā_1, and so on. Okay? One way to come up with this would be if you had some very bad controller: someone hacked up a controller for flying a helicopter, and it's not a good controller at all. You might then go ahead and fly the helicopter using this very bad, very sloppy controller, and you get out some sequence of states and actions, right? And I'm just gonna call this sequence of states and actions the nominal trajectory.
Then I will linearize f around this nominal trajectory. Okay? So, i.e., using that same idea: for each time t we approximate s_{t+1} by the linearization we just saw, taken around the point (s̄_t, ā_t), that is, s_{t+1} ≈ f(s̄_t, ā_t) + ∇_s f(s̄_t, ā_t) · (s_t − s̄_t) + ∇_a f(s̄_t, ā_t) · (a_t − ā_t). Okay? And then you distill this down to some s_{t+1} ≈ A_t s_t + B_t a_t. Okay? So this will actually be the first time that I'll make explicit use of the ability of LQR, or of these finite horizon problems, to handle non-stationary dynamics. In particular, for this example it will be important that A_t and B_t depend on time - oh, excuse me. Okay?

So the intuition is that even if this was a pretty sloppy controller, even if you had a pretty bad controller come up with your original nominal trajectory, you still expect, maybe, your state and action at time t to be reasonably similar to what even the sloppy controller had done, right? Say you want to fly a trajectory where you make a 90-degree turn. Maybe you have a bad controller that does a pretty sloppy job, but at any given time you're still moving somewhere around this trajectory. So this is really telling you, just very roughly, where along, say, the 90-degree turn trajectory you expect to be at any given time, and so let's linearize around that point. Okay?

Then, having found the linear model, you run LQR to get the optimal policy for this specific linear model, and now you have a better policy. And the final thing you do is - boy, I'll write this on a different board, I guess. Okay.
The last step is you use a simulator, 720 00:40:53,799 --> 00:40:55,869 a model, 721 00:40:55,869 --> 00:41:02,869 to come up with a new normal 722 00:41:04,730 --> 00:41:08,130 trajectory. 723 00:41:08,130 --> 00:41:09,939 So i.e., 724 00:41:09,939 --> 00:41:13,970 okay? 725 00:41:13,970 --> 00:41:20,970 726 00:41:32,599 --> 00:41:37,119 So now you take the controller you just learned and basically try flying your 727 00:41:37,119 --> 00:41:40,209 helicopter in your simulator. So you initialize the simulator 728 00:41:40,209 --> 00:41:42,969 to the initial state, and I'll call the S bar zero, 729 00:41:42,969 --> 00:41:44,489 and you'll run every time step. 730 00:41:44,489 --> 00:41:47,160 You choose an action which I'll call A bar T, 731 00:41:47,160 --> 00:41:50,949 using the controller pi T that you just learned using LQR. And then you simulate forward in time, right? You 732 00:41:50,949 --> 00:41:54,359 use the simulator, the function 733 00:41:54,359 --> 00:41:58,009 F, to tell you what the next state S bar T plus one will be 734 00:41:58,009 --> 00:42:03,779 when your previous state and action is bar T A bar T. 735 00:42:03,779 --> 00:42:04,569 And then 736 00:42:04,569 --> 00:42:08,910 you linearize 737 00:42:08,910 --> 00:42:15,910 around this new trajectory 738 00:42:17,809 --> 00:42:20,079 and repeat. Okay? 739 00:42:20,079 --> 00:42:23,819 So now you have a new normal trajectory and you linearize your 740 00:42:23,819 --> 00:42:27,979 simulator around this new trajectory and then you repeat the whole procedure. I guess going back to 741 00:42:27,979 --> 00:42:30,890 step two of the algorithm. 742 00:42:30,890 --> 00:42:31,500 743 00:42:31,500 --> 00:42:35,700 And this turns out to be a surprisingly effective procedure. So the 744 00:42:35,700 --> 00:42:38,759 cartoon 745 00:42:38,759 --> 00:42:39,899 of 746 00:42:39,899 --> 00:42:41,650 what this algorithm 747 00:42:41,650 --> 00:42:44,289 may do is as follows. 748 00:42:44,289 --> 00:42:46,919 Let's say you want to make a 90-degree turn on the helicopter let's see 749 00:42:46,919 --> 00:42:48,979 one, you know, a helicopter 750 00:42:48,979 --> 00:42:51,719 to follow a trajectory like that. 751 00:42:51,719 --> 00:42:53,880 752 00:42:53,880 --> 00:42:58,189 Follow up of a very bad controller, I just, you know, hack up some controller, whatever. 753 00:42:58,189 --> 00:43:04,899 Have some way to come up with an initial normal trajectory. 754 00:43:04,899 --> 00:43:10,979 Maybe your initial controller overshoots the turn, takes the turn wide, right? 755 00:43:10,979 --> 00:43:14,799 But now you can use these points to linearize 756 00:43:14,799 --> 00:43:17,969 the simulator. So linearize in a very non-linear simulator and the idea is that 757 00:43:17,969 --> 00:43:22,249 maybe this state isn't such a bad approximation. That maybe a 758 00:43:22,249 --> 00:43:25,540 linearization approximation at this sequence of 759 00:43:25,540 --> 00:43:28,509 states will actually be reasonable because your helicopter 760 00:43:28,509 --> 00:43:31,589 won't exactly be on the states, but will be close to the sequence of states of 761 00:43:31,589 --> 00:43:33,739 every time 762 00:43:33,739 --> 00:43:35,470 step. So after one duration of DDP, that's the 763 00:43:35,470 --> 00:43:38,159 764 00:43:38,159 --> 00:43:40,459 target trajectory, maybe you 765 00:43:40,459 --> 00:43:43,640 get a little bit closer 766 00:43:43,640 --> 00:43:47,329 and now you have an even better place around to linearize. 
Then after another iteration of DDP you get closer and closer to exactly the trajectory you want. Okay? So it turns out DDP is a form of local search algorithm, in which on each iteration you find a slightly better place around which to linearize, so you end up with a slightly better controller, and then you repeat. And we actually do this — this is one of the things we do on the helicopter — and it works surprisingly well on many problems. Cool.

I was actually going to show some helicopter videos, but in the interest of time let me defer that to the next lecture. I'll show you a bunch of cool helicopter things next time, but let me just check whether there are questions about this before I move on. Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): In this sample? Yes, right. So I'm going to pick some kind of horizon T, and I'm going to run through my entire trajectory in my simulator, so I end up with a new nominal trajectory to linearize around. Okay? Yeah?

Student: So does this method give you, like, a Gaussian for performing a certain action? Like you talked about the 90-degree turn thing or something.

Instructor (Andrew Ng): Right.

Student: So is this from one 90-degree turn, or can you [inaudible]?

Instructor (Andrew Ng): Yeah. So think about this as if there's a specific trajectory that you want to follow — for a car, or a helicopter, or it could be in a chemical plant. If there's some specific sequence of states you expect the system to go through over time, then at different times you want different linear approximations of your dynamics, right? So I actually have a stationary simulator.
I mean, this function f may be the same function f for all time steps, but the point of DDP is that I may want to use different linearizations for different time steps. So a lot of the inner loop of the algorithm is just coming up with better and better places around which to linearize, where at different times I'll linearize around different points. Does that make sense? Cool. Okay, cool. So that was DDP, and I'll show examples of DDP results in the next lecture.

So the last thing I wanted to do is talk about Kalman filters and LQG control — linear-quadratic Gaussian control. What I want to do is talk about a different type of MDP problem, one where we don't get to observe the state explicitly. So far, in everything I've been talking about, I've been assuming that at every time step you know what the state of the system is, so you can compute a policy as some function of the state you're in — for example, the action we take is L_t times s_t, so to compute the action you need to know what the state is. What I want to do now is talk about a different type of problem where you don't get to observe the state explicitly.

In fact, before we even talk about control, let me just forget about control for now and look at dynamical systems where you may not get to observe the state explicitly; only later will we tie this back to controlling such systems. Okay? As a concrete example — just an example to think about — imagine using a radar to track a helicopter. So we may model the helicopter, and this will be an amazingly simplified model of a helicopter, as a linear dynamical system: s_{t+1} = A s_t + w_t, and we'll forget about controls for now — we'll put the controls back in later. And for this example I'm going to use an extremely simplified state, where the state is just the position and velocity in the x and y directions. So you may choose an A matrix like this.
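The A matrix written on the board isn't captured in the transcript; a plausible reconstruction, consistent with the description that follows (unit time step, velocities decaying by a factor of 0.9), is:

```latex
s_t = \begin{pmatrix} x_t \\ y_t \\ \dot{x}_t \\ \dot{y}_t \end{pmatrix}, \qquad
s_{t+1} = A s_t + w_t, \quad w_t \sim \mathcal{N}(0, \Sigma_w), \qquad
A = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 0.9 & 0 \\ 0 & 0 & 0 & 0.9 \end{pmatrix}
```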
This is an extremely simplified model of what the dynamics of a plane, or of an object moving in 2-D, may look like. So imagine that you have a radar and you're tracking blips on your radar, and you want to estimate the state of the helicopter — just its x-y position and its x-y velocity — and you have a very simple dynamical model of what the helicopter may do. So this matrix just says that x_{t+1} = x_t + ẋ_t plus noise — that's the first equation — and the second equation says that ẋ_{t+1} = 0.9 ẋ_t plus noise. Yes, this is an amazingly simplified model of what a flying vehicle may look like.

Here's the more interesting part: if you're tracking a helicopter with some sensor, you won't get to observe the full state explicitly. For this cartoon example, let's say that we get to observe y_t = C s_t + v_t, where the v_t are Gaussian random variables with, say, zero mean and covariance Σ_v. Okay? And in this example, let's say that C is the matrix that picks out the positions, so that C s_t is equal to (x, y) — you take the state vector, multiply it by C, and you just get the x-y position. So this is a sensor — maybe a radar, maybe a vision system, I don't know — something that only gets to observe the position of the helicopter you're trying to track.

So here's the cartoon. A helicopter may fly through some sequence of states — let's say it flies through some smooth trajectory, whatever, it makes a slow turn. The true state is four-dimensional, but I'm just drawing two dimensions.
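A minimal numpy sketch of this toy tracking model, with A and C reconstructed from the description above and the noise covariances chosen arbitrarily for illustration; none of these numbers are from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# State s = (x, y, x_dot, y_dot); dynamics reconstructed from the description above.
A = np.array([[1, 0, 1,   0  ],
              [0, 1, 0,   1  ],
              [0, 0, 0.9, 0  ],
              [0, 0, 0,   0.9]])
C = np.array([[1, 0, 0, 0],                # the sensor only sees the (x, y) position
              [0, 1, 0, 0]])
Sigma_w = 0.01 * np.eye(4)                 # process noise (arbitrary choice)
Sigma_v = np.diag([0.05, 0.5])             # observation noise, larger vertically

s = np.array([0.0, 0.0, 1.0, 0.5])         # some initial state
states, observations = [], []
for t in range(50):
    s = A @ s + rng.multivariate_normal(np.zeros(4), Sigma_w)   # s_{t+1} = A s_t + w_t
    y = C @ s + rng.multivariate_normal(np.zeros(2), Sigma_v)   # y_t = C s_t + v_t
    states.append(s)
    observations.append(y)                 # the noisy "orange crosses" from the cartoon
```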
So maybe you have a camera sensor down here, or a radar, or whatever, and for this cartoon example let's say the noise in your observations is larger along the vertical axis than along the horizontal axis. So what you actually get is one sample from each of this sequence of Gaussians: you may observe the helicopter here at time step one, there at time step two, there at time step three, time four, time five. Okay? So that's the sequence of positions that your camera estimates give you. And given these sorts of observations, can you estimate the actual state of the system?

So these orange crosses are your observations y_t, and the task is to estimate the state of the helicopter at every time — the position of the helicopter at every time. Clearly you don't want to rely only on the orange crosses, because they're too noisy, and they also don't give you velocities — you only observe a subset of the state variables. So what can you do? Concretely, you never actually get to observe the true positions; all you get to observe are those orange crosses. I should erase the ellipses if I can — right, you get the idea. The question is: given just the orange crosses, can you get a good estimate of the state of the system at every time step?

So what you want to do is estimate the distribution of the state given all the previous observations: given observations one, two, three, four, and five, where is the helicopter currently? It turns out that the random variables s_0, s_1, up to s_T and y_1 up to y_T have a joint Gaussian distribution.
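For reference, the Gaussian conditioning identity that the next step relies on (the standard formula, as covered in the factor-analysis material referenced below) is:

```latex
\text{If } \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}
\sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),
\text{ then }
z_1 \mid z_2 \sim \mathcal{N}\!\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(z_2 - \mu_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).
```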
So one thing you could do is construct that joint Gaussian distribution: define a vector-valued random variable Z = (s_0, s_1, ..., s_T, y_1, ..., y_T). It turns out that Z will have some Gaussian distribution, with some mean and some covariance matrix. And using the Gaussian marginalization and conditioning formulas — I think way back when we talked about factor analysis in this class, we talked about how to compute marginal distributions and conditional distributions of Gaussians — using those formulas you can actually compute that conditional distribution, the distribution of s_T given y_1 through y_T, and this will give a good estimate of the current state s_T. Okay?

But clearly this is an extremely computationally inefficient way to do it, because these means and covariance matrices grow linearly with the number of time steps, and if you're tracking a helicopter over tens of thousands of time steps, these become huge covariance matrices. So this is a conceptually correct, but computationally unreasonable, way to perform the computation.

So, instead, there's an algorithm called the Kalman filter that allows you to organize your computations efficiently and do this. Just as an aside: if you remember Dan's discussion section on HMMs, the Kalman filter model turns out to actually be a hidden Markov model. This comment is only for those of you who attended Dan's discussion section — if not, then what I'm about to say may not make sense. But if you remember the discussion section on hidden Markov models, it turns out that the Kalman filter model — this linear dynamical system with observations — is actually an HMM, where, unfortunately, the notation is a bit different, because Dan was drawing from a clash of multiple research communities using these same ideas. The notation Dan used, I think, was developed in a different community, and it clashes a bit with the reinforcement learning community's notation.
So in Dan's notation in the HMM section, Z and X were used to denote the state and the observations; today I'm using S and Y to denote the state and the observations. Okay? But it turns out what I'm about to do is a hidden Markov model with continuous states, rather than the discrete states from the discussion section. (If you didn't attend that discussion section, then forget everything I just said in the last minute.)

So here's the outline of the Kalman filter. It's a recursive algorithm. Suppose I have computed p(s_t | y_1, ..., y_t). The Kalman filter organizes the computation into two steps. The first step is called the predict step: given p(s_t | y_1, ..., y_t), you compute p(s_{t+1} | y_1, ..., y_t). And then the other step is called the update step: given that second quantity, you compute p(s_{t+1} | y_1, ..., y_{t+1}). Okay? So having taken into account only the observations up to time t, you now incorporate the observation at time t + 1.

So, concretely, in the predict step — and what I'm going to do is just outline the main steps of the Kalman filter; I won't actually derive the algorithm and prove it correct. Working out the actual proof of what I'm about to write is probably about as hard, or maybe slightly easier, than many of the homeworks you've done already — and you've done some pretty amazingly hard homework — so you can work out the proof for yourself. I'll just write out the main outline and the conclusion of the algorithm. So, for the distribution of s_t given y_1 through y_t:
If the distribution of s_t given y_1 through y_t is Gaussian, with mean s_{t|t} and covariance Σ_{t|t}, then the distribution of s_{t+1} given y_1 through y_t is also Gaussian, with mean s_{t+1|t} and covariance Σ_{t+1|t}, where you compute that mean and covariance using the two formulas on the board. Okay? So having computed the distribution of s_t given y_1 through y_t, you compute the distribution of s_{t+1} given y_1 through y_t as a Gaussian with this mean and this covariance.

And just as a point of notation: I'm using s_t and y_t to denote the true states and observations. So s_t is the unknown true state — s_t is whatever state the system is actually in, and you don't actually know what s_t is, because you don't get to observe it. In contrast, things like s_{t|t}, s_{t+1|t}, Σ_{t|t}, and so on are the results of your computations — these are quantities you actually compute. So I hope the notation is okay: s_t is the unknown true state, whereas s_{t+1|t} and so on are things you compute inside your algorithm. Okay?

So that was the predict step. And in the update step, you find that the distribution of s_{t+1} given y_1 through y_{t+1} is again Gaussian, with mean s_{t+1|t+1} and covariance Σ_{t+1|t+1} given by the update formulas on the board, computed in terms of the distribution of s_{t+1} given y_1 through y_t. So after having performed the most recent Kalman filter update, you find that your distribution over s_{t+1}, given all your observations so far, is Gaussian with mean given by this and covariance given by that. So, informally, this quantity s_{t+1|t+1} is our best estimate for s_{t+1}, given all the observations we've had up to that time. Okay?
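The board formulas themselves aren't in the transcript; for the model s_{t+1} = A s_t + w_t, y_t = C s_t + v_t defined above, the standard Kalman filter predict and update equations take this form (a reconstruction, not a transcription of the board):

```latex
% Predict step: p(s_{t+1} \mid y_1,\dots,y_t) = \mathcal{N}(s_{t+1|t},\, \Sigma_{t+1|t})
s_{t+1|t} = A\, s_{t|t}, \qquad
\Sigma_{t+1|t} = A\, \Sigma_{t|t} A^\top + \Sigma_w

% Update step: p(s_{t+1} \mid y_1,\dots,y_{t+1}) = \mathcal{N}(s_{t+1|t+1},\, \Sigma_{t+1|t+1})
K_{t+1} = \Sigma_{t+1|t}\, C^\top \left( C\, \Sigma_{t+1|t}\, C^\top + \Sigma_v \right)^{-1}
s_{t+1|t+1} = s_{t+1|t} + K_{t+1}\left( y_{t+1} - C\, s_{t+1|t} \right)
\Sigma_{t+1|t+1} = \left( I - K_{t+1} C \right) \Sigma_{t+1|t}
```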
And, again, the correctness of these equations — the fact that this actually computes the mean and covariance of that conditional Gaussian distribution — I'll leave for you to prove at home if you want. Okay? I'm actually going to put this together with LQR control in a second, but before I do that, let me check whether you have questions about this. Actually, let me erase the boards while you take a look at that. Right. Okay. Any questions about Kalman filters? Yeah?

Student: How is it computationally less intensive than computing the joint Gaussian distribution and then finding the conditional —

Instructor (Andrew Ng): Very quickly — yeah, of course. So how is this less computationally intensive than the original method I talked about? In the original method — wow, this is really back and forth — I said: let's construct Z, which was this huge Gaussian thing, and figure out what the mean and covariance matrix of Z are. So Σ will be roughly a T-by-T matrix — or, really, T times the number of state variables plus the number of observation variables, on each side. This is a huge matrix, and as the number of time steps increases, Σ becomes bigger and bigger. The conditioning and marginalization operations require things like computing the inverse of Σ, or of submatrices of Σ, so the naive way of doing this will cost on the order of T-cubed computation, because inverting a T-by-T matrix costs on the order of T-cubed, roughly.

In contrast, with the Kalman filter algorithm — like I wrote over here, I just have the update step; on the other board I had the predict step — you can carry out the computation on both of those lines in constant time. So on every time step you perform these Kalman filter updates.
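A minimal numpy sketch of those constant-time-per-step updates, assuming the A, C, Sigma_w, Sigma_v, and observations from the simulation sketch above; the function and variable names are mine, not the lecture's.

```python
import numpy as np

def kalman_predict(s_est, Sigma_est, A, Sigma_w):
    """Predict step: from p(s_t | y_1..y_t) to p(s_{t+1} | y_1..y_t)."""
    s_pred = A @ s_est
    Sigma_pred = A @ Sigma_est @ A.T + Sigma_w
    return s_pred, Sigma_pred

def kalman_update(s_pred, Sigma_pred, y, C, Sigma_v):
    """Update step: fold the new observation y_{t+1} into the estimate."""
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_v)
    s_est = s_pred + K @ (y - C @ s_pred)
    Sigma_est = (np.eye(len(s_pred)) - K @ C) @ Sigma_pred
    return s_est, Sigma_est

# Each step costs the same amount of work however long you have been tracking;
# nothing here grows with the number of observations seen so far.
s_est, Sigma_est = np.zeros(4), np.eye(4)      # some initial belief (arbitrary)
for y in observations:                          # observations from the sketch above
    s_pred, Sigma_pred = kalman_predict(s_est, Sigma_est, A, Sigma_w)
    s_est, Sigma_est = kalman_update(s_pred, Sigma_pred, y, C, Sigma_v)
```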
So every time you get one more observation, you perform one more Kalman filter update, and the computation for that doesn't depend on the time step — it's the same amount of work at every time step. So the amount of stuff you need to keep around in memory doesn't grow with the number of time steps you've seen. Okay? And actually, this is the way we actually run Kalman filters: initially I have just my first observation, so I compute p(s_1 | y_1), and now I know where I think my helicopter is at time step one. Having computed this, some time passes — say a second — and I get another observation, and I combine the two to get p(s_2 | y_1, y_2). Then maybe another second passes and I get another observation — my helicopter has moved a little bit more — and I combine these to compute p(s_3 | y_1, y_2, y_3). And it turns out that in order to compute this, I don't need to remember any of the earlier observations. Okay? So this is how you actually run it in real time, say. Okay? Cool.

So — oh, drat, running out of time. The last thing I want to do is actually put these things together. Putting Kalman filters together with LQR control, you get an algorithm called LQG control, which stands for linear-quadratic Gaussian. In this type of control problem, we have a linear dynamical system — so I'm now adding the actions back in, so the dynamics now have a B times a_t term as well. Okay? So in the LQG problem, the linear-quadratic Gaussian problem, I have a linear dynamical system that I want to control, and I don't get to observe the states directly; I only get to observe these variables y_t. Okay? So I only get noisy observations of the actual state.
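The board equations for this setup aren't in the transcript; the standard LQG problem statement consistent with the description is:

```latex
% Dynamics, with controls added back in and Gaussian process noise:
s_{t+1} = A s_t + B a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w)

% Observations: only a noisy, partial view of the state is available:
y_t = C s_t + v_t, \qquad v_t \sim \mathcal{N}(0, \Sigma_v)
```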
So it turns out that you can solve an LQG control problem as follows. At every time step, we'll use a Kalman filter to estimate the state. Concretely, let's say you know the initial state; then you initialize s_{0|0} to that initial state and Σ_{0|0} to zero. If you don't know the initial state exactly, then s_{0|0} is just the mean of your initial state estimate and Σ_{0|0} is the covariance of your initial state estimate. So you just initialize your Kalman filter that way.

And then you use the Kalman filter on every step to estimate what the state is. So here's the predict step: previously we had s_{t+1|t} = A s_{t|t}, and so on. So that's your predict step, and then you have an update step, the same as before. The one change I'm going to make to the predict step is that now I'm going to take the action into account as well. This just says: suppose my previous state estimate was s_{t|t}; what do I think my next state s_{t+1|t} will be, given no other observations? And the answer is really just this equation: A s_{t|t} + B a_t. And then — so that, sort of, takes care of the observations.

And the other thing you do is compute the L_t's using LQR. That is, you just look at the linear dynamical system, forget about the observations for now, and compute the optimal policy. Previously we had that you would choose actions a_t = L_t s_t, right? So the optimal policy, we said, was given by these matrices L_t times s_t.
So for the other part of the problem, you use LQR to compute these matrices L_t, ignoring the fact that you don't actually observe the state. And the very final step of LQG control is that — well, when you're actually flying the helicopter, when you're actually doing whatever you're doing, you can't plug in the actual state, because in an LQG problem you don't get to observe the state exactly. So what you do when you actually execute the policy is choose the action according to your best estimate of the state. Okay? In other words, you don't know what s_t is, but your best estimate of the state at any time is this s_{t|t}. So you just plug that in — you take L_t times your best estimate of the state — and then you go ahead and execute that action a_t on your system, on your helicopter, or whatever. Okay?

And it turns out that for this specific class of problems, this is actually the optimal procedure; this will actually cause you to act optimally in your LQG problem. And there's an intuition here: earlier I said that in LQR problems it's almost as if the noise doesn't matter — in a pure LQR problem the w_t terms don't matter; it's as if you can ignore the noise. It turns out that by elaborating that proof — which I'm not going to do; you're welcome to prove it for yourself at home — that same intuition means you can ignore the noise in your observations as well. The s_{t|t} is just your best estimate, so it's as if your true state s_t is equal to s_{t|t} plus noise. So in LQG control, what we're going to do is ignore that noise and just plug in s_{t|t}, and this turns out to be the optimal thing to do.

I should say, this turns out to be a very special case of a problem where you can ignore the noise and still act optimally, and this property is actually called the separation principle: you can design an algorithm for estimating the states and design an algorithm for controlling your system.
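A sketch of the resulting control loop — estimator and controller simply bolted together — assuming the Kalman-style updates above, a hypothetical lqr_gains helper that returns the L_t matrices, and callbacks for stepping and observing the real system; all of these names are illustrative, not from the lecture.

```python
import numpy as np

def run_lqg(A, B, C, Sigma_w, Sigma_v, lqr_gains, step_system, observe,
            s0_mean, s0_cov, T):
    """LQG control: Kalman-filter state estimates fed into LQR gains.

    lqr_gains   : hypothetical helper returning LQR matrices L_0..L_{T-1},
                  computed as if the state were fully observed
    step_system : applies an action to the real (or simulated) system
    observe     : returns the noisy observation y_{t+1} of the new state
    """
    L = lqr_gains(A, B, T)                 # controller designed ignoring the observations
    s_est, Sigma_est = s0_mean, s0_cov     # initialize the filter at the initial state estimate
    for t in range(T):
        a = L[t] @ s_est                   # act on the best current estimate s_{t|t}
        step_system(a)                     # execute the action on the system
        # Predict step, now including the B a_t term:
        s_pred = A @ s_est + B @ a
        Sigma_pred = A @ Sigma_est @ A.T + Sigma_w
        # Update step, same as before, folding in the new observation:
        y = observe()
        K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_v)
        s_est = s_pred + K @ (y - C @ s_pred)
        Sigma_est = (np.eye(len(s_est)) - K @ C) @ Sigma_pred
```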
So you just glom the two together, and that turns out to be optimal. This is a very unusual property, and it's true pretty much only for LQG; it doesn't hold for many systems. Once you change almost anything — make the dynamics non-linear, use some other noise model — it will no longer be true that you can just estimate the states and plug that estimate into a controller that was designed assuming you could observe the states fully. Once you change almost anything in this problem, that will no longer be optimal. But for the LQG problem specifically, it's convenient that you can do this. Just one quick question to close.

Student: [Inaudible]

Instructor (Andrew Ng): Oh, yes. In everything I've described, I'm assuming that you've already learned A and B, or something like that.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, right. Okay. Sorry, we're running a little bit late; let's close for today, and next time I'll talk a bit more about these partially