This presentation is delivered by the Stanford Center for Professional Development.

Instructor (Andrew Ng): Okay. So welcome back, and let's continue our discussion of reinforcement learning algorithms. For today, the first thing I want to do is talk a little bit about debugging reinforcement learning algorithms, and then I'll continue the technical discussion from last week on LQR, on linear-quadratic regulation. In particular, I want to tell you about an algorithm called differential dynamic programming, which I think is actually a very effective optimal control slash reinforcement learning algorithm for many problems. Then we'll talk about Kalman filters and linear-quadratic Gaussian control, LQG.

Let's start with debugging RL algorithms. And can I switch to the laptop display, please?

What I'm about to do here is a specific example that I had done earlier in the quarter, but that I promised to do again. So remember, roughly halfway through the quarter I gave a lecture on debugging learning algorithms, right? This idea that very often you run a learning algorithm and maybe it does roughly what you want, and maybe it doesn't do as well as you're hoping. Then what do you do next? And I talked about this idea that some of the really, really good people in machine learning, the people that really understand learning algorithms and are really good at getting these things to work: very often what they're really good at is figuring out why a learning algorithm is working or is not working. And that prevents them from spending six months doing something that someone else could just look at and say, gee, there was no point collecting more training data, because your learning algorithm had high bias rather than high variance; so those six months you spent collecting more training data, I could have told you six months ago it was a waste of time. Right? These are the sorts of things that the people who are really good at machine learning, who really get machine learning, are very good at.

Well, here are just a few of my slides.
I won't actually talk through these; they're exactly the same slides you saw last time. Actually, I'll just skip ahead, I guess. So last time you saw this discussion on, right, diagnostics for whether you have a bias problem or a variance problem, or, in other cases, whether your optimization algorithm is failing to converge or whether it's the wrong optimization objective, and so on. And we'll talk about that again, but the one example that I, sort of, promised to show again was actually a reinforcement learning example, and at that time we hadn't talked about reinforcement learning yet, so I promised to do exactly the same example again, all right? So let's go through the example.

The motivating example was robotic control. Let's see. Let's say you want to design a controller for this helicopter. This is a very typical way by which you might apply a machine learning algorithm, or several machine learning algorithms, to a control problem, right? The control problem is: you want to build a controller that makes the helicopter hover in place. So the first thing you might do is build a simulator of the helicopter. And this just means a model of the state transition probabilities P_sa of the helicopter, and you can do this in many different ways. Maybe you can try reading a helicopter textbook and building a simulator based on what's known about the aerodynamics of helicopters; it actually turns out this is very hard to do. Another thing you could do is collect data and maybe fit a linear model, or maybe a non-linear model, to what the next state is as a function of the current state and current action. Okay? So there are different ways of estimating the state transition probabilities. So now you have a simulator, and I'm showing a screenshot of the simulator we have at Stanford on the upper right there.

The second thing you might do is choose a reward function. So you might choose this sort of quadratic cost function, right?
So the reward for being at a state x is going to be minus the norm of the difference between the current state and some desired state; that's one simple example of a reward function. And this sort of quadratic reward function is what we were using in the last lecture in LQR control, linear-quadratic regulation control. And finally, right, you run a reinforcement learning algorithm in simulation, meaning that you use your model of the dynamics to try to maximize that finite horizon sum of rewards, and when you do that you get a policy out, which I'm gonna call pi subscript RL, to denote the policy output by the reinforcement learning algorithm. Okay?

Let's say you do this and the resulting controller gets much worse performance than a human pilot that you hire to fly the helicopter for you. So how do you go about figuring out what to do next? Well, there are several things you might do, right? You might try to improve the simulator: maybe you fit a new model of the helicopter dynamics because you think the dynamics are actually non-linear, or maybe you want to collect more training data, so you can get better estimates of the transition probabilities of the helicopter, or maybe you want to fiddle with the features you used to model the dynamics of your helicopter, right? Other things you might do: you might modify the reward function R if you think it's not just a quadratic function, maybe it's something else, I don't know. Or maybe you're unsatisfied with the reinforcement learning algorithm. Maybe you think the algorithm isn't quite doing the right thing, or maybe you think you need to discretize the states more finely in order to apply your reinforcement learning algorithm, or maybe you need to fiddle with the features you used in value function approximation. Okay? So these are three categories of things you might do, and, again, quite often if you chose the wrong one to work on, you can easily spend - actually, for this one I don't want to say six months - you can easily spend a year or two working on the wrong thing.
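To make that setup concrete, here is a minimal sketch, assuming a hypothetical simulator step function f(s, a), the quadratic reward just described, and a made-up horizon; none of these names come from the lecture itself:

```python
import numpy as np

def reward(s, s_desired):
    # Quadratic reward: minus the squared norm of the deviation from the desired state.
    return -float(np.sum((np.asarray(s) - np.asarray(s_desired)) ** 2))

def evaluate_policy(policy, f, s0, s_desired, horizon):
    """Roll out a policy in the simulator f and return the finite-horizon sum of rewards."""
    s, total = np.asarray(s0, float), 0.0
    for _ in range(horizon):
        a = policy(s)            # pi(s): the controller being evaluated
        total += reward(s, s_desired)
        s = f(s, a)              # simulator / model of the dynamics
    return total
```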
Hey, Dan, do me a favor: I'm, sort of, out of chalk; could you wander around and help me get some? Thanks.

So these three steps are summarized in the yellow box at the upper right of this slide. What can you do? So this is the sort of reasoning we actually go through on the helicopter project, often, to decide what to work on. So let me just step through this example fairly slowly. This is the sort of reasoning you might go through. Suppose these three assumptions hold true, right? Suppose that the helicopter simulator is accurate, so let's suppose that you built an accurate model of the dynamics. And suppose, sort of, point two on the slide, that the reinforcement learning algorithm correctly controls the helicopter in simulation, so it maximizes the expected payoff, right? And suppose that maximizing the expected payoff corresponds to good autonomous flight. If all three of these assumptions hold true, then that means you would expect the learned controller pi subscript RL to fly well on the actual helicopter. Okay?

So this is, I'm, sort of, showing you the kind of reasoning that I go through when I'm trying to come up with a set of diagnostics for this problem. And these are some of the diagnostics we actually use routinely on various robotic control problems.

So pi subscript RL, right? We said it doesn't fly well on the actual helicopter. So the first diagnostic you want to run is just to check whether it flies well in simulation, all right? If it flies well in simulation, but not in real life, then the problem is in the simulator, right? Because the simulator predicts that your controller, pi subscript RL, flies well, but it doesn't actually fly well in real life. So if this holds true, then that suggests the problem is in the simulator. Question?

Student: [Inaudible] the helicopter pilots try to fly on the simulator? Do you have the real helicopter pilots try to fly on the simulator?

Instructor (Andrew Ng): Do I have the real helicopter pilots try to fly the simulator?
Student: Real helicopter pilots.

Instructor (Andrew Ng): Oh, I see. Do we ask the helicopter pilots to fly in simulation? So, yeah, it turns out that's one of the later diagnostics; you could do that. On our project we don't do that very often. We informally ask the pilot, who's one of the best pilots in the world, Garett Oku, to look at the simulator sometimes. We don't very often ask him to fly in the simulator. Hope that answers your question. But let me actually go on and show you some of the other diagnostics that you might use. Okay?

Second, let me use pi subscript human to denote the human control policy, right? So pi subscript human is the policy that, you know, however the human flies it. So one thing to do is look at the value of pi RL compared to the value of pi subscript human. Okay? What this means, really, is: look at what the helicopter does when it's flying under the control of pi subscript RL, and look at what the helicopter does when it's flying under the human pilot's control, and then, sort of, compute the sum of rewards for the human pilot's performance and compute the sum of rewards for the learned controller's performance. You can do this on the real helicopter or in simulation, actually; but you see, does the human obtain a higher or a lower sum of rewards on average than the controller you just learned? Okay? And the way you do this is you actually go and fly the helicopter and just measure the sum of rewards on the actual sequence of states the helicopter flew through, right?

So if this condition holds true, where my mouse pointer is - excuse me - if it holds true that the value of pi subscript RL is less than the value of pi subscript human; for those of you watching online, I don't know if you can see this, but this is V^{pi_RL}(s_0) less than V^{pi_human}(s_0).
But if this holds true, then that suggests that the problem is in the reinforcement learning algorithm, because the human has found a policy that obtains a higher reward than the reinforcement learning algorithm does. So this shows that your reinforcement learning algorithm is not maximizing the sum of rewards, right?

And lastly, the last test is this: if the inequality holds in the opposite direction, so the reinforcement learning algorithm obtains a higher sum of rewards on average than the human does, but the reinforcement learning algorithm still flies worse than the human, then this suggests that the problem is in the cost function, in the reward function, because the reinforcement learning algorithm is obtaining a higher sum of rewards than the human, but it still flies worse than the human. So that suggests that maximizing the sum of rewards does not correspond to very good autonomous flight. So if this holds true, then the problem is in your reward function, or, in other words, the problem is in your optimization objective rather than in the algorithm that's trying to maximize the optimization objective, and so you might change your optimization objective; in other words, you might change the reward function. Okay?

So, of course, this is just one example of how you might debug a reinforcement learning algorithm. And this particular set of diagnostics happens to apply only because we're fortunate enough to have access to an amazingly good human pilot who can fly the helicopter for us, right? If you didn't have a very good pilot, then maybe some of these diagnostics wouldn't apply and you'd have to come up with other ones. But I wanted to go through this again as an example of the sort of diagnostics you use to debug a reinforcement learning algorithm.
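For concreteness, here is a minimal sketch of that decision logic; the function and argument names are illustrative, not from the lecture:

```python
def diagnose(flies_well_in_sim, v_rl, v_human):
    """Decision logic for the three diagnostics, given:
    flies_well_in_sim : bool, whether pi_RL flies well in the simulator
    v_rl, v_human     : average sums of rewards V(pi_RL)(s0) and V(pi_human)(s0)
    (We already know pi_RL flies poorly on the real helicopter.)
    """
    if flies_well_in_sim:
        return "Problem is likely in the simulator (it does not match the real helicopter)."
    if v_rl < v_human:
        return "Problem is likely in the RL algorithm: it is not maximizing the sum of rewards."
    return ("Problem is likely in the reward function / optimization objective: "
            "maximizing it does not correspond to good flight.")
```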
And the point of this example isn't so much that I want you to remember the specifics of these particular diagnostics; the point is really that for your own problem, be it supervised learning, unsupervised learning, reinforcement learning, whatever, you very often have to come up with your own diagnostics, your own debugging tools, to figure out why an algorithm is working and why an algorithm isn't working. And this is one example of what we actually do on the helicopter. Okay? Questions about this? Yeah, Justin?

Student: I'm just curious how you collect, like, training data? Like, in our homework the pendulum fell over lots of times, but how do you do that with an expensive helicopter?

Instructor (Andrew Ng): Yeah, I see, right. So on the helicopter, the way we collect data to learn the transition probabilities is - we've done lots of things, but to a first approximation the most basic method is to ask a human pilot to just fly the helicopter around for you, and he's not gonna crash it. So you can collect lots of data as it's being controlled by a human pilot. And, as I say, it turns out in data collection there are, sort of, a few standard ways to collect data. A helicopter can do lots of things, and you don't want to collect data representing only one small part of the flight regime. So concretely, what we often do is ask the pilot to carry out frequency sweeps, and what that means, very informally, is: you imagine them holding a control stick, right, in their hand. Frequency sweeps are a process where you start off making very slow oscillations, and then you start taking your control stick and moving it back and forth faster and faster, so you sort of sweep out a range of frequencies, ranging from very slow [inaudible] oscillations until you go faster and faster and you're, sort of, rocking the control stick back and forth. So this is - oh, good. Thank you. So that's, sort of, one standard way that we use to collect data on various robots. It may or may not apply to different robots or to different systems you work on.
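As a rough illustration only, a frequency sweep on one control axis can be generated as a chirp signal; the sample rate, duration, frequency range, and amplitude below are made-up values, not anything used on the actual helicopter:

```python
import numpy as np

def frequency_sweep(duration_s=60.0, f_start_hz=0.1, f_end_hz=2.0,
                    amplitude=0.3, sample_rate_hz=50.0):
    """Linear chirp: stick deflection that oscillates slowly at first, then faster and faster."""
    t = np.arange(0.0, duration_s, 1.0 / sample_rate_hz)
    # Instantaneous frequency ramps linearly from f_start to f_end.
    freq = f_start_hz + (f_end_hz - f_start_hz) * t / duration_s
    phase = 2.0 * np.pi * np.cumsum(freq) / sample_rate_hz
    return t, amplitude * np.sin(phase)
```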
And, as I say, in reality we do a lot of things. Sometimes we have the controllers collect data autonomously too, but those are other, more complex algorithms. Anything else?

Student: On the first point, it could be that the hardware is perhaps not mapping the actions the same way as the simulator, so [inaudible] this could be very common, that you could have [inaudible]? Is there any specific method people use to take care of that, so [inaudible]?

Instructor (Andrew Ng): Yeah, right. So you're saying, like, point one may not hold because the simulator isn't accurate; it may be that the hardware is doing something strange with the controls, that it's passing the controls through some strange transformation before they actually reach the helicopter, I don't know. Yeah, I've definitely seen that happen on some other robots before. So maybe diagnostic one here is better phrased as deciding whether the simulator matches the actual hardware, I don't know. Yeah, that's another class of problems to watch out for. If you suspect that's the case, I can't think of a good, sort of, diagnostic right now to confirm it, but, again, that would be a good thing to try to come up with a diagnostic for, to see if there might be something wrong with the hardware. And I think these are by no means the definitive diagnostics or anything like that; it's just, sort of, an example. But if you come up with other diagnostics to check that the hardware is working properly, that would be a great thing to do, too. Okay. Last couple of questions before we move on?

Student: You said the reward function was?

Instructor (Andrew Ng): Oh, in this example I was just using a quadratic cost function. On the helicopter we often use things that are much more complicated.

Student: [Inaudible] You have no way of knowing what is the, like, desired state?

Instructor (Andrew Ng): Yeah. So you can, sort of, figure it out: ask the human pilot to hover in place and guess what his desired state was. Again, these aren't exactly known, yeah.
Student: [Inaudible] physics-based learning problems. Does it actually work best to use a physics-based model, you know, just, sort of, from the physics, or do you just, sort of, do learning on [inaudible]?

Instructor (Andrew Ng): Yeah, do physics models work well, right? So the answer is, it varies a lot from problem to problem. It turns out that the aerodynamics of helicopters, I think, aren't understood well enough that you can look at the specs of a helicopter and build a very good physics simulator. So on the helicopter we actually learn the dynamics, and I don't know how to build a physics model for it. For other problems, like if you actually have an inverted pendulum problem or something, there are many other problems for which the dynamics are much better understood and for which physics simulators work perfectly fine. But it depends a lot: on some problems physics simulators work great, and on some they probably aren't great. Okay? Cool.

So, I guess, retract the chalkboard, please. All right. So how much time do I have? Twenty minutes. Okay. So let's go back to our discussion of LQR control, linear-quadratic regulation control, and then I want to take that a little bit further and tell you about, sort of, one variation on LQR called differential dynamic programming.

Just to recap, to remind you of what LQR, or linear-quadratic regulation control, is: in the last lecture I defined a finite horizon problem where your goal is to maximize, right, the finite horizon sum of rewards, and there's no discounting anymore. And then we came up with this dynamic programming algorithm, right? Okay. We came up with this dynamic programming algorithm in the last lecture, where you compute V*_T, the value function for the last time step; in other words, what's the value if you start in the state s and you just get to take one action and then the clock runs out?
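Written out, the finite-horizon dynamic programming recursion being recapped here is the standard one (R is the reward function and P_sa the state transition distribution, in the course's notation):

$$V^*_T(s) = \max_a R(s,a)$$
$$V^*_t(s) = \max_a \Big[ R(s,a) + \mathbb{E}_{s' \sim P_{sa}}\big[ V^*_{t+1}(s') \big] \Big], \qquad t = T-1, \dots, 0$$
$$\pi^*_t(s) = \arg\max_a \Big[ R(s,a) + \mathbb{E}_{s' \sim P_{sa}}\big[ V^*_{t+1}(s') \big] \Big]$$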
The aerodynamic programming algorithm that we 379 00:20:27,140 --> 00:20:29,630 repeatedly compute V star 380 00:20:29,630 --> 00:20:32,890 lowercase t in terms of V star t plus one, 381 00:20:32,890 --> 00:20:35,360 so we compute V star 382 00:20:35,360 --> 00:20:42,360 capital T and then recurs backwards, right? 383 00:20:42,730 --> 00:20:45,980 And so on, until we 384 00:20:45,980 --> 00:20:47,130 385 00:20:47,130 --> 00:20:49,010 get down to V star zero 386 00:20:49,010 --> 00:20:51,180 and then pi star was given by, 387 00:20:51,180 --> 00:20:54,179 as usual, the augmax of the thing we had 388 00:20:54,179 --> 00:20:57,500 in the definition of the value function. 389 00:20:57,500 --> 00:21:00,980 So 390 00:21:00,980 --> 00:21:07,180 last time, the specific example we saw of - or one specific example, that sort of 391 00:21:07,180 --> 00:21:12,160 define a horizon problem that we solved by DP was 392 00:21:12,160 --> 00:21:14,130 the LQR problem 393 00:21:14,130 --> 00:21:20,370 where we worked directly with continuous state and actions. 394 00:21:20,370 --> 00:21:27,370 And 395 00:21:33,380 --> 00:21:38,039 in the LQR problem we had these linear dynamics where the state ST plus one 396 00:21:38,039 --> 00:21:42,040 is a linear function of the previous state and action 397 00:21:42,040 --> 00:21:45,110 and then plus 398 00:21:45,110 --> 00:21:46,740 399 00:21:46,740 --> 00:21:50,930 this Gaussian noise WT, which is covariance sigma W. 400 00:21:50,930 --> 00:21:51,810 401 00:21:51,810 --> 00:21:55,250 And I said briefly last time, that 402 00:21:55,250 --> 00:21:58,960 one specific way to come up with these linear dynamics, right? Oh, excuse me. 403 00:21:58,960 --> 00:22:00,929 404 00:22:00,929 --> 00:22:02,350 One specific way 405 00:22:02,350 --> 00:22:03,460 to 406 00:22:03,460 --> 00:22:05,000 take a system and 407 00:22:05,000 --> 00:22:08,080 come up with a linear model for it is if you have 408 00:22:08,080 --> 00:22:15,080 some simulator, say, right? 409 00:22:19,660 --> 00:22:22,110 So in this cartoon, the 410 00:22:22,110 --> 00:22:25,980 vertical axis represents ST plus one and the horizontal axis represents STAT. So 411 00:22:25,980 --> 00:22:27,269 say 412 00:22:27,269 --> 00:22:30,360 you have a simulator, 413 00:22:30,360 --> 00:22:31,410 414 00:22:31,410 --> 00:22:35,179 right? And let's say it determines [inaudible] simulator, so we have a [inaudible] simulator 415 00:22:35,179 --> 00:22:38,559 that tells you what the next state is ST plus one is a function of 416 00:22:38,559 --> 00:22:40,100 previous state 417 00:22:40,100 --> 00:22:47,100 and action. And then you can choose 418 00:22:50,029 --> 00:22:51,860 a point 419 00:22:51,860 --> 00:22:55,980 around which to linearize this simulator, by which I mean 420 00:22:55,980 --> 00:22:59,060 that you choose a point, an approximate 421 00:22:59,060 --> 00:23:03,350 your - approximate the function F using a linear function, this tangent, to the 422 00:23:03,350 --> 00:23:06,450 function F at that point. So if you do 423 00:23:06,450 --> 00:23:10,610 that you have ST plus one equals - 424 00:23:10,610 --> 00:23:17,610 shoot. 
Sorry about writing down so many lines, but this will be the linearization approximation to the function f, where I've taken the linearization around the specific point (s̄_t, ā_t):

s_{t+1} ≈ f(s̄_t, ā_t) + ∇_s f(s̄_t, ā_t) · (s_t − s̄_t) + ∇_a f(s̄_t, ā_t) · (a_t − ā_t).

Okay? And so you can take this and just boil it down to a linear equation like s_{t+1} ≈ A_t s_t + B_t a_t, where the next state s_{t+1} is now some linear function of the previous state s_t and action a_t, and these matrices A_t and B_t will depend on your choice of location around which to linearize the function. Okay? I said last time that you'd, sort of, expect this linearization approximation to be particularly good in the vicinity of (s̄_t, ā_t), because this linear function is a pretty good approximation to f in this little neighborhood there. And - yes?

Student: [Inaudible] is there an assumption that you are looking at something [inaudible] like the helicopter; are you assuming pilot behavior is the same as [inaudible] behavior, or -

Instructor (Andrew Ng): Yeah, right. So let me not call this an assumption. Let me just say that when I use this algorithm, when I choose to linearize this way, then my approximation will be especially good in the vicinity here, and it may be less good elsewhere. And when I actually talk about DDP I'll make use of this property, which is why I'm going over it now. Okay? But, yeah, there is an intuition that you want to linearize in the vicinity of the states where you expect your system to spend the most time. Right. So, okay. So this is how you might come up with a linear model. And for LQR we also have this sort of quadratic reward function, R(s_t, a_t) = −(s_tᵀ U_t s_t + a_tᵀ V_t a_t), right? Where the matrices U_t and V_t are positive semi-definite, so the rewards are always negative; that's this minus sign.
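As a rough sketch of the linearization step just described (not from the lecture: the function names and step size are illustrative, and finite differences stand in for whatever gradients the simulator can provide):

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Linearize a deterministic simulator s' = f(s, a) around (s_bar, a_bar):
    s_{t+1} ~= f(s_bar, a_bar) + A (s - s_bar) + B (a - a_bar).
    Returns (A, B, c) with c = f(s_bar, a_bar) - A s_bar - B a_bar,
    so that s_{t+1} ~= A s + B a + c.
    """
    s_bar, a_bar = np.asarray(s_bar, float), np.asarray(a_bar, float)
    f0 = np.asarray(f(s_bar, a_bar), float)
    n, d = s_bar.size, a_bar.size
    A, B = np.zeros((f0.size, n)), np.zeros((f0.size, d))
    for i in range(n):                      # partial derivatives w.r.t. the state
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (np.asarray(f(s_bar + ds, a_bar)) - f0) / eps
    for j in range(d):                      # partial derivatives w.r.t. the action
        da = np.zeros(d); da[j] = eps
        B[:, j] = (np.asarray(f(s_bar, a_bar + da)) - f0) / eps
    c = f0 - A @ s_bar - B @ a_bar          # constant offset at the expansion point
    return A, B, c
```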
And then if you take exactly the dynamic programming algorithm that I've written down just now - let's see - it turns out that the value function at every time step will be a quadratic function of the state, and can be written as V*_t(s_t) = s_tᵀ Φ_t s_t + Ψ_t. And so you initialize the dynamic programming algorithm accordingly. I'll just write this down; there's actually just one property I want to point out later, and this equation is, well, somewhat big and hairy, but don't worry about most of its details. Shoot, there's one more equation I want to fit in. Well, okay. All right.

So it turns out the value function is a quadratic function, where V*_T is this, and so you initialize the dynamic programming step with this: this Φ_T and this Ψ_T give you V*_T, and then it recurses backwards. These two equations will give you Φ_t and Ψ_t as a function of Φ_{t+1} and Ψ_{t+1}. Okay? So it recurses backwards for this algorithm. And the last thing - let me get this on the same board; sorry about the disorganized use of the boards, I just wanted this in the same place. And, finally, the actual policy pi*_t(s_t) is given by some linear function of s_t, so the optimal action is a_t = L_t s_t, where L_t here is a matrix computed from A_t, B_t, V_t and Φ_{t+1}. Okay? And so when you do this you now have the actual policy, right? So, just concretely: you run the dynamic programming algorithm to compute Φ_t and Ψ_t for all values of t, and then you plug those in to compute the matrices L_t, and now you know the optimal action to take in any state. Okay?

So there's one very interesting property. These equations are a huge mess, but you can re-derive them yourself; don't worry about the details. But there's one specific property of this dynamic programming algorithm that's very interesting that I want to point out, which is the following.
Notice that to compute the optimal policy I need to compute these matrices L_t, and notice that L_t depends on A, it depends on B, it depends on V, and it depends on Φ, but it doesn't depend on Ψ, right? Notice further that when I carry out my dynamic programming algorithm, my recursive definition for Ψ_t - well, it depends on - oh, excuse me, it should be Ψ_{t+1}, right. Okay. In order to carry out my dynamic programming update for Ψ_t I need to know what Ψ_{t+1} is. So Ψ_t depends on these things. But in order to carry out the dynamic programming update for Φ_t, Φ_t doesn't actually depend on Ψ_{t+1}, right? So, in other words, in order to compute the Φ_t's I don't need the Ψ's. So if all I want is the Φ's, I can actually omit this step of the dynamic programming algorithm and not bother to carry out the dynamic programming updates for the Ψ's. Excuse me, I misspoke: you can forget about the Ψ's, just do the dynamic programming updates for the Φ matrices, and then, having done the DP updates for the Φ_t's, plug those into this formula to compute L_t. Okay? And so, to be slightly more efficient - efficiency isn't really the issue, but if you want, you can actually forget about the Ψ_t's entirely; you don't need to compute them at all.

Now, the other interesting property of this is that the matrix Σ_w appears only in my DP update for the Ψ_t's. It doesn't actually appear in my updates for the Φ_t's.
So you remember, well, my model was that s_{t+1} = A_t s_t + B_t a_t + w_t, where these noise terms w_t have covariance Σ_w, and so the only place the covariance of the noise terms appears is in those Ψ_t's. But I just said that you can do this entire dynamic programming algorithm without the Ψ_t's. So what this means is that you can actually find the optimal policy without knowing what the covariance of the noise terms is. Okay? So this is a very special property of LQR systems, and once you change anything, once you go away from a linear dynamical system, or once you change almost any aspect of this, say to discrete states or discrete actions or whatever, this property will no longer hold true. It is a very special property of LQR systems that the optimal policy does not actually depend on the magnitude of these noise terms; the only important property is that the noise terms have zero mean. So there's this intuition that to compute the optimal policy you can just ignore the noise terms. It's as if, as long as you know the expected value of your next state s_{t+1} - on average, s_{t+1} is A_t s_t + B_t a_t - then you can ignore the noise in your next state s_{t+1}, and the optimal policy doesn't change. Okay? We'll actually come back to this in a bit; later on, when we talk about Kalman filters, we'll actually use this property of LQR systems.

Just to point out: note that the value function does depend on the noise covariance, right? The value function here does depend on Ψ_t. So the larger the noise in your system, the worse your value function. So the value does depend on the noise, but it's the optimal policy that doesn't depend on the noise. We'll use this property later. Okay. So let's see how we're doing on time. Let's see. Right. Okay.
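To make that property concrete, here is a sketch of the backward pass. The exact equations on the board aren't in the transcript, so this is the standard textbook recursion written in the equivalent cost-minimization convention (P_t = −Φ_t and K_t = −L_t relative to the reward form above); note that Σ_w enters only the constant term, never the gains:

```python
import numpy as np

def lqr_backward_pass(A, B, U, V, Sigma_w, T):
    """Finite-horizon LQR backward recursion, in cost-minimization form.

    A, B : lists of dynamics matrices A_t, B_t for t = 0..T-1
    U    : list of state-cost matrices U_t for t = 0..T (U[T] is the terminal cost)
    V    : list of action-cost matrices V_t for t = 0..T-1
    Returns (gains, P0, psi0); the optimal action is a_t = -gains[t] @ s_t.
    """
    P_next = U[T]          # quadratic term of the terminal value function
    psi_next = 0.0         # constant term of the terminal value function
    gains = [None] * T
    for t in reversed(range(T)):
        # The gain never sees Sigma_w: the optimal policy only needs the
        # noise to be zero-mean, not its covariance.
        K = np.linalg.solve(V[t] + B[t].T @ P_next @ B[t],
                            B[t].T @ P_next @ A[t])
        P = U[t] + A[t].T @ P_next @ (A[t] - B[t] @ K)
        # Sigma_w enters only the constant term: more noise makes the value
        # worse, but does not change the policy.
        psi = psi_next + np.trace(Sigma_w @ P_next)
        gains[t], P_next, psi_next = K, P, psi
    return gains, P_next, psi_next
```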
So let's put this aside for now. What I want to do now is tell you about one specific way of applying LQR that's called differential dynamic programming. As in most of our examples, think of trying to control a system, like a helicopter or a car or even a chemical plant, with some continuous state. For the sake of thinking through this example, just imagine trying to control a helicopter. And let's say you have some simulator that tells you what the next state is as a function of the previous state and action, right? And for this, let's say the model in your simulator is non-linear but deterministic. Okay? I said just now that the noise terms don't matter very much, so let's just work with a deterministic simulator for now, but let's say f is non-linear. And let's say there's some specific trajectory that you want the helicopter to follow, all right? So I want to talk about how to apply LQR to a helicopter, or a car, or a chemical plant, where your state variables may be the amounts of different chemicals and the mixes of chemicals you have in different batches, really; but it's easiest to think about a helicopter. Let's say there's some trajectory you want the helicopter to follow.

So here's what differential dynamic programming does. The first step is to come up with what I'm gonna call some nominal trajectory, right? And so we're gonna call this s̄_0, ā_0, s̄_1, ā_1, and so on. Okay? One way to come up with this would be if you had some very bad controller: someone hacked up a controller for flying a helicopter, and it's not a good controller at all. You might then go ahead and fly the helicopter using this very bad, very sloppy controller, and you get out some sequence of states and actions, right? And I'm just gonna call this sequence of states and actions the nominal trajectory.
Then I will linearize f around this nominal trajectory. Okay? So, i.e., using that same idea: for each time t we approximate s_{t+1} by the linearization we just saw, taken around the point (s̄_t, ā_t), that is, s_{t+1} ≈ f(s̄_t, ā_t) + ∇_s f(s̄_t, ā_t) · (s_t − s̄_t) + ∇_a f(s̄_t, ā_t) · (a_t − ā_t). Okay? And then you distill this down to some s_{t+1} ≈ A_t s_t + B_t a_t. Okay? So this will actually be the first time that I'll make explicit use of the ability of LQR, or of these finite horizon problems, to handle non-stationary dynamics. In particular, for this example it will be important that A_t and B_t depend on time - oh, excuse me. Okay?

So the intuition is that even if this was a pretty sloppy controller, even if you had a pretty bad controller come up with your original nominal trajectory, you still expect, maybe, your state and action at time t to be reasonably similar to what even the sloppy controller had done, right? Say you want to fly a trajectory where you make a 90-degree turn. Maybe you have a bad controller that does a pretty sloppy job, but at any given time you're still moving somewhere around this trajectory. So this is really telling you, just very roughly, where along, say, the 90-degree turn trajectory you expect to be at any given time, and so let's linearize around that point. Okay?

Then, having found the linear model, you run LQR to get the optimal policy for this specific linear model, and now you have a better policy. And the final thing you do is - boy, I'll write this on a different board, I guess. Okay.
The last step is you use a simulator, 720 00:40:53,799 --> 00:40:55,869 a model, 721 00:40:55,869 --> 00:41:02,869 to come up with a new normal 722 00:41:04,730 --> 00:41:08,130 trajectory. 723 00:41:08,130 --> 00:41:09,939 So i.e., 724 00:41:09,939 --> 00:41:13,970 okay? 725 00:41:13,970 --> 00:41:20,970 726 00:41:32,599 --> 00:41:37,119 So now you take the controller you just learned and basically try flying your 727 00:41:37,119 --> 00:41:40,209 helicopter in your simulator. So you initialize the simulator 728 00:41:40,209 --> 00:41:42,969 to the initial state, and I'll call the S bar zero, 729 00:41:42,969 --> 00:41:44,489 and you'll run every time step. 730 00:41:44,489 --> 00:41:47,160 You choose an action which I'll call A bar T, 731 00:41:47,160 --> 00:41:50,949 using the controller pi T that you just learned using LQR. And then you simulate forward in time, right? You 732 00:41:50,949 --> 00:41:54,359 use the simulator, the function 733 00:41:54,359 --> 00:41:58,009 F, to tell you what the next state S bar T plus one will be 734 00:41:58,009 --> 00:42:03,779 when your previous state and action is bar T A bar T. 735 00:42:03,779 --> 00:42:04,569 And then 736 00:42:04,569 --> 00:42:08,910 you linearize 737 00:42:08,910 --> 00:42:15,910 around this new trajectory 738 00:42:17,809 --> 00:42:20,079 and repeat. Okay? 739 00:42:20,079 --> 00:42:23,819 So now you have a new normal trajectory and you linearize your 740 00:42:23,819 --> 00:42:27,979 simulator around this new trajectory and then you repeat the whole procedure. I guess going back to 741 00:42:27,979 --> 00:42:30,890 step two of the algorithm. 742 00:42:30,890 --> 00:42:31,500 743 00:42:31,500 --> 00:42:35,700 And this turns out to be a surprisingly effective procedure. So the 744 00:42:35,700 --> 00:42:38,759 cartoon 745 00:42:38,759 --> 00:42:39,899 of 746 00:42:39,899 --> 00:42:41,650 what this algorithm 747 00:42:41,650 --> 00:42:44,289 may do is as follows. 748 00:42:44,289 --> 00:42:46,919 Let's say you want to make a 90-degree turn on the helicopter let's see 749 00:42:46,919 --> 00:42:48,979 one, you know, a helicopter 750 00:42:48,979 --> 00:42:51,719 to follow a trajectory like that. 751 00:42:51,719 --> 00:42:53,880 752 00:42:53,880 --> 00:42:58,189 Follow up of a very bad controller, I just, you know, hack up some controller, whatever. 753 00:42:58,189 --> 00:43:04,899 Have some way to come up with an initial normal trajectory. 754 00:43:04,899 --> 00:43:10,979 Maybe your initial controller overshoots the turn, takes the turn wide, right? 755 00:43:10,979 --> 00:43:14,799 But now you can use these points to linearize 756 00:43:14,799 --> 00:43:17,969 the simulator. So linearize in a very non-linear simulator and the idea is that 757 00:43:17,969 --> 00:43:22,249 maybe this state isn't such a bad approximation. That maybe a 758 00:43:22,249 --> 00:43:25,540 linearization approximation at this sequence of 759 00:43:25,540 --> 00:43:28,509 states will actually be reasonable because your helicopter 760 00:43:28,509 --> 00:43:31,589 won't exactly be on the states, but will be close to the sequence of states of 761 00:43:31,589 --> 00:43:33,739 every time 762 00:43:33,739 --> 00:43:35,470 step. So after one duration of DDP, that's the 763 00:43:35,470 --> 00:43:38,159 764 00:43:38,159 --> 00:43:40,459 target trajectory, maybe you 765 00:43:40,459 --> 00:43:43,640 get a little bit closer 766 00:43:43,640 --> 00:43:47,329 and now you have an even better place around to linearize. 
Then after another iteration of DDP you get closer and closer to exactly the trajectory you want. Okay? So it turns out DDP is a form of local search algorithm, in which on each iteration you find a slightly better place around which to linearize, so you end up with a slightly better controller, and then you repeat. And we actually do this — this is one of the things we do on the helicopter — and it works surprisingly well on many problems. Cool.

I was actually going to show some helicopter videos, but in the interest of time let me defer that to the next lecture. I'll show you a bunch of cool helicopter things next time, but let me just check whether there are questions about this before I move on. Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): In this sample? Yes, right. So I'm going to pick some kind of horizon T, and I'm going to run through my entire trajectory in my simulator, so I end up with a new nominal trajectory to linearize around. Okay? Yeah?

Student: So does this method give you, like, a Gaussian for performing a certain action? Like you talked about the 90-degree turn thing or something.

Instructor (Andrew Ng): Right.

Student: So is this from one 90-degree turn, or can you [inaudible]?

Instructor (Andrew Ng): Yeah. So think about this as if there's a specific trajectory that you want to follow — for a car, or a helicopter, or it could be in a chemical plant. If there's some specific sequence of states you expect the system to go through over time, then at different times you want different linear approximations of your dynamics, right? So I actually have a stationary simulator.
I mean, this function f may be the same function f for all time steps, but the point of DDP is that I may want to use different linearizations for different time steps. So a lot of the inner loop of the algorithm is just coming up with better and better places around which to linearize, where at different times I'll linearize around different points. Does that make sense? Cool. Okay, cool. So that was DDP, and I'll show examples of DDP results in the next lecture.

So the last thing I wanted to do is talk about Kalman filters and LQG control — linear-quadratic Gaussian control. What I want to do is talk about a different type of MDP problem, one where we don't get to observe the state explicitly. So far, in everything I've been talking about, I've been assuming that at every time step you know what the state of the system is, so you can compute a policy as some function of the state you're in — for example, the action we take is L_t times s_t, so to compute the action you need to know what the state is. What I want to do now is talk about a different type of problem where you don't get to observe the state explicitly.

In fact, before we even talk about control, let me just forget about control for now and look at dynamical systems where you may not get to observe the state explicitly; only later will we tie this back to controlling such systems. Okay? As a concrete example — just an example to think about — imagine using a radar to track a helicopter. So we may model the helicopter, and this will be an amazingly simplified model of a helicopter, as a linear dynamical system: s_{t+1} = A s_t + w_t, and we'll forget about controls for now — we'll put the controls back in later. And for this example I'm going to use an extremely simplified state, where the state is just the position and velocity in the x and y directions. So you may choose an A matrix like this.
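The A matrix written on the board isn't captured in the transcript; a plausible reconstruction, consistent with the description that follows (unit time step, velocities decaying by a factor of 0.9), is:

```latex
s_t = \begin{pmatrix} x_t \\ y_t \\ \dot{x}_t \\ \dot{y}_t \end{pmatrix}, \qquad
s_{t+1} = A s_t + w_t, \quad w_t \sim \mathcal{N}(0, \Sigma_w), \qquad
A = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 0.9 & 0 \\ 0 & 0 & 0 & 0.9 \end{pmatrix}
```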
This is an extremely simplified model of what the dynamics of a plane, or of an object moving in 2-D, may look like. So imagine that you have a radar and you're tracking blips on your radar, and you want to estimate the state of the helicopter — just its x-y position and its x-y velocity — and you have a very simple dynamical model of what the helicopter may do. So this matrix just says that x_{t+1} = x_t + ẋ_t plus noise — that's the first equation — and the second equation says that ẋ_{t+1} = 0.9 ẋ_t plus noise. Yes, this is an amazingly simplified model of what a flying vehicle may look like.

Here's the more interesting part: if you're tracking a helicopter with some sensor, you won't get to observe the full state explicitly. For this cartoon example, let's say that we get to observe y_t = C s_t + v_t, where the v_t are Gaussian random variables with, say, zero mean and covariance Σ_v. Okay? And in this example, let's say that C is the matrix that picks out the positions, so that C s_t is equal to (x, y) — you take the state vector, multiply it by C, and you just get the x-y position. So this is a sensor — maybe a radar, maybe a vision system, I don't know — something that only gets to observe the position of the helicopter you're trying to track.

So here's the cartoon. A helicopter may fly through some sequence of states — let's say it flies through some smooth trajectory, whatever, it makes a slow turn. The true state is four-dimensional, but I'm just drawing two dimensions.
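A minimal numpy sketch of this toy tracking model, with A and C reconstructed from the description above and the noise covariances chosen arbitrarily for illustration; none of these numbers are from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# State s = (x, y, x_dot, y_dot); dynamics reconstructed from the description above.
A = np.array([[1, 0, 1,   0  ],
              [0, 1, 0,   1  ],
              [0, 0, 0.9, 0  ],
              [0, 0, 0,   0.9]])
C = np.array([[1, 0, 0, 0],                # the sensor only sees the (x, y) position
              [0, 1, 0, 0]])
Sigma_w = 0.01 * np.eye(4)                 # process noise (arbitrary choice)
Sigma_v = np.diag([0.05, 0.5])             # observation noise, larger vertically

s = np.array([0.0, 0.0, 1.0, 0.5])         # some initial state
states, observations = [], []
for t in range(50):
    s = A @ s + rng.multivariate_normal(np.zeros(4), Sigma_w)   # s_{t+1} = A s_t + w_t
    y = C @ s + rng.multivariate_normal(np.zeros(2), Sigma_v)   # y_t = C s_t + v_t
    states.append(s)
    observations.append(y)                 # the noisy "orange crosses" from the cartoon
```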
So maybe you have a camera sensor down here, or a radar, or whatever, and for this cartoon example let's say the noise in your observations is larger along the vertical axis than along the horizontal axis. So what you actually get is one sample from each of this sequence of Gaussians: you may observe the helicopter here at time step one, there at time step two, there at time step three, time four, time five. Okay? So that's the sequence of positions that your camera estimates give you. And given these sorts of observations, can you estimate the actual state of the system?

So these orange crosses are your observations y_t, and the task is to estimate the state of the helicopter at every time — the position of the helicopter at every time. Clearly you don't want to rely only on the orange crosses, because they're too noisy, and they also don't give you velocities — you only observe a subset of the state variables. So what can you do? Concretely, you never actually get to observe the true positions; all you get to observe are those orange crosses. I should erase the ellipses if I can — right, you get the idea. The question is: given just the orange crosses, can you get a good estimate of the state of the system at every time step?

So what you want to do is estimate the distribution of the state given all the previous observations: given observations one, two, three, four, and five, where is the helicopter currently? It turns out that the random variables s_0, s_1, up to s_T and y_1 up to y_T have a joint Gaussian distribution.
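For reference, the Gaussian conditioning identity that the next step relies on (the standard formula, as covered in the factor-analysis material referenced below) is:

```latex
\text{If } \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}
\sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),
\text{ then }
z_1 \mid z_2 \sim \mathcal{N}\!\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(z_2 - \mu_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).
```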
So one thing you could do is construct that joint Gaussian distribution: define a vector-valued random variable Z = (s_0, s_1, ..., s_T, y_1, ..., y_T). It turns out that Z will have some Gaussian distribution, with some mean and some covariance matrix. And using the Gaussian marginalization and conditioning formulas — I think way back when we talked about factor analysis in this class, we talked about how to compute marginal distributions and conditional distributions of Gaussians — using those formulas you can actually compute that conditional distribution, the distribution of s_T given y_1 through y_T, and this will give a good estimate of the current state s_T. Okay?

But clearly this is an extremely computationally inefficient way to do it, because these means and covariance matrices grow linearly with the number of time steps, and if you're tracking a helicopter over tens of thousands of time steps, these become huge covariance matrices. So this is a conceptually correct, but computationally unreasonable, way to perform the computation.

So, instead, there's an algorithm called the Kalman filter that allows you to organize your computations efficiently and do this. Just as an aside: if you remember Dan's discussion section on HMMs, the Kalman filter model turns out to actually be a hidden Markov model. This comment is only for those of you who attended Dan's discussion section — if not, then what I'm about to say may not make sense. But if you remember the discussion section on hidden Markov models, it turns out that the Kalman filter model — this linear dynamical system with observations — is actually an HMM, where, unfortunately, the notation is a bit different, because Dan was drawing from a clash of multiple research communities using these same ideas. The notation Dan used, I think, was developed in a different community, and it clashes a bit with the reinforcement learning community's notation.
So in Dan's notation in the HMM section, Z and X were used to denote the state and the observations; today I'm using S and Y to denote the state and the observations. Okay? But it turns out what I'm about to do is a hidden Markov model with continuous states, rather than the discrete states from the discussion section. (If you didn't attend that discussion section, then forget everything I just said in the last minute.)

So here's the outline of the Kalman filter. It's a recursive algorithm. Suppose I have computed p(s_t | y_1, ..., y_t). The Kalman filter organizes the computation into two steps. The first step is called the predict step: given p(s_t | y_1, ..., y_t), you compute p(s_{t+1} | y_1, ..., y_t). And then the other step is called the update step: given that second quantity, you compute p(s_{t+1} | y_1, ..., y_{t+1}). Okay? So having taken into account only the observations up to time t, you now incorporate the observation at time t + 1.

So, concretely, in the predict step — and what I'm going to do is just outline the main steps of the Kalman filter; I won't actually derive the algorithm and prove it correct. Working out the actual proof of what I'm about to write is probably about as hard, or maybe slightly easier, than many of the homeworks you've done already — and you've done some pretty amazingly hard homework — so you can work out the proof for yourself. I'll just write out the main outline and the conclusion of the algorithm. So, for the distribution of s_t given y_1 through y_t:
If the distribution of s_t given y_1 through y_t is Gaussian, with mean s_{t|t} and covariance Σ_{t|t}, then the distribution of s_{t+1} given y_1 through y_t is also Gaussian, with mean s_{t+1|t} and covariance Σ_{t+1|t}, where you compute that mean and covariance using the two formulas on the board. Okay? So having computed the distribution of s_t given y_1 through y_t, you compute the distribution of s_{t+1} given y_1 through y_t as a Gaussian with this mean and this covariance.

And just as a point of notation: I'm using s_t and y_t to denote the true states and observations. So s_t is the unknown true state — s_t is whatever state the system is actually in, and you don't actually know what s_t is, because you don't get to observe it. In contrast, things like s_{t|t}, s_{t+1|t}, Σ_{t|t}, and so on are the results of your computations — these are quantities you actually compute. So I hope the notation is okay: s_t is the unknown true state, whereas s_{t+1|t} and so on are things you compute inside your algorithm. Okay?

So that was the predict step. And in the update step, you find that the distribution of s_{t+1} given y_1 through y_{t+1} is again Gaussian, with mean s_{t+1|t+1} and covariance Σ_{t+1|t+1} given by the update formulas on the board, computed in terms of the distribution of s_{t+1} given y_1 through y_t. So after having performed the most recent Kalman filter update, you find that your distribution over s_{t+1}, given all your observations so far, is Gaussian with mean given by this and covariance given by that. So, informally, this quantity s_{t+1|t+1} is our best estimate for s_{t+1}, given all the observations we've had up to that time. Okay?
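The board formulas themselves aren't in the transcript; for the model s_{t+1} = A s_t + w_t, y_t = C s_t + v_t defined above, the standard Kalman filter predict and update equations take this form (a reconstruction, not a transcription of the board):

```latex
% Predict step: p(s_{t+1} \mid y_1,\dots,y_t) = \mathcal{N}(s_{t+1|t},\, \Sigma_{t+1|t})
s_{t+1|t} = A\, s_{t|t}, \qquad
\Sigma_{t+1|t} = A\, \Sigma_{t|t} A^\top + \Sigma_w

% Update step: p(s_{t+1} \mid y_1,\dots,y_{t+1}) = \mathcal{N}(s_{t+1|t+1},\, \Sigma_{t+1|t+1})
K_{t+1} = \Sigma_{t+1|t}\, C^\top \left( C\, \Sigma_{t+1|t}\, C^\top + \Sigma_v \right)^{-1}
s_{t+1|t+1} = s_{t+1|t} + K_{t+1}\left( y_{t+1} - C\, s_{t+1|t} \right)
\Sigma_{t+1|t+1} = \left( I - K_{t+1} C \right) \Sigma_{t+1|t}
```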
And, again, the correctness of these equations — the fact that this actually computes the mean and covariance of that conditional Gaussian distribution — I'll leave for you to prove at home if you want. Okay? I'm actually going to put this together with LQR control in a second, but before I do that, let me check whether you have questions about this. Actually, let me erase the boards while you take a look at that. Right. Okay. Any questions about Kalman filters? Yeah?

Student: How is it computationally less intensive than computing the joint Gaussian distribution and then finding the conditional —

Instructor (Andrew Ng): Very quickly — yeah, of course. So how is this less computationally intensive than the original method I talked about? In the original method — wow, this is really back and forth — I said: let's construct Z, which was this huge Gaussian thing, and figure out what the mean and covariance matrix of Z are. So Σ will be roughly a T-by-T matrix — or, really, T times the number of state variables plus the number of observation variables, on each side. This is a huge matrix, and as the number of time steps increases, Σ becomes bigger and bigger. The conditioning and marginalization operations require things like computing the inverse of Σ, or of submatrices of Σ, so the naive way of doing this will cost on the order of T-cubed computation, because inverting a T-by-T matrix costs on the order of T-cubed, roughly.

In contrast, with the Kalman filter algorithm — like I wrote over here, I just have the update step; on the other board I had the predict step — you can carry out the computation on both of those lines in constant time. So on every time step you perform these Kalman filter updates.
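A minimal numpy sketch of those constant-time-per-step updates, assuming the A, C, Sigma_w, Sigma_v, and observations from the simulation sketch above; the function and variable names are mine, not the lecture's.

```python
import numpy as np

def kalman_predict(s_est, Sigma_est, A, Sigma_w):
    """Predict step: from p(s_t | y_1..y_t) to p(s_{t+1} | y_1..y_t)."""
    s_pred = A @ s_est
    Sigma_pred = A @ Sigma_est @ A.T + Sigma_w
    return s_pred, Sigma_pred

def kalman_update(s_pred, Sigma_pred, y, C, Sigma_v):
    """Update step: fold the new observation y_{t+1} into the estimate."""
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_v)
    s_est = s_pred + K @ (y - C @ s_pred)
    Sigma_est = (np.eye(len(s_pred)) - K @ C) @ Sigma_pred
    return s_est, Sigma_est

# Each step costs the same amount of work however long you have been tracking;
# nothing here grows with the number of observations seen so far.
s_est, Sigma_est = np.zeros(4), np.eye(4)      # some initial belief (arbitrary)
for y in observations:                          # observations from the sketch above
    s_pred, Sigma_pred = kalman_predict(s_est, Sigma_est, A, Sigma_w)
    s_est, Sigma_est = kalman_update(s_pred, Sigma_pred, y, C, Sigma_v)
```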
So every time you get one more observation, you perform one more Kalman filter update, and the computation for that doesn't depend on the time step — it's the same amount of work at every time step. So the amount of stuff you need to keep around in memory doesn't grow with the number of time steps you've seen. Okay? And actually, this is the way we actually run Kalman filters: initially I have just my first observation, so I compute p(s_1 | y_1), and now I know where I think my helicopter is at time step one. Having computed this, some time passes — say a second — and I get another observation, and I combine the two to get p(s_2 | y_1, y_2). Then maybe another second passes and I get another observation — my helicopter has moved a little bit more — and I combine these to compute p(s_3 | y_1, y_2, y_3). And it turns out that in order to compute this, I don't need to remember any of the earlier observations. Okay? So this is how you actually run it in real time, say. Okay? Cool.

So — oh, drat, running out of time. The last thing I want to do is actually put these things together. Putting Kalman filters together with LQR control, you get an algorithm called LQG control, which stands for linear-quadratic Gaussian. In this type of control problem, we have a linear dynamical system — so I'm now adding the actions back in, so the dynamics now have a B times a_t term as well. Okay? So in the LQG problem, the linear-quadratic Gaussian problem, I have a linear dynamical system that I want to control, and I don't get to observe the states directly; I only get to observe these variables y_t. Okay? So I only get noisy observations of the actual state.
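The board equations for this setup aren't in the transcript; the standard LQG problem statement consistent with the description is:

```latex
% Dynamics, with controls added back in and Gaussian process noise:
s_{t+1} = A s_t + B a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w)

% Observations: only a noisy, partial view of the state is available:
y_t = C s_t + v_t, \qquad v_t \sim \mathcal{N}(0, \Sigma_v)
```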
So it turns out that you can solve an LQG control problem as follows. At every time step, we'll use a Kalman filter to estimate the state. Concretely, let's say you know the initial state; then you initialize s_{0|0} to that initial state and Σ_{0|0} to zero. If you don't know the initial state exactly, then s_{0|0} is just the mean of your initial state estimate and Σ_{0|0} is the covariance of your initial state estimate. So you just initialize your Kalman filter that way.

And then you use the Kalman filter on every step to estimate what the state is. So here's the predict step: previously we had s_{t+1|t} = A s_{t|t}, and so on. So that's your predict step, and then you have an update step, the same as before. The one change I'm going to make to the predict step is that now I'm going to take the action into account as well. This just says: suppose my previous state estimate was s_{t|t}; what do I think my next state s_{t+1|t} will be, given no other observations? And the answer is really just this equation: A s_{t|t} + B a_t. And then — so that, sort of, takes care of the observations.

And the other thing you do is compute the L_t's using LQR. That is, you just look at the linear dynamical system, forget about the observations for now, and compute the optimal policy. Previously we had that you would choose actions a_t = L_t s_t, right? So the optimal policy, we said, was given by these matrices L_t times s_t.
So for the other part of the problem, you use LQR to compute these matrices L_t, ignoring the fact that you don't actually observe the state. And the very final step of LQG control is that — well, when you're actually flying the helicopter, when you're actually doing whatever you're doing, you can't plug in the actual state, because in an LQG problem you don't get to observe the state exactly. So what you do when you actually execute the policy is choose the action according to your best estimate of the state. Okay? In other words, you don't know what s_t is, but your best estimate of the state at any time is this s_{t|t}. So you just plug that in — you take L_t times your best estimate of the state — and then you go ahead and execute that action a_t on your system, on your helicopter, or whatever. Okay?

And it turns out that for this specific class of problems, this is actually the optimal procedure; this will actually cause you to act optimally in your LQG problem. And there's an intuition here: earlier I said that in LQR problems it's almost as if the noise doesn't matter — in a pure LQR problem the w_t terms don't matter; it's as if you can ignore the noise. It turns out that by elaborating that proof — which I'm not going to do; you're welcome to prove it for yourself at home — that same intuition means you can ignore the noise in your observations as well. The s_{t|t} is just your best estimate, so it's as if your true state s_t is equal to s_{t|t} plus noise. So in LQG control, what we're going to do is ignore that noise and just plug in s_{t|t}, and this turns out to be the optimal thing to do.

I should say, this turns out to be a very special case of a problem where you can ignore the noise and still act optimally, and this property is actually called the separation principle: you can design an algorithm for estimating the states and design an algorithm for controlling your system.
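A sketch of the resulting control loop — estimator and controller simply bolted together — assuming the Kalman-style updates above, a hypothetical lqr_gains helper that returns the L_t matrices, and callbacks for stepping and observing the real system; all of these names are illustrative, not from the lecture.

```python
import numpy as np

def run_lqg(A, B, C, Sigma_w, Sigma_v, lqr_gains, step_system, observe,
            s0_mean, s0_cov, T):
    """LQG control: Kalman-filter state estimates fed into LQR gains.

    lqr_gains   : hypothetical helper returning LQR matrices L_0..L_{T-1},
                  computed as if the state were fully observed
    step_system : applies an action to the real (or simulated) system
    observe     : returns the noisy observation y_{t+1} of the new state
    """
    L = lqr_gains(A, B, T)                 # controller designed ignoring the observations
    s_est, Sigma_est = s0_mean, s0_cov     # initialize the filter at the initial state estimate
    for t in range(T):
        a = L[t] @ s_est                   # act on the best current estimate s_{t|t}
        step_system(a)                     # execute the action on the system
        # Predict step, now including the B a_t term:
        s_pred = A @ s_est + B @ a
        Sigma_pred = A @ Sigma_est @ A.T + Sigma_w
        # Update step, same as before, folding in the new observation:
        y = observe()
        K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_v)
        s_est = s_pred + K @ (y - C @ s_pred)
        Sigma_est = (np.eye(len(s_est)) - K @ C) @ Sigma_pred
```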
So you just glom the two together, and that turns out to be optimal. This is a very unusual property, and it's true pretty much only for LQG; it doesn't hold for many systems. Once you change almost anything — make the dynamics non-linear, use some other noise model — it will no longer be true that you can just estimate the states and plug that estimate into a controller that was designed assuming you could observe the states fully. Once you change almost anything in this problem, that will no longer be optimal. But for the LQG problem specifically, it's convenient that you can do this. Just one quick question to close.

Student: [Inaudible]

Instructor (Andrew Ng): Oh, yes. In everything I've described, I'm assuming that you've already learned A and B, or something like that.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, right. Okay. Sorry, we're running a little bit late; let's close for today, and next time I'll talk a bit more about these partially