00:00:11,420 --> 00:00:12,819 This subtitle is provided from lecsub.jimdo.com under Creative Commons License. 2 00:00:12,819 --> 00:00:16,099 This presentation is delivered by the Stanford Center for Professional 3 00:00:16,099 --> 00:00:23,099 Development. 4 00:00:25,879 --> 00:00:28,419 Okay. Welcome back. What I 5 00:00:28,419 --> 00:00:31,449 want to do today is talk about 6 00:00:31,449 --> 00:00:34,570 one of my favorite algorithms for controlling NVPs that I think is 7 00:00:34,570 --> 00:00:35,870 8 00:00:35,870 --> 00:00:39,320 one of the more elegant and efficient and powerful algorithms that 9 00:00:39,320 --> 00:00:41,070 I know of. 10 00:00:41,070 --> 00:00:43,020 So 11 00:00:43,020 --> 00:00:46,930 what I'll do is I'll first start by talking about a couple 12 00:00:46,930 --> 00:00:48,640 variations of NVPs 13 00:00:48,640 --> 00:00:52,320 that are slightly different from the NVP definition you've seen so far. These 14 00:00:52,320 --> 00:00:54,420 are pretty 15 00:00:54,420 --> 00:00:58,870 common variations. One is state action rewards, 16 00:00:58,870 --> 00:01:03,830 and the other is horizon NVPs. Using this semi-modified definition of an 17 00:01:03,830 --> 00:01:06,540 NVP, I'll talk about linear dynamical systems. I'll spend a little bit 18 00:01:06,540 --> 00:01:10,410 of time talking about models within dynamical systems, and then talk about LQR, or 19 00:01:10,410 --> 00:01:13,240 linear quadratic regulation control, 20 00:01:13,240 --> 00:01:18,600 which will lead us to some kind of [inaudible] equation, which is 21 00:01:18,600 --> 00:01:23,100 something we will solve in order to do LQR 22 00:01:23,100 --> 00:01:24,460 controls. 23 00:01:24,460 --> 00:01:25,660 24 00:01:25,660 --> 00:01:27,400 So 25 00:01:27,400 --> 00:01:28,650 just 26 00:01:28,650 --> 00:01:33,020 to recap, and we've seen this definition many times now. 27 00:01:33,020 --> 00:01:36,840 We've been defining an NVP 28 00:01:36,840 --> 00:01:41,960 as 29 00:01:41,960 --> 00:01:46,969 [inaudible] states actions, states improbabilities, [inaudible] reward function 30 00:01:46,969 --> 00:01:49,869 where - gamma's 31 00:01:49,869 --> 00:01:52,980 the 32 00:01:52,980 --> 00:01:56,680 discount factors, a number between zero and one. 33 00:01:56,680 --> 00:02:00,380 And R, the reward function, 34 00:02:00,380 --> 00:02:03,909 was the function mapping from the states, the rewards - 35 00:02:03,909 --> 00:02:07,000 was the function mapping from the states, the real numbers. 36 00:02:07,000 --> 00:02:08,220 So we had 37 00:02:08,220 --> 00:02:10,489 value 38 00:02:10,489 --> 00:02:13,759 iteration, which would do this. 39 00:02:13,759 --> 00:02:20,759 40 00:02:21,899 --> 00:02:26,429 So after a while, the value of the iteration will cause V to convert to V star. 41 00:02:26,429 --> 00:02:27,789 Then 42 00:02:27,789 --> 00:02:31,209 having found the optimal value function, if you 43 00:02:31,209 --> 00:02:32,589 compute the optimal policy 44 00:02:32,589 --> 00:02:36,990 by taking essentially [inaudible] of this equation above. Augments of A, 45 00:02:36,990 --> 00:02:43,990 of that [inaudible]. So 46 00:02:46,180 --> 00:02:48,180 in 47 00:02:48,180 --> 00:02:50,719 value iteration, 48 00:02:50,719 --> 00:02:54,599 as you iterate of this - you know, perform this update, 49 00:02:54,599 --> 00:02:56,179 the function V will 50 00:02:56,179 --> 00:03:00,370 [inaudible] convert to V star. So there won't be - 51 00:03:00,370 --> 00:03:04,490 so without defining the number of iterations, you get closer and closer to V star. 52 00:03:04,490 --> 00:03:08,400 This actually converge exponentially quickly to V 53 00:03:08,400 --> 00:03:12,839 star. We will never exactly convert to V star and define the number of iterations. 54 00:03:12,839 --> 00:03:14,799 55 00:03:14,799 --> 00:03:19,629 So what I want to do now is describe a couple of common variations of 56 00:03:19,629 --> 00:03:20,139 NVPs 57 00:03:20,139 --> 00:03:24,139 that we slightly different definitions of. Firs the reward 58 00:03:24,139 --> 00:03:27,799 function and then second, we'll do something slightly different 59 00:03:27,799 --> 00:03:29,389 from just 60 00:03:29,389 --> 00:03:30,209 61 00:03:30,209 --> 00:03:31,699 62 00:03:31,699 --> 00:03:34,659 counting. Then remember in the last lecture, I said that 63 00:03:34,659 --> 00:03:37,419 for infinite state of continuously in NVPs, 64 00:03:37,419 --> 00:03:41,169 we couldn't apply the most straightforward version of value iteration 65 00:03:41,169 --> 00:03:43,669 because if you have a continuous state 66 00:03:43,669 --> 00:03:45,709 NVP, we need to use 67 00:03:45,709 --> 00:03:48,289 some approximations of the optimal value function. 68 00:03:48,289 --> 00:03:51,629 The [inaudible] later in this lecture, 69 00:03:51,629 --> 00:03:55,510 I'll talk about a special case of NVPs, where you can actually represent the 70 00:03:55,510 --> 00:03:57,579 value function exactly, 71 00:03:57,579 --> 00:04:01,029 even if you have an infinite-state space or even if you have a continuous-state 72 00:04:01,029 --> 00:04:02,000 space. 73 00:04:02,000 --> 00:04:06,090 I'll actually do that, talk about these special constants of infinite-state 74 00:04:06,090 --> 00:04:06,779 NVPs, 75 00:04:06,779 --> 00:04:11,319 using this new variation of the reward function and the alternative to 76 00:04:11,319 --> 00:04:15,469 just counting, so start to make the formulation a little easier. 77 00:04:15,469 --> 00:04:16,609 So the first 78 00:04:16,609 --> 00:04:20,850 variation I want to talk about is 79 00:04:20,850 --> 00:04:25,380 selection 80 00:04:25,380 --> 00:04:28,720 rewards. So I'm going to change the definition of the reward function. If this turns out, it 81 00:04:28,720 --> 00:04:30,869 won't be 82 00:04:30,869 --> 00:04:33,409 a huge deal. In particular, I change reward function 83 00:04:33,409 --> 00:04:36,260 to be a function mapping from 84 00:04:36,260 --> 00:04:38,449 a state action pair 85 00:04:38,449 --> 00:04:41,130 to the real numbers. 86 00:04:41,130 --> 00:04:44,509 What I mean by this is just the following. 87 00:04:44,509 --> 00:04:46,599 You sell off in 88 00:04:46,599 --> 00:04:48,849 some state in zero. You 89 00:04:48,849 --> 00:04:52,830 take an action A zero as a result of your state of action choice. You transition 90 00:04:52,830 --> 00:04:54,709 to some new state, S1. 91 00:04:54,709 --> 00:04:59,519 You take some action, A1. You transition to some new state, S2. You take some 92 00:04:59,519 --> 00:05:05,009 action, A2, and so on. So this is a [inaudible] state action sequence that you see. 93 00:05:05,009 --> 00:05:08,490 So in an MPP where you have a state action reward, 94 00:05:08,490 --> 00:05:12,879 your total payoff is now defined as 95 00:05:12,879 --> 00:05:14,259 this, 96 00:05:14,259 --> 00:05:19,349 where your reward function is now a function both of the current state and of the 97 00:05:19,349 --> 00:05:20,020 action 98 00:05:20,020 --> 00:05:22,220 you took in the current state. 99 00:05:22,220 --> 00:05:27,039 So this is my 100 00:05:27,039 --> 00:05:28,539 total payoff. 101 00:05:28,539 --> 00:05:29,199 102 00:05:29,199 --> 00:05:32,240 Then as usual, my goal will be to find a policy - 103 00:05:32,240 --> 00:05:35,039 to find the function mapping from the state's actions, 104 00:05:35,039 --> 00:05:37,699 so that when I execute that policy, 105 00:05:37,699 --> 00:05:40,320 I can maximize the expected value 106 00:05:40,320 --> 00:05:42,030 of my total payoff. 107 00:05:42,030 --> 00:05:46,600 So this definition, it actually turns out that given an NVP with 108 00:05:46,600 --> 00:05:47,889 state action rewards, 109 00:05:47,889 --> 00:05:49,339 you can actually - 110 00:05:49,339 --> 00:05:52,949 so by [inaudible] with the definitions of the states, you can actually reduce this back to an NVP 111 00:05:52,949 --> 00:05:56,639 with only rewards that function in the states. That may or may 112 00:05:56,639 --> 00:05:58,419 not be [inaudible]. Don't worry if 113 00:05:58,419 --> 00:06:00,800 it isn't. But 114 00:06:00,800 --> 00:06:04,729 using state action rewards allows you to more directly model 115 00:06:04,729 --> 00:06:05,569 116 00:06:05,569 --> 00:06:09,050 problems in which different actions, we have different costs. So 117 00:06:09,050 --> 00:06:13,090 a running example is the robot. So [inaudible] a robot, 118 00:06:13,090 --> 00:06:17,080 and it's more costly for the robot to move than for it to stay still. 119 00:06:17,080 --> 00:06:20,930 If you give an action to stay still, and the action to stay still may 120 00:06:20,930 --> 00:06:25,349 have a lower cost because you're not using a battery power [inaudible] recharge it for that 121 00:06:25,349 --> 00:06:26,020 action. 122 00:06:26,020 --> 00:06:29,820 Another example would be - 123 00:06:29,820 --> 00:06:33,059 actually, another navigation example 124 00:06:33,059 --> 00:06:37,429 would be if you have an outdoor vehicle. You need to drive 125 00:06:37,429 --> 00:06:38,599 over 126 00:06:38,599 --> 00:06:42,169 some sort of outdoor terrain, like very rough rocks 127 00:06:42,169 --> 00:06:45,679 or driving over grass. It may be costly, more difficult, 128 00:06:45,679 --> 00:06:48,960 than driving over, say, a paved road. So you may assign 129 00:06:48,960 --> 00:06:51,349 an action that requires 130 00:06:51,349 --> 00:06:56,150 driving over grass or driving over rocks to be 131 00:06:56,150 --> 00:07:00,289 more costly than driving over paved 132 00:07:00,289 --> 00:07:04,809 road. 133 00:07:04,809 --> 00:07:07,830 So this really isn't a huge change to the definition of 134 00:07:07,830 --> 00:07:14,830 an NVP. I won't really bother to justify this a lot, but [inaudible] 135 00:07:15,090 --> 00:07:16,779 equations 136 00:07:16,779 --> 00:07:19,610 is generalizing 137 00:07:19,610 --> 00:07:24,279 the way that you probably expect it. V star of S is now equal to 138 00:07:24,279 --> 00:07:26,650 139 00:07:26,650 --> 00:07:33,650 that. 140 00:07:37,419 --> 00:07:40,649 So previously, when the reward function was just a function of the state, S, we could 141 00:07:40,649 --> 00:07:43,729 take the max and push it in here. 142 00:07:43,729 --> 00:07:44,530 But now that 143 00:07:44,530 --> 00:07:46,069 144 00:07:46,069 --> 00:07:49,779 the rewards is a function of the action you're taking as well, the max comes outside. So 145 00:07:49,779 --> 00:07:50,869 this says that 146 00:07:50,869 --> 00:07:57,119 your expected total payoff, starting from the state, as [inaudible] policy, is equal to 147 00:07:57,119 --> 00:08:01,519 first your immediate reward, RFSA, for executing some action, A, in state S. Then 148 00:08:01,519 --> 00:08:02,219 149 00:08:02,219 --> 00:08:07,369 plus gamma times your future expected total payoff. 150 00:08:07,369 --> 00:08:08,839 So 151 00:08:08,839 --> 00:08:11,330 this is your expected total payoff if you 152 00:08:11,330 --> 00:08:14,919 take the action, A, from the current state. So 153 00:08:14,919 --> 00:08:18,879 while these [inaudible] optimal value functions. So your actually optimal expected total payoff 154 00:08:18,879 --> 00:08:21,539 is the max of all actions of this 155 00:08:21,539 --> 00:08:26,049 thing on 156 00:08:26,049 --> 00:08:30,449 the right. Let's see. Value iteration, which I'm abbreviating VI, 157 00:08:30,449 --> 00:08:33,720 is really the same algorithm. B of 158 00:08:33,720 --> 00:08:38,060 S is updated as 159 00:08:38,060 --> 00:08:44,630 max over A, RFSA, same thing. Just [inaudible] on the right-hand side of those equations 160 00:08:44,630 --> 00:08:45,900 161 00:08:45,900 --> 00:08:49,880 be updating V of S using [inaudible] equations. Again, you get value iteration, 162 00:08:49,880 --> 00:08:53,370 exactly the same way. 163 00:08:53,370 --> 00:08:55,520 Then finally, 164 00:08:55,520 --> 00:08:56,950 165 00:08:56,950 --> 00:09:00,160 having found the optimal value function, V star, 166 00:09:00,160 --> 00:09:03,180 using the value iteration algorithm, 167 00:09:03,180 --> 00:09:07,650 you can then compute the optimal policy, pi star of S as same as 168 00:09:07,650 --> 00:09:12,250 before. The best action to take in the state, S, is the 169 00:09:12,250 --> 00:09:19,250 action, A, that maximizes the thing on the righthand 170 00:09:28,360 --> 00:09:32,480 side. So having used value iteration 171 00:09:32,480 --> 00:09:35,330 to compute the optimal value function, you can then find 172 00:09:35,330 --> 00:09:40,210 pi star using that. 173 00:09:40,210 --> 00:09:41,890 174 00:09:41,890 --> 00:09:45,670 So this was the easier of the two 175 00:09:45,670 --> 00:09:47,920 variations of NVPs [inaudible] 176 00:09:47,920 --> 00:09:48,910 so far. 177 00:09:48,910 --> 00:09:55,910 Any questions? Actually, are there questions about this? 178 00:10:01,090 --> 00:10:02,810 So 179 00:10:02,810 --> 00:10:06,660 the other variation, the other alternative definition, 180 00:10:06,660 --> 00:10:07,600 181 00:10:07,600 --> 00:10:10,520 will be 182 00:10:10,520 --> 00:10:16,220 finite horizon 183 00:10:16,220 --> 00:10:19,010 NVPs. So 184 00:10:19,010 --> 00:10:21,540 finite horizon NVP 185 00:10:21,540 --> 00:10:24,780 comprises of the [inaudible] SA [inaudible] 186 00:10:24,780 --> 00:10:28,110 transition, probably 187 00:10:28,110 --> 00:10:30,650 with these, and the parameter T 188 00:10:30,650 --> 00:10:32,180 and the reward function. 189 00:10:32,180 --> 00:10:36,250 Here, 190 00:10:36,250 --> 00:10:37,360 T 191 00:10:37,360 --> 00:10:40,520 is a parameter called the horizon time. 192 00:10:40,520 --> 00:10:43,050 Concretely, 193 00:10:43,050 --> 00:10:45,920 what this really means is that we'll 194 00:10:45,920 --> 00:10:48,330 be taking actions in the NVP 195 00:10:48,330 --> 00:10:55,330 only for a total of capital T times this. So we won't use this counting anymore. [Audio cuts out] Instructor (Andrew 196 00:10:56,180 --> 00:11:00,430 197 00:11:00,430 --> 00:11:04,350 Ng):[Inaudible] zero, take action A0. Get to some other state S1, take 198 00:11:04,350 --> 00:11:06,670 action A1 and so on. 199 00:11:06,670 --> 00:11:10,100 Eventually, you get to some state, STAT 200 00:11:10,100 --> 00:11:12,180 after T times [inaudible]. So 201 00:11:12,180 --> 00:11:16,760 my total payoff, now, 202 00:11:16,760 --> 00:11:18,180 will be 203 00:11:18,180 --> 00:11:19,350 given 204 00:11:19,350 --> 00:11:21,900 by this 205 00:11:21,900 --> 00:11:26,180 sum from time zero up to time T of my sum over rewards. 206 00:11:26,180 --> 00:11:27,740 207 00:11:27,740 --> 00:11:31,870 Okay? My goal, as usual 208 00:11:31,870 --> 00:11:34,030 - so this is my total payoff. 209 00:11:34,030 --> 00:11:37,580 My goal, as usual, is to maximize the expected value of my total payoff. We want to come up 210 00:11:37,580 --> 00:11:40,140 with a policy to maximize the 211 00:11:40,140 --> 00:11:44,010 expected value of this total payoff. The key difference is that 212 00:11:44,010 --> 00:11:47,890 the world only will exist [inaudible], and after that, 213 00:11:47,890 --> 00:11:51,380 there's no more rewards to be corrected. 214 00:11:51,380 --> 00:11:54,820 So this turns out to be [inaudible] of a difference because it turns out that the optimal 215 00:11:54,820 --> 00:11:59,710 policy 216 00:11:59,710 --> 00:12:00,690 217 00:12:00,690 --> 00:12:03,200 may be non-stationary. 218 00:12:03,200 --> 00:12:06,300 219 00:12:06,300 --> 00:12:07,540 The 220 00:12:07,540 --> 00:12:10,319 term, stationary, means that it doesn't depend 221 00:12:10,319 --> 00:12:12,649 on time. Non-stationary 222 00:12:12,649 --> 00:12:14,419 means that it may depend on time. 223 00:12:14,419 --> 00:12:17,810 So non-stationary 224 00:12:17,810 --> 00:12:21,820 roughly means that my optimal action to take will be different for different time 225 00:12:21,820 --> 00:12:22,490 steps. 226 00:12:22,490 --> 00:12:24,430 That's what nonstationary 227 00:12:24,430 --> 00:12:26,790 means. Just as an example of that, 228 00:12:26,790 --> 00:12:31,240 imagine that we have some robot. Let's say the 229 00:12:31,240 --> 00:12:33,850 robot 230 00:12:33,850 --> 00:12:35,290 is here. 231 00:12:35,290 --> 00:12:37,620 Let's say that there's 232 00:12:37,620 --> 00:12:40,590 a good [inaudible] over there with a plus one reward. 233 00:12:40,590 --> 00:12:42,930 234 00:12:42,930 --> 00:12:46,360 Much further away, there's a plus ten reward. 235 00:12:46,360 --> 00:12:50,530 So depending on how much time you have left on the clock, 236 00:12:50,530 --> 00:12:52,980 it may be better to go after the plus one 237 00:12:52,980 --> 00:12:55,330 or the plus ten reward. If it's still early 238 00:12:55,330 --> 00:12:58,840 in the game, you still have a lot of time, it may be better to head toward 239 00:12:58,840 --> 00:13:03,030 the plus-ten rewards junction and get a much larger 240 00:13:03,030 --> 00:13:06,040 reward. If you only have a couple of time sets left, if the clock has nearly reached 241 00:13:06,040 --> 00:13:08,069 the time, capital T, 242 00:13:08,069 --> 00:13:11,790 then you may not have enough time to get to a plus ten reward. You've be better 243 00:13:11,790 --> 00:13:16,050 off heading for the plus one reward that's much more close by. 244 00:13:16,050 --> 00:13:19,660 So what this example illustrates is that when you're in that state, 245 00:13:19,660 --> 00:13:23,740 the best action to take could be to go left or to go right, depending on what 246 00:13:23,740 --> 00:13:26,499 time it is. So just as an example, illustrating 247 00:13:26,499 --> 00:13:31,960 how the actually policy can be non-stationary. 248 00:13:31,960 --> 00:13:33,060 249 00:13:33,060 --> 00:13:34,710 250 00:13:34,710 --> 00:13:36,410 In fact, 251 00:13:36,410 --> 00:13:38,120 since we have non-stationary 252 00:13:38,120 --> 00:13:43,120 policies anyway in the sequence, what I'm going to do next, I'm going to 253 00:13:43,120 --> 00:13:44,159 allow 254 00:13:44,159 --> 00:13:48,390 non-stationary transition probabilities as well. So I'll just write that 255 00:13:48,390 --> 00:13:50,390 up there. 256 00:13:50,390 --> 00:13:52,170 What I mean is that 257 00:13:52,170 --> 00:13:56,640 so far, assuming that the state ST plus one, is joined from the state 258 00:13:56,640 --> 00:13:59,480 transition probabilities [inaudible] 259 00:13:59,480 --> 00:14:02,110 by the previous states and the previous action. 260 00:14:02,110 --> 00:14:05,650 I've been assuming that these state transition probabilities are the same 261 00:14:05,650 --> 00:14:10,170 for all times. So I want to say [inaudible] and take some action, 262 00:14:10,170 --> 00:14:12,540 the distribution of an innate state doesn't matter. 263 00:14:12,540 --> 00:14:15,910 It doesn't depend on time. 264 00:14:15,910 --> 00:14:16,700 So 265 00:14:16,700 --> 00:14:21,890 I'm going to allow a study more general definition as well, in which we have 266 00:14:21,890 --> 00:14:24,600 non-stationary state transition probabilities 267 00:14:24,600 --> 00:14:25,329 so that 268 00:14:25,329 --> 00:14:27,960 the chance of where you end up [inaudible] 269 00:14:27,960 --> 00:14:29,590 may also 270 00:14:29,590 --> 00:14:33,060 depend on what time it is. 271 00:14:33,060 --> 00:14:37,320 So as examples of this non-stationary state 272 00:14:37,320 --> 00:14:39,000 transition probabilities, 273 00:14:39,000 --> 00:14:44,400 one example would be if you model flying an aircraft over a long distance. 274 00:14:44,400 --> 00:14:48,650 Then as the aircraft flies, you burn fuel and become lighter. So the 275 00:14:48,650 --> 00:14:52,510 dynamics of the aircraft actually change over time. The mass of the aircraft can 276 00:14:52,510 --> 00:14:55,160 change significantly over time as you burn fuel. 277 00:14:55,160 --> 00:14:56,480 So depending on 278 00:14:56,480 --> 00:15:01,130 what time it is, your mixed state could actually 279 00:15:01,130 --> 00:15:03,320 depend on not only your current state and your action, 280 00:15:03,320 --> 00:15:06,240 but also on how much fuel you burn, therefore, what time it is. 281 00:15:06,240 --> 00:15:07,250 282 00:15:07,250 --> 00:15:10,510 Other examples, another aerospace one, is 283 00:15:10,510 --> 00:15:14,620 if you have the weather forecast for the next 24 hours, say, 284 00:15:14,620 --> 00:15:18,360 you know what the winds and precipitation are going to be like over the next 24 hours. Then again, 285 00:15:18,360 --> 00:15:22,200 if you fly the aircraft from, say, here to New York, 286 00:15:22,200 --> 00:15:25,019 it may cost different amounts to fly 287 00:15:25,019 --> 00:15:28,829 different [inaudible] at different times. Maybe flying over 288 00:15:28,829 --> 00:15:33,439 the Rockies may cost different amounts, depending on whether you do it now, when there's really great weather, or 289 00:15:33,439 --> 00:15:34,319 if you do it 290 00:15:34,319 --> 00:15:37,629 a few hours from now, when the weather may be forecast really bad. 291 00:15:37,629 --> 00:15:38,720 292 00:15:38,720 --> 00:15:39,380 293 00:15:39,380 --> 00:15:40,970 For an 294 00:15:40,970 --> 00:15:46,440 example you see everyday, same thing for traffic, right? 295 00:15:46,440 --> 00:15:48,380 There's at least - 296 00:15:48,380 --> 00:15:52,510 depending on where you live, certainly here in California, there are times of day where traffic is 297 00:15:52,510 --> 00:15:57,050 really bad in lots of places. So the costs of driving certain roads may vary, depending on 298 00:15:57,050 --> 00:15:59,070 what time of day it is. 299 00:15:59,070 --> 00:16:03,589 Lots of other examples. Industrial automation, different machines 300 00:16:03,589 --> 00:16:07,160 in the factory may be available to different degrees at different times of day. They 301 00:16:07,160 --> 00:16:10,149 cost different amounts to hire different workers, depending on whether you pay 302 00:16:10,149 --> 00:16:12,950 over time [inaudible] or whatever. 303 00:16:12,950 --> 00:16:15,960 So the cost of doing different things in the factory can also 304 00:16:15,960 --> 00:16:17,250 be a function of time. 305 00:16:17,250 --> 00:16:18,429 306 00:16:18,429 --> 00:16:24,280 The state transition probabilities can also be a function of time. 307 00:16:24,280 --> 00:16:25,750 308 00:16:25,750 --> 00:16:27,850 Lastly, 309 00:16:27,850 --> 00:16:31,910 while we're doing this as well, to make this fully general, we might as 310 00:16:31,910 --> 00:16:37,730 well have non-stationary [inaudible] as well, 311 00:16:37,730 --> 00:16:40,639 where you might also index the reward 312 00:16:40,639 --> 00:16:42,679 function of these times and prescripts, 313 00:16:42,679 --> 00:16:45,090 where the cost of doing different things may depend on 314 00:16:45,090 --> 00:16:47,830 the time as well. 315 00:16:47,830 --> 00:16:50,530 316 00:16:50,530 --> 00:16:55,590 Actually, there's more examples of non-stationary NVPs, so let's - so now we 317 00:16:55,590 --> 00:16:59,249 have a nonstationary policy. Let's talk about an algorithm to actually 318 00:16:59,249 --> 00:17:02,180 try to find the optimal policy. 319 00:17:02,180 --> 00:17:04,110 320 00:17:04,110 --> 00:17:10,089 So let me 321 00:17:10,089 --> 00:17:13,580 define the following. 322 00:17:13,580 --> 00:17:20,390 This is now a slightly modified definition of the optimal value function. I'll 323 00:17:20,390 --> 00:17:27,390 just 324 00:17:30,970 --> 00:17:37,970 write this down, I guess. 325 00:17:39,080 --> 00:17:42,620 So I'm going to define the optimal value function, and this going to be 326 00:17:42,620 --> 00:17:44,690 indexed by T, with a subscript T. The 327 00:17:44,690 --> 00:17:46,480 optimal value of a state 328 00:17:46,480 --> 00:17:47,840 329 00:17:47,840 --> 00:17:50,360 for time, T, we're going to define as your 330 00:17:50,360 --> 00:17:52,300 optimal 331 00:17:52,300 --> 00:17:57,320 sum of rewards for if you start the NVP at that state, S, 332 00:17:57,320 --> 00:17:59,309 and if the clock starts off 333 00:17:59,309 --> 00:18:02,200 at time, 334 00:18:02,200 --> 00:18:06,400 lowercase T. So the optimal value of a state will depend on what time it is and how 335 00:18:06,400 --> 00:18:09,919 much time you have lest to run this 336 00:18:09,919 --> 00:18:11,440 NVP. Therefore, 337 00:18:11,440 --> 00:18:16,230 the sum on the right sums only for time T, time T plus one, time T plus two up 338 00:18:16,230 --> 00:18:17,840 to time, 339 00:18:17,840 --> 00:18:19,520 capital T. 340 00:18:19,520 --> 00:18:22,570 341 00:18:22,570 --> 00:18:26,620 I'll just state in English again, this is your expected optimal 342 00:18:26,620 --> 00:18:28,059 total payoff 343 00:18:28,059 --> 00:18:31,240 if you start your system in a state, S, 344 00:18:31,240 --> 00:18:34,760 and if the clock is already at time, 345 00:18:34,760 --> 00:18:37,010 lowercase T. 346 00:18:37,010 --> 00:18:38,370 347 00:18:38,370 --> 00:18:42,620 So it turns out then there's a [inaudible], you can value that [inaudible]. 348 00:18:42,620 --> 00:18:46,080 Let me just write out the value [inaudible] algorithm for this. 349 00:18:46,080 --> 00:18:48,059 It turns out 350 00:18:48,059 --> 00:18:49,270 you can - 351 00:18:49,270 --> 00:18:54,150 well, let me just write this 352 00:18:54,150 --> 00:18:56,920 out. I'll write this below. 353 00:18:56,920 --> 00:19:00,260 It turns out you can compute the optimal value function for the NVP using the 354 00:19:00,260 --> 00:19:03,020 following recursion, which is very similar to 355 00:19:03,020 --> 00:19:06,200 what we have for value iteration. We're going 356 00:19:06,200 --> 00:19:06,879 to set V 357 00:19:06,879 --> 00:19:12,080 of S to be equal to [inaudible] over 358 00:19:12,080 --> 00:19:12,889 A, 359 00:19:12,889 --> 00:19:19,889 same as before, right? Okay? 360 00:19:41,110 --> 00:19:42,419 So 361 00:19:42,419 --> 00:19:46,650 if I start the clock at time T and from state S, 362 00:19:46,650 --> 00:19:49,230 my expected total payoff is equal to 363 00:19:49,230 --> 00:19:51,590 the maximum [inaudible] actions I may take 364 00:19:51,590 --> 00:19:53,500 of my immediate reward. Taking that 365 00:19:53,500 --> 00:19:56,100 action, A, in that state, 366 00:19:56,100 --> 00:20:00,050 S. Them plus my expected future payoff. So if I take action, A, 367 00:20:00,050 --> 00:20:01,690 I would transition with [inaudible] P, subscript SA, S prime 368 00:20:01,690 --> 00:20:06,470 to some new state, S 369 00:20:06,470 --> 00:20:09,470 prime. If I get to the state, S prime, 370 00:20:09,470 --> 00:20:11,600 my total expected payoff from the 371 00:20:11,600 --> 00:20:14,270 state S prime would be these [inaudible] 372 00:20:14,270 --> 00:20:18,610 now subscript T plus one, that's prime. Subscript T plus one 373 00:20:18,610 --> 00:20:23,400 reflects that after I've taken one action, my clock will have advanced from 374 00:20:23,400 --> 00:20:26,400 time T to time T plus one. So this is now V 375 00:20:26,400 --> 00:20:30,559 star subscript T plus one. 376 00:20:30,559 --> 00:20:33,909 So this expresses V star of T 377 00:20:33,909 --> 00:20:36,230 in terms of V star T plus one. 378 00:20:36,230 --> 00:20:39,740 Then lastly, to start off this recursion, you 379 00:20:39,740 --> 00:20:41,700 would have V star, 380 00:20:41,700 --> 00:20:43,270 capital T 381 00:20:43,270 --> 00:20:50,270 is equal to - it's just equal 382 00:20:51,300 --> 00:20:53,220 to that. 383 00:20:53,220 --> 00:20:56,900 If you're already at time, capital T, then you just get to take one action, and then 384 00:20:56,900 --> 00:20:59,070 the clock runs out. So this is 385 00:20:59,070 --> 00:21:03,610 V star capital T. Your value of starting in some state, S, with no time - 386 00:21:03,610 --> 00:21:08,809 with just one time step, with no time left on the clock. 387 00:21:08,809 --> 00:21:12,880 So in the case of finite horizon NVP, this actually gives up a very nice 388 00:21:12,880 --> 00:21:13,830 389 00:21:13,830 --> 00:21:15,640 dynamic programming algorithm 390 00:21:15,640 --> 00:21:16,559 in which you can 391 00:21:16,559 --> 00:21:20,740 start off by computing V star of T. 392 00:21:20,740 --> 00:21:24,030 Then you use this backward [inaudible] to compute 393 00:21:24,030 --> 00:21:25,089 V star of 394 00:21:25,089 --> 00:21:29,349 capital T minus one, capital T minus two and so on. We compute V star 395 00:21:29,349 --> 00:21:32,230 of T and T minus one and so on. It recurs backwards 396 00:21:32,230 --> 00:21:34,670 onto your computer, V star 397 00:21:34,670 --> 00:21:38,450 for all of your time steps. Can you see this 398 00:21:38,450 --> 00:21:39,500 board? 399 00:21:39,500 --> 00:21:42,250 Cool. Then 400 00:21:42,250 --> 00:21:44,770 401 00:21:44,770 --> 00:21:47,570 the final step is 402 00:21:47,570 --> 00:21:52,810 - previously, we said that pi star of S - I'm going to come back and change this a bit - was the [inaudible] 403 00:21:52,810 --> 00:21:55,580 A of R 404 00:21:55,580 --> 00:21:57,150 plus A 405 00:21:57,150 --> 00:22:04,150 plus [inaudible] PSA - this 406 00:22:04,280 --> 00:22:07,230 is sort of what we had. In 407 00:22:07,230 --> 00:22:09,789 the finite horizon case, the 408 00:22:09,789 --> 00:22:13,250 ultimate action may depend on what time it is. So the ultimate action to take it, 409 00:22:13,250 --> 00:22:16,820 time T, is [inaudible] actions, A. 410 00:22:16,820 --> 00:22:19,249 This 411 00:22:19,249 --> 00:22:22,350 is basically the 412 00:22:22,350 --> 00:22:26,840 augmat of exactly that same thing on the right-hand side as we had 413 00:22:26,840 --> 00:22:29,970 in our dynamic programming 414 00:22:29,970 --> 00:22:33,060 algorithm. So you do this for every time step, and now you compute it, 415 00:22:33,060 --> 00:22:36,950 your optimal policy for different time 416 00:22:36,950 --> 00:22:40,410 steps. Again, this is a non-stationary policy 417 00:22:40,410 --> 00:22:42,290 because pi star 418 00:22:42,290 --> 00:22:46,920 of S my depend on what time it is. 419 00:22:46,920 --> 00:22:50,420 So [inaudible] the difference between this 420 00:22:50,420 --> 00:22:57,150 and the early version of the earlier version of value iteration is that - so what you do is you complete V star of T. 421 00:22:57,150 --> 00:23:00,620 Then using the backwards recursion of that [inaudible] algorithm, you computer V star T 422 00:23:00,620 --> 00:23:04,130 minus one. Then V star T minus two 423 00:23:04,130 --> 00:23:06,710 and so on down to V star of zero. 424 00:23:06,710 --> 00:23:09,660 Then from these, you compute pi star. 425 00:23:09,660 --> 00:23:13,210 426 00:23:13,210 --> 00:23:17,010 So one - there's not a huge difference, but one minus 427 00:23:17,010 --> 00:23:18,100 difference [inaudible] 428 00:23:18,100 --> 00:23:21,919 the infinite horizon discounted case 429 00:23:21,919 --> 00:23:26,350 is that by running this recursion once, you now have exactly 430 00:23:26,350 --> 00:23:29,550 the right value function. So this just computes the value function, rather 431 00:23:29,550 --> 00:23:30,560 than 432 00:23:30,560 --> 00:23:37,560 merely converging [inaudible]. This just gives you the right value function with one 433 00:23:37,710 --> 00:23:39,710 434 00:23:39,710 --> 00:23:42,710 pass. Cool. Any questions there? Yeah. 435 00:23:42,710 --> 00:23:47,600 Interviewee: [Inaudible]. Instructor (Andrew 436 00:23:47,600 --> 00:23:50,450 437 00:23:50,450 --> 00:23:54,080 Ng):This computation's much shorter than valuations. So 438 00:23:54,080 --> 00:23:58,460 sort of yes and no. It depends on how large capital T is. Interviewee: [Inaudible] the normal NVP, could 439 00:23:58,460 --> 00:24:05,460 [inaudible] and then use this case for that case? Instructor (Andrew Ng):I see. 440 00:24:06,100 --> 00:24:09,430 So for the normal NVP, can you assume capital T and then assume this? So 441 00:24:09,430 --> 00:24:13,049 it actually turns out that - that's a 442 00:24:13,049 --> 00:24:14,740 great question. Let 443 00:24:14,740 --> 00:24:18,640 me just answer this in a hand-wavy way. 444 00:24:18,640 --> 00:24:23,140 So it actually turns out for a discounted infinite horizon NVP 445 00:24:23,140 --> 00:24:25,670 where some discount factor gamma. 446 00:24:25,670 --> 00:24:29,430 So what you can do is you can say, 447 00:24:29,430 --> 00:24:31,160 after T times X, 448 00:24:31,160 --> 00:24:34,990 gamma to the T would be really small. It would be like [inaudible] something. I 449 00:24:34,990 --> 00:24:36,130 don't 450 00:24:36,130 --> 00:24:39,750 really care what happens after that many times because the rewards are 451 00:24:39,750 --> 00:24:42,630 multiplied by gamma to the T. After that, I don't really care. So you can 452 00:24:42,630 --> 00:24:43,919 ask, 453 00:24:43,919 --> 00:24:45,120 can I take 454 00:24:45,120 --> 00:24:48,490 my infinite horizon discounted NVP 455 00:24:48,490 --> 00:24:53,190 and approximate that with a finite horizon NVP where the number of times, steps T, 456 00:24:53,190 --> 00:24:55,180 is chosen so that [inaudible] true. So it 457 00:24:55,180 --> 00:24:57,700 turns out you can do that. 458 00:24:57,700 --> 00:25:01,890 Then you end up with some value for T. You can solve for T so 459 00:25:01,890 --> 00:25:03,910 that this holds true. 460 00:25:03,910 --> 00:25:09,500 It turns out you can prove that if you took the original value iteration algorithm 461 00:25:09,500 --> 00:25:14,780 and if you run the original value of the [inaudible] algorithm - the version for discounted NVPs. 462 00:25:14,780 --> 00:25:16,830 If you run that for 463 00:25:16,830 --> 00:25:19,590 this same number of time steps, 464 00:25:19,590 --> 00:25:23,790 you will end up with an approximation to the value function that is 465 00:25:23,790 --> 00:25:26,730 about this close, up to some small constant factors. 466 00:25:26,730 --> 00:25:30,960 So to do that, you end up with roughly the same amounts of computation anyway. 467 00:25:30,960 --> 00:25:32,550 Then 468 00:25:32,550 --> 00:25:35,780 you actually end up with a non-stationary policy, which is more expensive to keep 469 00:25:35,780 --> 00:25:36,650 470 00:25:36,650 --> 00:25:40,720 around. You need to keep around the different policy every time step, which is not 471 00:25:40,720 --> 00:25:46,880 as nice as if you had the stationary policy, same policy for all times. So 472 00:25:46,880 --> 00:25:50,600 there are other reasons, but sometimes you might take an 473 00:25:50,600 --> 00:25:54,560 infinite horizon discounted problem and approximate it to a finite horizon problem. 474 00:25:54,560 --> 00:25:56,190 But 475 00:25:56,190 --> 00:25:58,760 this particular reason is 476 00:25:58,760 --> 00:26:01,270 not the one. That makes 477 00:26:01,270 --> 00:26:04,210 sense. More 478 00:26:04,210 --> 00:26:11,210 questions? Interviewee: [Inaudible]? Instructor (Andrew 479 00:26:12,580 --> 00:26:14,680 Ng):Is 480 00:26:14,680 --> 00:26:16,870 there a gamma in this? 481 00:26:16,870 --> 00:26:20,159 So if you want, you can actually 482 00:26:20,159 --> 00:26:22,570 change the definition of an NVP 483 00:26:22,570 --> 00:26:27,920 and use a finite horizon discounted NVP. If you want, you can do that. You 484 00:26:27,920 --> 00:26:29,040 can actually come 485 00:26:29,040 --> 00:26:32,090 in and put a gamma there, 486 00:26:32,090 --> 00:26:35,140 and use this counting the finite horizon. 487 00:26:35,140 --> 00:26:37,690 It turns out that usually, for most 488 00:26:37,690 --> 00:26:41,470 problems that people deal with, you either use discounting 489 00:26:41,470 --> 00:26:43,220 or you 490 00:26:43,220 --> 00:26:47,059 use the finite horizon. It's been less common to do both, but you can certainly do as well. 491 00:26:47,059 --> 00:26:51,990 One of the nice things about discounting, it makes such your value function is finite. 492 00:26:51,990 --> 00:26:52,860 493 00:26:52,860 --> 00:26:57,549 Algorithmically and mathematically, one of the reasons to use discounting is 494 00:26:57,549 --> 00:27:01,309 because you're multiplying your rewards exponentially. It's a geometrically 495 00:27:01,309 --> 00:27:02,640 [inaudible] series. 496 00:27:02,640 --> 00:27:05,049 It shows that the value function is always finite. 497 00:27:05,049 --> 00:27:07,780 This is a really nice mathematical properties when you do discounting. 498 00:27:07,780 --> 00:27:11,360 So when you have a finite horizon anyway, 499 00:27:11,360 --> 00:27:15,179 then the value function's also guaranteed to be finite. So with that, you 500 00:27:15,179 --> 00:27:22,179 don't have to use discounting. But if you want, you can actually discount 501 00:27:24,130 --> 00:27:27,910 as well. Interviewee: [Inaudible]. Instructor (Andrew Ng):Yeah, yes, you're right. If you want, you can redefine the reward 502 00:27:27,910 --> 00:27:34,910 function to go downward into the to the reward function, since we have nonstationary rewards as well. So 503 00:27:39,690 --> 00:27:41,049 that was 504 00:27:41,049 --> 00:27:44,230 finite horizon 505 00:27:44,230 --> 00:27:45,409 NVPs. 506 00:27:45,409 --> 00:27:48,960 What I want to do now is actually use 507 00:27:48,960 --> 00:27:53,030 both of these ideas, your state action rewards and your finite horizon NVPs 508 00:27:53,030 --> 00:27:57,580 to describe a special case of NVPs that 509 00:27:57,580 --> 00:28:00,380 makes very strong assumptions about the problem. 510 00:28:00,380 --> 00:28:00,880 511 00:28:00,880 --> 00:28:04,390 But these assumptions are reasonable for many systems. With these 512 00:28:04,390 --> 00:28:06,090 assumptions, what we come up with, I 513 00:28:06,090 --> 00:28:11,960 think, are very nice and very elegant algorithms for solving even very large 514 00:28:11,960 --> 00:28:16,240 NVPs. 515 00:28:16,240 --> 00:28:23,240 516 00:28:30,100 --> 00:28:32,820 So let's talk about linear 517 00:28:32,820 --> 00:28:35,760 quadratic regulation. 518 00:28:35,760 --> 00:28:42,760 519 00:28:46,740 --> 00:28:47,710 520 00:28:47,710 --> 00:28:49,570 521 00:28:49,570 --> 00:28:51,669 522 00:28:51,669 --> 00:28:55,860 We just talked about dynamic programming for finite horizon NVPs, so 523 00:28:55,860 --> 00:28:58,149 just remember that algorithm. 524 00:28:58,149 --> 00:29:01,850 When I come back to talk about an algorithm for solving LQR problems, 525 00:29:01,850 --> 00:29:05,320 I'm actually going to use exactly that dynamic programming algorithm 526 00:29:05,320 --> 00:29:10,320 that you just saw for finite horizon NVPs. I'll be using exactly 527 00:29:10,320 --> 00:29:13,929 that algorithm again. So just remember that for now. 528 00:29:13,929 --> 00:29:16,320 So let's talk about LQR. 529 00:29:16,320 --> 00:29:21,750 So I want to take these ideas and apply them to NVPs with continuous 530 00:29:21,750 --> 00:29:25,150 state spaces and maybe even continuous action 531 00:29:25,150 --> 00:29:26,170 spaces. So 532 00:29:26,170 --> 00:29:28,740 to specify and 533 00:29:28,740 --> 00:29:33,679 NVPs, I need to give you this fivetuple 534 00:29:33,679 --> 00:29:36,380 of state actions, [inaudible] in the reward. I'm going to use 535 00:29:36,380 --> 00:29:40,730 the finite horizon, capital T, rather than 536 00:29:40,730 --> 00:29:42,570 discounting. 537 00:29:42,570 --> 00:29:47,809 So in LQR problems, I'm going to assume the following. I'm 538 00:29:47,809 --> 00:29:51,309 going to assume that the 539 00:29:51,309 --> 00:29:54,400 [inaudible] space is [inaudible] RN. 540 00:29:54,400 --> 00:29:58,289 And I'm going to assume, also, a continuous set of 541 00:29:58,289 --> 00:30:03,830 actions lie in RT. 542 00:30:03,830 --> 00:30:08,320 To specify the state transition probabilities, PSA, 543 00:30:08,320 --> 00:30:11,930 I need to tell you what the distribution of the mixed state is, given the current state and 544 00:30:11,930 --> 00:30:13,510 the current action. 545 00:30:13,510 --> 00:30:17,450 So we actually saw a little bit of this in the last lecture. I want to assume 546 00:30:17,450 --> 00:30:19,519 the next state, ST plus one, 547 00:30:19,519 --> 00:30:23,100 is going to be a linear function 548 00:30:23,100 --> 00:30:25,010 of the previous 549 00:30:25,010 --> 00:30:27,200 state, AST plus BAT 550 00:30:27,200 --> 00:30:29,309 plus WT - 551 00:30:29,309 --> 00:30:34,650 oh, excuse me. I meant to subscript that. 552 00:30:34,650 --> 00:30:40,590 553 00:30:40,590 --> 00:30:47,590 554 00:30:54,590 --> 00:30:59,190 Where WT is Gaussian [inaudible] would mean zero and some covariance 555 00:30:59,190 --> 00:31:02,260 given by sigma 556 00:31:02,260 --> 00:31:07,370 W. Subscripts at A and B here with subscripts T to show 557 00:31:07,370 --> 00:31:09,000 that 558 00:31:09,000 --> 00:31:11,809 these matrixes could change over time. So this would be 559 00:31:11,809 --> 00:31:14,940 non-stationary dynamics. 560 00:31:14,940 --> 00:31:20,200 As a point of notation, unfortunately compiling 561 00:31:20,200 --> 00:31:22,240 ideas from multiple literatures, 562 00:31:22,240 --> 00:31:26,220 so it's sort of unfortunately that capital A denotes both a set of actions 563 00:31:26,220 --> 00:31:27,860 as well as 564 00:31:27,860 --> 00:31:29,609 a matrix. 565 00:31:29,609 --> 00:31:33,889 When you see A later on, A will usually be used to denote a matrix, 566 00:31:33,889 --> 00:31:38,870 rather than a set of actions. So [inaudible] overload notation again, but 567 00:31:38,870 --> 00:31:41,820 unfortunately the notational conventions when you have 568 00:31:41,820 --> 00:31:46,660 research ideas in multiple research communities, often they share the same symbol. 569 00:31:46,660 --> 00:31:47,789 570 00:31:47,789 --> 00:31:49,830 So just to be concrete, 571 00:31:49,830 --> 00:31:53,020 AT is a matrix 572 00:31:53,020 --> 00:31:55,640 that's N 573 00:31:55,640 --> 00:32:01,270 by N. [Inaudible] matrixes that are N by D. 574 00:32:01,270 --> 00:32:02,130 575 00:32:02,130 --> 00:32:05,559 Just to be completely clear, right, the matrixes A and B, 576 00:32:05,559 --> 00:32:09,230 I'm going to assume, are fixed and known in advance. So I'm going to 577 00:32:09,230 --> 00:32:11,700 give you the matrixes, 578 00:32:11,700 --> 00:32:15,750 A and B, and I'm going to give you sigma W. Your job is to find a good policy 579 00:32:15,750 --> 00:32:17,060 for 580 00:32:17,060 --> 00:32:18,840 this NVP. So in other words, 581 00:32:18,840 --> 00:32:20,870 this is my specification 582 00:32:20,870 --> 00:32:25,309 of the state transition probabilities. 583 00:32:25,309 --> 00:32:28,020 Looking ahead, 584 00:32:28,020 --> 00:32:31,190 we see this later, it turns out this 585 00:32:31,190 --> 00:32:38,190 noise term is not very important. So 586 00:32:39,630 --> 00:32:40,850 it turns out that 587 00:32:40,850 --> 00:32:44,250 the treatment of the noise term is not very important. 588 00:32:44,250 --> 00:32:45,830 We'll see this later. We can 589 00:32:45,830 --> 00:32:47,179 pretty much 590 00:32:47,179 --> 00:32:49,300 ignore the noise 591 00:32:49,300 --> 00:32:52,680 term, and we'll still do fine. This is just a warning 592 00:32:52,680 --> 00:32:56,910 in the sequel, what I do later, I might be slightly sloppy 593 00:32:56,910 --> 00:32:57,970 in my treatment 594 00:32:57,970 --> 00:33:02,970 of the noise term. In this very special case, it would be unimportant. 595 00:33:02,970 --> 00:33:04,170 The 596 00:33:04,170 --> 00:33:05,549 last 597 00:33:05,549 --> 00:33:09,150 thing I have to specify is some horizon time, 598 00:33:09,150 --> 00:33:12,580 and then I also have some 599 00:33:12,580 --> 00:33:13,700 reward 600 00:33:13,700 --> 00:33:16,550 function. 601 00:33:16,550 --> 00:33:20,020 For LQR control, I'm going to assume that a reward function 602 00:33:20,020 --> 00:33:27,020 can be written as 603 00:33:30,090 --> 00:33:31,850 this, where 604 00:33:31,850 --> 00:33:35,260 UT is a matrix that's N by N. VT is 605 00:33:35,260 --> 00:33:36,439 a 606 00:33:36,439 --> 00:33:37,950 607 00:33:37,950 --> 00:33:41,020 matrix that's D by D. I'll 608 00:33:41,020 --> 00:33:44,440 assume that UT 609 00:33:44,440 --> 00:33:47,040 and VT are both positive semi-definite. 610 00:33:47,040 --> 00:33:51,289 Are both PSD. So 611 00:33:51,289 --> 00:33:57,360 the fact that UT and VT are both positive semi-definite matrixes, 612 00:33:57,360 --> 00:33:59,770 that implies that 613 00:33:59,770 --> 00:34:02,259 ST transpose, UT, ST [inaudible] 614 00:34:02,259 --> 00:34:07,000 zero. Similarly, ST transpose 615 00:34:07,000 --> 00:34:09,569 are VT, AT, [inaudible] zero. 616 00:34:09,569 --> 00:34:16,519 So this implies that 617 00:34:16,519 --> 00:34:20,190 your rewards are always negative. This is a 618 00:34:20,190 --> 00:34:23,529 somewhat depressing NVP in which there are only costs and no positive rewards, right, 619 00:34:23,529 --> 00:34:24,369 because of 620 00:34:24,369 --> 00:34:27,169 the minus sign 621 00:34:27,169 --> 00:34:28,569 there. 622 00:34:28,569 --> 00:34:35,569 So 623 00:34:41,399 --> 00:34:43,319 624 00:34:43,319 --> 00:34:46,659 as 625 00:34:46,659 --> 00:34:49,630 a complete example for how you might want to apply this, 626 00:34:49,630 --> 00:34:53,510 you've seen my helicopter videos, right? So one thing is, 627 00:34:53,510 --> 00:34:55,729 for example, 628 00:34:55,729 --> 00:34:57,989 suppose you have a helicopter, 629 00:34:57,989 --> 00:35:00,890 and you want the state ST to be 630 00:35:00,890 --> 00:35:04,229 as close to zero as possible. 631 00:35:04,229 --> 00:35:05,440 632 00:35:05,440 --> 00:35:08,059 Then 633 00:35:08,059 --> 00:35:10,210 you might choose UT 634 00:35:10,210 --> 00:35:12,619 to be equal to the identity matrix. 635 00:35:12,619 --> 00:35:16,709 This will make R of STAT be 636 00:35:16,709 --> 00:35:20,060 equal to ST transpose 637 00:35:20,060 --> 00:35:23,819 ST. But that's just 638 00:35:23,819 --> 00:35:27,239 639 00:35:27,239 --> 00:35:30,719 - I'll just 640 00:35:30,719 --> 00:35:33,809 write that down. [Inaudible] oh, excuse me - minus. 641 00:35:33,809 --> 00:35:38,429 The squared negative of the squared [inaudible] vector. So this would be penalizing the 642 00:35:38,429 --> 00:35:39,680 system 643 00:35:39,680 --> 00:35:40,769 quadratically 644 00:35:40,769 --> 00:35:45,219 for having a state that's half of zero, assuming that zero's the origin state. So if it 645 00:35:45,219 --> 00:35:46,889 goes to make a helicopter 646 00:35:46,889 --> 00:35:52,039 hover around the state zero, then you might choose this sort of reward function. It 647 00:35:52,039 --> 00:35:54,439 turns out 648 00:35:54,439 --> 00:35:55,279 649 00:35:55,279 --> 00:35:58,679 it's also very common for action to choose a cost 650 00:35:58,679 --> 00:36:03,209 for the action. So suppose I choose VT to be equal to an identity matrix. I get minus 651 00:36:03,209 --> 00:36:05,359 AT transpose 652 00:36:05,359 --> 00:36:09,509 AT 653 00:36:09,509 --> 00:36:12,259 here. Then minus [inaudible] actions as well. 654 00:36:12,259 --> 00:36:16,079 Including a quadratic cost for actions, it's also a fairly common 655 00:36:16,079 --> 00:36:17,099 thing to do. 656 00:36:17,099 --> 00:36:19,560 In practice, this tends to be effective 657 00:36:19,560 --> 00:36:22,630 of discouraging your system from 658 00:36:22,630 --> 00:36:26,209 jerking the controls around. This discourages making very huge control commands. Having 659 00:36:26,209 --> 00:36:27,779 a term like this 660 00:36:27,779 --> 00:36:31,859 reward function often makes many systems behave better. 661 00:36:31,859 --> 00:36:35,789 Of course, [inaudible] choose different values, we have different values on the diagonal to 662 00:36:35,789 --> 00:36:37,069 give different 663 00:36:37,069 --> 00:36:39,960 state variables, different weight and 664 00:36:39,960 --> 00:36:44,809 so on. So lots of possible choices for U and B. This is one example. 665 00:36:44,809 --> 00:36:49,849 666 00:36:49,849 --> 00:36:55,989 So 667 00:36:55,989 --> 00:36:59,420 for the next few steps, 668 00:36:59,420 --> 00:37:03,619 I'm going to write out things, I'm going to derive things, 669 00:37:03,619 --> 00:37:08,509 for the general case of non-stationary dynamics. I'm going - as I write 670 00:37:08,509 --> 00:37:11,529 out more math and more equations for LQR, 671 00:37:11,529 --> 00:37:13,809 I'm going to try write it out 672 00:37:13,809 --> 00:37:18,549 for the fairly general case of time varied dynamics and time varied reward 673 00:37:18,549 --> 00:37:21,639 functions. So I'm [inaudible] function. 674 00:37:21,639 --> 00:37:26,309 But for purposes of understanding this material, 675 00:37:26,309 --> 00:37:29,499 you might want to think of just ignoring many of the subscripts, in terms 676 00:37:29,499 --> 00:37:31,579 of T. So 677 00:37:31,579 --> 00:37:33,849 678 00:37:33,849 --> 00:37:36,049 for the sake of [inaudible] material, 679 00:37:36,049 --> 00:37:40,059 you might want to mentally assume that there are some fixed matrix, A, so 680 00:37:40,059 --> 00:37:46,079 that A is equal to A1, A2, equals A3 and so on. 681 00:37:46,079 --> 00:37:50,599 Similarly, there's some matrix B. 682 00:37:50,599 --> 00:37:51,509 Okay? 683 00:37:51,509 --> 00:37:55,859 So write it out for the fully general, non-stationary case, but 684 00:37:55,859 --> 00:37:59,420 you might just want to ignore many of the time subscripts and imagine the 685 00:37:59,420 --> 00:38:02,099 stationary case for now. 686 00:38:02,099 --> 00:38:04,079 687 00:38:04,079 --> 00:38:05,999 688 00:38:05,999 --> 00:38:07,419 689 00:38:07,419 --> 00:38:09,939 Quite a bit later, 690 00:38:09,939 --> 00:38:13,569 we're going to talk about an extension of this called differential dynamic programming that will 691 00:38:13,569 --> 00:38:16,670 actually use a non-stationary 692 00:38:16,670 --> 00:38:20,789 [inaudible] to a very powerful effect for a specific algorithm. But 693 00:38:20,789 --> 00:38:25,579 for most of what we're about to do, just pretend that NVP is stationary. 694 00:38:25,579 --> 00:38:32,579 Okay. 695 00:38:49,160 --> 00:38:50,919 So before I 696 00:38:50,919 --> 00:38:54,759 talk about models, let me jus say a couple of words about how you would go about 697 00:38:54,759 --> 00:38:56,910 coming up with the linear models. The 698 00:38:56,910 --> 00:39:00,630 key assumption in this model is that the dynamics are linear. There's also the assumption the 699 00:39:00,630 --> 00:39:03,429 reward function is quadratic, but 700 00:39:03,429 --> 00:39:06,849 let's talk about the assumption that the dynamics are linear. 701 00:39:06,849 --> 00:39:13,789 ST plus one equals AST plus VAT. Maybe time varying, maybe 702 00:39:13,789 --> 00:39:16,879 stationary. I'm just writing stationary for now. 703 00:39:16,879 --> 00:39:19,529 So how do you get models like this? 704 00:39:19,529 --> 00:39:23,169 We actually saw one example of this already in the previous lecture. If 705 00:39:23,169 --> 00:39:26,729 you have an inverted pendulum system, and you want to model the 706 00:39:26,729 --> 00:39:28,099 inverted pendulum 707 00:39:28,099 --> 00:39:28,900 708 00:39:28,900 --> 00:39:34,039 using a linear model like this, maybe [inaudible]. I'm not going to write that down. 709 00:39:34,039 --> 00:39:37,849 One thing you could do is run your inverted pendulum, start it off in some state as 710 00:39:37,849 --> 00:39:38,719 zero, 711 00:39:38,719 --> 00:39:40,799 take some action, A0, have it 712 00:39:40,799 --> 00:39:45,760 get to some state, S1. Take action A1 and so on, get to some state ST. 713 00:39:45,760 --> 00:39:48,409 Our index is one 714 00:39:48,409 --> 00:39:52,139 to denote that this is my first trial. Then you can repeat this a bunch of times. 715 00:39:52,139 --> 00:39:55,170 You can repeat this N times. I'm 716 00:39:55,170 --> 00:39:57,670 just executing actions on 717 00:39:57,670 --> 00:39:59,519 your physical robot. 718 00:39:59,519 --> 00:40:02,049 It could be a robot, it could be a 719 00:40:02,049 --> 00:40:05,739 chemical plant. It could be whatever. Trying out different actions in 720 00:40:05,739 --> 00:40:10,009 your system and watch 721 00:40:10,009 --> 00:40:12,770 722 00:40:12,770 --> 00:40:14,989 what states it gets to. So 723 00:40:14,989 --> 00:40:20,619 for the linear model to your data, and choose the parameters A 724 00:40:20,619 --> 00:40:25,659 and B, 725 00:40:25,659 --> 00:40:27,939 that minimize 726 00:40:27,939 --> 00:40:34,939 727 00:40:38,049 --> 00:40:39,709 the quadratic 728 00:40:39,709 --> 00:40:41,369 error term. 729 00:40:41,369 --> 00:40:44,030 730 00:40:44,030 --> 00:40:46,949 731 00:40:46,949 --> 00:40:51,640 So this says how well does AST plus BAT 732 00:40:51,640 --> 00:40:55,359 predict ST plus one. So you minimize the quadratic penalty term. This would be 733 00:40:55,359 --> 00:40:57,719 one reasonable way to estimate 734 00:40:57,719 --> 00:41:01,390 the parameters of a linear dynamical system for a 735 00:41:01,390 --> 00:41:08,390 physical robot or a physical chemical part of whatever they may have. Another 736 00:41:08,839 --> 00:41:10,569 way to 737 00:41:10,569 --> 00:41:13,129 come up with a linear 738 00:41:13,129 --> 00:41:14,209 739 00:41:14,209 --> 00:41:19,199 model 740 00:41:19,199 --> 00:41:24,180 consistently, if I want to control, is to take a nonlinear model and to 741 00:41:24,180 --> 00:41:29,920 linearize it. Let me show you what I mean by that. So you can linearize a 742 00:41:29,920 --> 00:41:35,819 nonlinear model. 743 00:41:35,819 --> 00:41:37,439 So 744 00:41:37,439 --> 00:41:38,570 745 00:41:38,570 --> 00:41:41,750 let's say you have some nonlinear model 746 00:41:41,750 --> 00:41:45,289 that expresses ST plus one as some function of 747 00:41:45,289 --> 00:41:46,219 ST 748 00:41:46,219 --> 00:41:47,499 and 749 00:41:47,499 --> 00:41:49,529 AT. 750 00:41:49,529 --> 00:41:53,840 In the example in the previous lecture, I said for the inverted pendulum 751 00:41:53,840 --> 00:41:56,579 752 00:41:56,579 --> 00:41:57,489 [inaudible]. 753 00:41:57,489 --> 00:42:01,440 By referring to the laws of physics. It was actually by downloading off the shelf 754 00:42:01,440 --> 00:42:04,380 software for doing physics simulations. So if you haven't 755 00:42:04,380 --> 00:42:08,139 seen [inaudible] before, you can go online. You can easily find 756 00:42:08,139 --> 00:42:11,270 many open-source packages for simulating the 757 00:42:11,270 --> 00:42:14,350 physics of simple devices like these. 758 00:42:14,350 --> 00:42:18,430 Download the software, type in the specifications of your robot, and it will 759 00:42:18,430 --> 00:42:21,980 simulate the physics that you use. There's lots of open-source software patches like that. You can just 760 00:42:21,980 --> 00:42:23,829 download them. 761 00:42:23,829 --> 00:42:27,060 But something like that, you can now build a physics simulator 762 00:42:27,060 --> 00:42:31,730 that predicts the state as a function of the previous state and the previous action. So 763 00:42:31,730 --> 00:42:34,029 you actually come up with some function that 764 00:42:34,029 --> 00:42:39,550 says that 765 00:42:39,550 --> 00:42:41,529 766 00:42:41,529 --> 00:42:45,169 - the state [inaudible] next time. The [inaudible] vector 767 00:42:45,169 --> 00:42:52,169 will be some function of the current state 768 00:42:53,350 --> 00:42:54,630 and the current 769 00:42:54,630 --> 00:42:58,019 action, where the action in this case is just a real number that says how 770 00:42:58,019 --> 00:43:02,129 hard you accelerated to the left or right. 771 00:43:02,129 --> 00:43:06,359 Then you can take this nonlinear model. I actually wrote down a sample of a 772 00:43:06,359 --> 00:43:08,119 model in the last lecture, but 773 00:43:08,119 --> 00:43:11,390 in general, F would be some nonlinear function. [Inaudible] of 774 00:43:11,390 --> 00:43:12,769 a linear function. 775 00:43:12,769 --> 00:43:16,919 So what I mean by linearize is the following. So here's just a cartoon. I'll 776 00:43:16,919 --> 00:43:19,819 write down the math in a second. Let's say the horizontal acces is 777 00:43:19,819 --> 00:43:24,029 the input state, ST, 778 00:43:24,029 --> 00:43:27,679 and the output state, ST plus one, as I said. 779 00:43:27,679 --> 00:43:31,779 Here's 780 00:43:31,779 --> 00:43:34,709 the function at F. 781 00:43:34,709 --> 00:43:38,689 So the next state, ST plus one, will be some function of the previous state, 782 00:43:38,689 --> 00:43:42,029 ST and the action 783 00:43:42,029 --> 00:43:46,009 AT. So to linearize this model, what you would do is you would choose a point. 784 00:43:46,009 --> 00:43:49,019 We'll call this bar 785 00:43:49,019 --> 00:43:54,140 T. Then you would take the derivative of this 786 00:43:54,140 --> 00:43:56,469 function. For the 787 00:43:56,469 --> 00:44:00,279 [inaudible] straight line to that function. 788 00:44:00,279 --> 00:44:04,449 So this allows you to express the next state, ST plus one. You 789 00:44:04,449 --> 00:44:07,229 can approximate the next state, ST plus one, as 790 00:44:07,229 --> 00:44:10,089 this linear function of the 791 00:44:10,089 --> 00:44:13,999 previous state, ST. So 792 00:44:13,999 --> 00:44:17,809 to make this cartoon really right, the horizontal access here is really a state 793 00:44:17,809 --> 00:44:20,109 action pair. 794 00:44:20,109 --> 00:44:24,650 You're linearizing around. So this is just a cartoon. The horizontal 795 00:44:24,650 --> 00:44:29,699 access represents the input state and the input action. 796 00:44:29,699 --> 00:44:36,699 So 797 00:44:42,049 --> 00:44:45,549 just to write this out in 798 00:44:45,549 --> 00:44:49,529 math, I'll write out the simple case first and the fully general one 799 00:44:49,529 --> 00:44:51,089 in a second. Suppose 800 00:44:51,089 --> 00:44:54,709 the horizontal access was only this state. So let's pretend interactions they [inaudible] now. 801 00:44:54,709 --> 00:44:56,099 ST plus 802 00:44:56,099 --> 00:45:00,309 one is just some function of ST, than that linear function I drew would be 803 00:45:00,309 --> 00:45:02,219 ST plus one. 804 00:45:02,219 --> 00:45:04,450 We're approximating 805 00:45:04,450 --> 00:45:08,349 as F prime evaluated at some point as bar T 806 00:45:08,349 --> 00:45:12,189 times ST times S bar T. 807 00:45:12,189 --> 00:45:15,579 Plus 808 00:45:15,579 --> 00:45:17,829 S bar 809 00:45:17,829 --> 00:45:21,879 T. So with this, you'd express ST plus one as a linear 810 00:45:21,879 --> 00:45:22,719 function 811 00:45:22,719 --> 00:45:29,359 of ST. Just note that S bar T is a constant. 812 00:45:29,359 --> 00:45:31,179 813 00:45:31,179 --> 00:45:32,520 It's not 814 00:45:32,520 --> 00:45:36,520 a variable. Does that make sense? S bar T is a constant. F prime of S bar T is gradient of the function F at the point 815 00:45:36,520 --> 00:45:38,019 S bar T. This is 816 00:45:38,019 --> 00:45:41,779 really just the equation of that linear function. So you 817 00:45:41,779 --> 00:45:46,309 can then convert this to 818 00:45:46,309 --> 00:45:51,809 819 00:45:51,809 --> 00:45:55,399 A and B matrixes. Jumping back one board, I'm going 820 00:45:55,399 --> 00:45:57,040 to point out one other thing. 821 00:45:57,040 --> 00:45:57,829 Let's 822 00:45:57,829 --> 00:45:59,929 say I look at this straight line, 823 00:45:59,929 --> 00:46:01,159 and I ask 824 00:46:01,159 --> 00:46:03,489 how well does this straight line 825 00:46:03,489 --> 00:46:07,769 approximate my function F, my original simulator, my original function F. 826 00:46:07,769 --> 00:46:10,630 Then you sort of notice that 827 00:46:10,630 --> 00:46:12,440 in this neighborhood, 828 00:46:12,440 --> 00:46:14,830 in the neighborhood of S bar, there's a 829 00:46:14,830 --> 00:46:17,400 pretty good approximation. It's fairly close. 830 00:46:17,400 --> 00:46:19,389 But then as you move further away, 831 00:46:19,389 --> 00:46:24,419 moving far off to the left here, it becomes a pretty terrible approximation. 832 00:46:24,419 --> 00:46:25,539 So 833 00:46:25,539 --> 00:46:29,139 when you linearize 834 00:46:29,139 --> 00:46:33,099 a nonlinear model to apply LQR, 835 00:46:33,099 --> 00:46:36,789 one of the parameters you have to choose would be the point around which to 836 00:46:36,789 --> 00:46:38,960 linearize your nonlinear model. 837 00:46:38,960 --> 00:46:44,169 So if you expect your inverted pendulum system 838 00:46:44,169 --> 00:46:48,509 to spend most of its time in the vicinity of this state, 839 00:46:48,509 --> 00:46:49,409 then it'd 840 00:46:49,409 --> 00:46:53,459 be reasonable to linearize around this state 841 00:46:53,459 --> 00:46:54,830 because that means that 842 00:46:54,830 --> 00:46:57,319 the linear approximation would be a good approximation, 843 00:46:57,319 --> 00:47:02,489 usually, for the states that you expect [inaudible] to spend most of this time. 844 00:47:02,489 --> 00:47:06,179 If conversely, you expect the system to spend most of its time 845 00:47:06,179 --> 00:47:07,739 at states far to the left, then this would be 846 00:47:07,739 --> 00:47:10,039 a terrible location to linearize. 847 00:47:10,039 --> 00:47:14,269 So one rule of thumb is to choose the position to linearize according to where 848 00:47:14,269 --> 00:47:17,419 you expect the system to spend most of its time 849 00:47:17,419 --> 00:47:20,369 so that the linear approximation will tend to be 850 00:47:20,369 --> 00:47:22,929 an accurate approximation in the 851 00:47:22,929 --> 00:47:27,269 vicinity of the states [inaudible]. 852 00:47:27,269 --> 00:47:28,770 Just to be fair, it is about 853 00:47:28,770 --> 00:47:30,109 choosing the point, S bar, 854 00:47:30,109 --> 00:47:31,639 A bar, 855 00:47:31,639 --> 00:47:34,169 that we'll use 856 00:47:34,169 --> 00:47:36,949 to come up with a linear function 857 00:47:36,949 --> 00:47:40,599 that we'll pretend it's a good approximation to my original 858 00:47:40,599 --> 00:47:47,079 nonlinear function, 859 00:47:47,079 --> 00:47:50,269 860 00:47:50,269 --> 00:47:55,269 861 00:47:55,269 --> 00:48:00,589 F. So for an example like the inverted pendulum problem, this problem, if 862 00:48:00,589 --> 00:48:03,740 you expect to do pretty well in this problem, 863 00:48:03,740 --> 00:48:08,089 then you would expect the state 864 00:48:08,089 --> 00:48:12,680 to often be near the zero state. 865 00:48:12,680 --> 00:48:17,890 If S equals zero corresponds to X being the center of the railway track that the inverted 866 00:48:17,890 --> 00:48:20,249 pendulum lives on. You 867 00:48:20,249 --> 00:48:23,000 expect to do fairly well. You expect the pole to 868 00:48:23,000 --> 00:48:24,689 mostly be upright 869 00:48:24,689 --> 00:48:28,559 [inaudible] upright at zero degrees or 90 degrees, I guess. So you choose 870 00:48:28,559 --> 00:48:32,910 whatever state corresponds to having the pole 871 00:48:32,910 --> 00:48:33,959 upright. The zero 872 00:48:33,959 --> 00:48:37,299 velocity [inaudible], near zero velocity in the middle of the track. 873 00:48:37,299 --> 00:48:40,249 So you usually choose that as a state 874 00:48:40,249 --> 00:48:42,579 to linearize 875 00:48:42,579 --> 00:48:45,680 your inverted pendulum dynamics around. That's 876 00:48:45,680 --> 00:48:52,680 a region where you might want your approximation to be good. 877 00:48:55,859 --> 00:48:59,759 So I wrote this down. To come back to this formula, I wrote this down for the special 878 00:48:59,759 --> 00:49:02,229 case of a one D state variable. If there 879 00:49:02,229 --> 00:49:08,039 are no actions. The general formula for the linearization approximation is ST 880 00:49:08,039 --> 00:49:08,949 881 00:49:08,949 --> 00:49:11,869 plus one were approximate as F 882 00:49:11,869 --> 00:49:14,559 of S bar T. 883 00:49:14,559 --> 00:49:17,119 A bar T 884 00:49:17,119 --> 00:49:24,119 plus 885 00:49:33,469 --> 00:49:40,469 886 00:49:40,700 --> 00:49:47,700 887 00:49:50,459 --> 00:49:54,229 - Okay? 888 00:49:54,229 --> 00:49:55,240 Where 889 00:49:55,240 --> 00:49:59,669 these upside down triangles are an unusual 890 00:49:59,669 --> 00:50:02,739 symbol for taking the 891 00:50:02,739 --> 00:50:07,289 derivative of F with respect to [inaudible] vector value, second argument. So 892 00:50:07,289 --> 00:50:12,169 by choosing an appropriate state as bar T, A bar T to linearize 893 00:50:12,169 --> 00:50:12,950 around, 894 00:50:12,950 --> 00:50:14,009 you'd 895 00:50:14,009 --> 00:50:17,920 now express ST plus one 896 00:50:17,920 --> 00:50:22,039 as a linear function of the current state and 897 00:50:22,039 --> 00:50:23,279 the current 898 00:50:23,279 --> 00:50:24,279 action, AT. Again, 899 00:50:24,279 --> 00:50:29,420 these things, S bar T, is a constant that you choose ahead of time. 900 00:50:29,420 --> 00:50:32,409 Same for A bar 901 00:50:32,409 --> 00:50:33,989 902 00:50:33,989 --> 00:50:34,959 903 00:50:34,959 --> 00:50:40,109 T. Lastly, having linearized this thing, you can then convert it 904 00:50:40,109 --> 00:50:43,260 to matrixes 905 00:50:43,260 --> 00:50:46,509 like 906 00:50:46,509 --> 00:50:48,849 that. 907 00:50:48,849 --> 00:50:55,460 So the ST plus one is now a linear function of ST and 908 00:50:55,460 --> 00:50:56,789 AT. 909 00:50:56,789 --> 00:51:03,789 Questions about this? 910 00:51:07,119 --> 00:51:10,759 911 00:51:10,759 --> 00:51:16,019 So just one tiny detail, and it's really not a huge deal, is that this thing 912 00:51:16,019 --> 00:51:20,179 below is technically an [inaudible] function. 913 00:51:20,179 --> 00:51:23,140 There might actually be an extra constant there, but 914 00:51:23,140 --> 00:51:27,199 this is not a difficult generalization of a linear dynamical system definition. 915 00:51:27,199 --> 00:51:28,130 916 00:51:28,130 --> 00:51:32,320 One way to deal with that constant is actually to do something like take your 917 00:51:32,320 --> 00:51:36,249 definition for this state, let's say XX dot theta theta dot. 918 00:51:36,249 --> 00:51:38,859 You can then augment your state vector to 919 00:51:38,859 --> 00:51:40,959 have an extra interceptor, one. 920 00:51:40,959 --> 00:51:46,069 With the interceptor one and working out the A matrix, you can then 921 00:51:46,069 --> 00:51:49,170 take care of the extra constant, C, as well. So you can 922 00:51:49,170 --> 00:51:50,679 deal with this thing being - 923 00:51:50,679 --> 00:51:54,809 technically it's an affine function because of this extra offset, rather than a linear 924 00:51:54,809 --> 00:51:55,569 function. 925 00:51:55,569 --> 00:52:01,809 But this is just a little bit of bookkeeping [inaudible] for yourself and shouldn't 926 00:52:01,809 --> 00:52:02,829 be 927 00:52:02,829 --> 00:52:04,689 a huge 928 00:52:04,689 --> 00:52:06,459 929 00:52:06,459 --> 00:52:13,459 deal. 930 00:52:17,289 --> 00:52:18,059 931 00:52:18,059 --> 00:52:19,449 So to summarize, you see I 932 00:52:19,449 --> 00:52:20,729 have this up, 933 00:52:20,729 --> 00:52:24,699 you can learn a model, you can take a nonlinear model. Your nonlinear 934 00:52:24,699 --> 00:52:28,789 model can be a physics model or a nonlinear model you learned and linearize it. 935 00:52:28,789 --> 00:52:29,559 936 00:52:29,559 --> 00:52:32,299 Now I'll post an LQR problem 937 00:52:32,299 --> 00:52:35,339 in which we have 938 00:52:35,339 --> 00:52:41,689 specification of the NVP in which the states are in RN, the actions are in RD, 939 00:52:41,689 --> 00:52:46,669 and the state has zero probabilities given by 940 00:52:46,669 --> 00:52:50,239 the [inaudible] linear equation. SD plus one equals 941 00:52:50,239 --> 00:52:53,849 ATST plus BTAT. Our rewards are going to be 942 00:52:53,849 --> 00:52:56,989 these quadratic functions. 943 00:52:56,989 --> 00:53:01,900 So the specification of the NVP means that we know the A matrixes, 944 00:53:01,900 --> 00:53:06,089 the B matrixes, the U matrixes 945 00:53:06,089 --> 00:53:08,930 and the V matrixes. Our goal is to come up with a policy 946 00:53:08,930 --> 00:53:13,789 to maximize our finite 947 00:53:13,789 --> 00:53:19,049 horizon sum of rewards. 948 00:53:19,049 --> 00:53:20,029 949 00:53:20,029 --> 00:53:23,149 So our goal is to come up with a policy, 950 00:53:23,149 --> 00:53:25,799 first, to maximize 951 00:53:25,799 --> 00:53:29,059 the expected value of this 952 00:53:29,059 --> 00:53:34,279 finite horizon sum of 953 00:53:34,279 --> 00:53:39,009 954 00:53:39,009 --> 00:53:46,009 rewards. Okay. 955 00:53:48,219 --> 00:53:49,390 So our 956 00:53:49,390 --> 00:53:53,549 approach to solving this problem 957 00:53:53,549 --> 00:53:56,899 will be exactly that 958 00:53:56,899 --> 00:54:00,019 finite horizon dynamic programming algorithm that we worked out a 959 00:54:00,019 --> 00:54:01,769 little earlier in this 960 00:54:01,769 --> 00:54:03,579 lecture. In particular, 961 00:54:03,579 --> 00:54:06,569 my strategy for finding the optimal policy 962 00:54:06,569 --> 00:54:08,879 will be to 963 00:54:08,879 --> 00:54:12,359 first find V star of 964 00:54:12,359 --> 00:54:14,219 T, the capital T, 965 00:54:14,219 --> 00:54:18,559 and then I'll apply by a recursion to find V star of T minus one, V star of T minus 966 00:54:18,559 --> 00:54:22,989 two and so on. 967 00:54:22,989 --> 00:54:27,619 In the dynamic programming algorithm we worked out, V star subscript T of the 968 00:54:27,619 --> 00:54:29,209 state ST, this is the maximum 969 00:54:29,209 --> 00:54:31,279 [inaudible] 970 00:54:31,279 --> 00:54:35,440 actions you might take at that time of R 971 00:54:35,440 --> 00:54:39,299 of STAT. 972 00:54:39,299 --> 00:54:42,800 Again, just for the sake of understanding this material, you can probably 973 00:54:42,800 --> 00:54:47,170 pretend the rewards and the dynamics are actually stationary. I'll write out all 974 00:54:47,170 --> 00:54:53,249 these superscripts all the time [inaudible] if you're reading this for the first time. 975 00:54:53,249 --> 00:54:57,340 The reward is equal to max of 976 00:54:57,340 --> 00:55:04,340 AT of minus - 977 00:55:11,829 --> 00:55:16,099 right? I hope this isn't confusing. The superscript Ts denote transposes. 978 00:55:16,099 --> 00:55:19,849 The lowercase Ts denote the time index capital T. 979 00:55:19,849 --> 00:55:23,769 So that's just a definition of my next quadratic awards. So this 980 00:55:23,769 --> 00:55:27,859 is clearly maximized as minus 981 00:55:27,859 --> 00:55:29,100 982 00:55:29,100 --> 00:55:31,269 ST transpose 983 00:55:31,269 --> 00:55:37,889 UTST 984 00:55:37,889 --> 00:55:39,949 because 985 00:55:39,949 --> 00:55:43,909 that last term is - this is greater than or equal to zero. That gives 986 00:55:43,909 --> 00:55:47,759 me my assumption that VT is [inaudible] semi-definite. So the best action to take in 987 00:55:47,759 --> 00:55:49,399 the last time step 988 00:55:49,399 --> 00:55:52,639 is just the action 989 00:55:52,639 --> 00:55:53,769 zero. So 990 00:55:53,769 --> 00:56:00,769 pi star subscript T of ST is equal to the 991 00:56:02,320 --> 00:56:04,369 [inaudible] of actions of 992 00:56:04,369 --> 00:56:10,969 that same thing. It's just 993 00:56:10,969 --> 00:56:12,469 zero. It's by 994 00:56:12,469 --> 00:56:15,609 choosing the zero action, 995 00:56:15,609 --> 00:56:18,909 AT transpose VTAT becomes zero, and that's 996 00:56:18,909 --> 00:56:22,829 how this reward is 997 00:56:22,829 --> 00:56:25,999 998 00:56:25,999 --> 00:56:31,150 999 00:56:31,150 --> 00:56:38,150 maximized. 1000 00:56:50,259 --> 00:56:57,169 Any questions, or is something illegible? Okay. 1001 00:56:57,169 --> 00:57:00,390 1002 00:57:00,390 --> 00:57:04,330 So now let's do the dynamic programming step where 1003 00:57:04,330 --> 00:57:09,679 my goal is given 1004 00:57:09,679 --> 00:57:11,739 VT plus one, 1005 00:57:11,739 --> 00:57:15,799 I want to compute VT. Given V star T plus one, I want to compute 1006 00:57:15,799 --> 00:57:20,410 V star of T. So this is the dynamic programming step. 1007 00:57:20,410 --> 00:57:21,629 1008 00:57:21,629 --> 00:57:25,239 So the 1009 00:57:25,239 --> 00:57:29,499 DP steps I wrote down previously was this. So for the finite state case, I 1010 00:57:29,499 --> 00:57:36,499 wrote down the following. 1011 00:57:49,700 --> 00:57:52,279 1012 00:57:52,279 --> 00:57:59,279 1013 00:58:32,289 --> 00:58:35,769 So this is exactly the equation I wrote down 1014 00:58:35,769 --> 00:58:37,499 previously, 1015 00:58:37,499 --> 00:58:41,489 and this is what I wrote down for finite states, where you have these discreet state transition 1016 00:58:41,489 --> 00:58:43,679 probabilities, and we can sum over 1017 00:58:43,679 --> 00:58:48,039 this discreet set of states. Now we're going to continue as an infinite state 1018 00:58:48,039 --> 00:58:51,850 again, so this sum over state should actually become an integral. I'm going to 1019 00:58:51,850 --> 00:58:55,940 actually skip the integral step. We'll just go ahead and write this last term here as an 1020 00:58:55,940 --> 00:58:56,829 expectation. 1021 00:58:56,829 --> 00:59:03,829 So this is going to be max over actions AT 1022 00:59:04,309 --> 00:59:06,879 plus - and then this becomes and expectation 1023 00:59:06,879 --> 00:59:10,089 over the random mixed state, ST plus one, 1024 00:59:10,089 --> 00:59:13,500 [inaudible] from state transition probabilities given by P of STAT of V star T 1025 00:59:13,500 --> 00:59:16,429 plus one, ST 1026 00:59:16,429 --> 00:59:19,080 plus one. So this 1027 00:59:19,080 --> 00:59:21,309 is 1028 00:59:21,309 --> 00:59:23,719 the same equation written down 1029 00:59:23,719 --> 00:59:25,699 as an expectation. 1030 00:59:25,699 --> 00:59:26,879 1031 00:59:26,879 --> 00:59:31,519 So what I need to do is given a representation of V star T plus one, I 1032 00:59:31,519 --> 00:59:32,789 need to find V 1033 00:59:32,789 --> 00:59:36,430 star of T. 1034 00:59:36,430 --> 00:59:40,819 So it turns out that LQR has the following useful property. It turns out that each of 1035 00:59:40,819 --> 00:59:42,839 these value functions 1036 00:59:42,839 --> 00:59:46,129 can be represented as a quadratic function. 1037 00:59:46,129 --> 00:59:48,659 So concretely, 1038 00:59:48,659 --> 00:59:53,259 let's suppose that V 1039 00:59:53,259 --> 01:00:00,259 star T plus one - suppose that this can be expressed as a 1040 01:00:00,429 --> 01:00:07,429 quadratic 1041 01:00:09,369 --> 01:00:10,840 function, written like so, 1042 01:00:10,840 --> 01:00:11,859 where 1043 01:00:11,859 --> 01:00:14,880 the matrix phi T plus one 1044 01:00:14,880 --> 01:00:17,049 is an N by N matrix, and 1045 01:00:17,049 --> 01:00:21,339 psi T plus one 1046 01:00:21,339 --> 01:00:23,599 is just a real number. 1047 01:00:23,599 --> 01:00:29,160 So in other words, suppose V star T plus one is just a quadratic function 1048 01:00:29,160 --> 01:00:29,969 1049 01:00:29,969 --> 01:00:33,809 of the state ST 1050 01:00:33,809 --> 01:00:36,679 plus one. 1051 01:00:36,679 --> 01:00:40,170 We can then show that 1052 01:00:40,170 --> 01:00:44,169 when you do one dynamic programming step - when you 1053 01:00:44,169 --> 01:00:46,089 1054 01:00:46,089 --> 01:00:49,569 plug this definition of V star T plus one into your dynamic 1055 01:00:49,569 --> 01:00:51,879 programming step in the equation I had just now, 1056 01:00:51,879 --> 01:00:54,559 you can show that you would get 1057 01:00:54,559 --> 01:00:57,899 that V star T as well, will 1058 01:00:57,899 --> 01:01:01,749 also be a quadratic function of the 1059 01:01:01,749 --> 01:01:06,279 1060 01:01:06,279 --> 01:01:12,639 same form. [Inaudible] here, right? The sum-appropriate matrix, phi T 1061 01:01:12,639 --> 01:01:19,609 and sum appropriate real number, psi 1062 01:01:19,609 --> 01:01:21,839 of 1063 01:01:21,839 --> 01:01:27,819 T. So what you can do is stall off the recursion with - well, does that make 1064 01:01:27,819 --> 01:01:32,099 sense? 1065 01:01:32,099 --> 01:01:35,009 So what you can do is 1066 01:01:35,009 --> 01:01:36,769 stall off the recursion 1067 01:01:36,769 --> 01:01:41,139 as follows. So previously, we worked out that V star capital T, 1068 01:01:41,139 --> 01:01:44,219 we said that this is minus 1069 01:01:44,219 --> 01:01:47,539 ST 1070 01:01:47,539 --> 01:01:50,009 transpose UTST. So we 1071 01:01:50,009 --> 01:01:50,760 have that phi 1072 01:01:50,760 --> 01:01:53,160 of capital T is 1073 01:01:53,160 --> 01:01:56,329 equal to minus UT, 1074 01:01:56,329 --> 01:01:59,309 and psi of capital T is equal to zero. 1075 01:01:59,309 --> 01:02:00,700 Now V star T of ST 1076 01:02:00,700 --> 01:02:07,700 is equal to ST transpose phi of T, ST plus psi of T. 1077 01:02:08,419 --> 01:02:09,050 So 1078 01:02:09,050 --> 01:02:13,450 you can start out the recursion this way with phi of T equals minus UT and psi of T 1079 01:02:13,450 --> 01:02:14,939 equals zero. 1080 01:02:14,939 --> 01:02:16,849 Then work out 1081 01:02:16,849 --> 01:02:20,219 what the recursion is. I won't 1082 01:02:20,219 --> 01:02:22,819 1083 01:02:22,819 --> 01:02:27,119 actually do the full [inaudible]. This may be algebra, and 1084 01:02:27,119 --> 01:02:28,809 1085 01:02:28,809 --> 01:02:32,939 you've actually done this sort of Gaussian expectation 1086 01:02:32,939 --> 01:02:39,559 math a lot in your homework by now. 1087 01:02:39,559 --> 01:02:44,009 So I won't do the full derivation. I'll just outline the 1088 01:02:44,009 --> 01:02:45,039 1089 01:02:45,039 --> 01:02:46,869 one-ish G step. So 1090 01:02:46,869 --> 01:02:51,029 in dynamic programming step, V star ST is 1091 01:02:51,029 --> 01:02:53,239 equal to 1092 01:02:53,239 --> 01:02:55,679 max over actions AT of 1093 01:02:55,679 --> 01:03:02,679 the median reward. 1094 01:03:03,619 --> 01:03:05,809 So this was R of 1095 01:03:05,809 --> 01:03:09,479 SA from my equation in the dynamic programming step. 1096 01:03:09,479 --> 01:03:12,099 Then plus an expected value 1097 01:03:12,099 --> 01:03:18,279 over the random mixed state, ST plus one, drawn from the 1098 01:03:18,279 --> 01:03:22,079 Gaussian distribution would mean 1099 01:03:22,079 --> 01:03:23,890 ATST plus 1100 01:03:23,890 --> 01:03:28,169 BTAT and covariant sigma 1101 01:03:28,169 --> 01:03:31,889 W. So what this is, this is really my 1102 01:03:31,889 --> 01:03:35,029 specification for P of STAT. This is my 1103 01:03:35,029 --> 01:03:38,749 state transition distribution in the LQR setting. This is my 1104 01:03:38,749 --> 01:03:40,809 state transition distribution 1105 01:03:40,809 --> 01:03:43,369 [inaudible] take action AT 1106 01:03:43,369 --> 01:03:44,670 in 1107 01:03:44,670 --> 01:03:46,179 the state 1108 01:03:46,179 --> 01:03:50,319 ST. Then my next state is - distributed Gaussian would mean ATST plus BTAT and 1109 01:03:50,319 --> 01:03:50,979 covariant 1110 01:03:50,979 --> 01:03:54,430 sigma W. Then 1111 01:03:54,430 --> 01:04:01,430 of the this 1112 01:04:10,669 --> 01:04:13,719 state. This, of course, is just A star 1113 01:04:13,719 --> 01:04:17,109 T plus one 1114 01:04:17,109 --> 01:04:23,549 of ST plus one. I hope 1115 01:04:23,549 --> 01:04:27,459 this makes sense. This is just taking that equation I had previously in the 1116 01:04:27,459 --> 01:04:30,309 dynamic programming step. So the V star of T, ST 1117 01:04:30,309 --> 01:04:33,279 equals max over actions 1118 01:04:33,279 --> 01:04:35,679 of the immediate rewards 1119 01:04:35,679 --> 01:04:38,739 plus an expected value over the mixed state of 1120 01:04:38,739 --> 01:04:40,639 V star of the mixed state with 1121 01:04:40,639 --> 01:04:43,249 the clock advanced by one. 1122 01:04:43,249 --> 01:04:46,179 So I've just plugged in all the definitions as a reward of the 1123 01:04:46,179 --> 01:04:47,889 state [inaudible] distribution 1124 01:04:47,889 --> 01:04:54,889 and of the value function. 1125 01:04:55,909 --> 01:05:01,729 Actually, could you raise your hand if this makes sense? Cool. So if you write 1126 01:05:01,729 --> 01:05:03,429 this out 1127 01:05:03,429 --> 01:05:07,630 and you expand the expectation - I know you've done this many times, so I 1128 01:05:07,630 --> 01:05:08,979 won't do it - 1129 01:05:08,979 --> 01:05:11,470 this whole thing on the right-hand side 1130 01:05:11,470 --> 01:05:15,139 simplifies to a big quadratic function of the action, AT. 1131 01:05:15,139 --> 01:05:17,839 So this whole 1132 01:05:17,839 --> 01:05:22,839 thing simplifies to a big 1133 01:05:22,839 --> 01:05:26,399 quadratic function 1134 01:05:26,399 --> 01:05:32,449 1135 01:05:32,449 --> 01:05:35,909 of the action 1136 01:05:35,909 --> 01:05:37,859 AT. 1137 01:05:37,859 --> 01:05:42,489 We want to maximize this with respect to the actions AT. So to 1138 01:05:42,489 --> 01:05:44,339 maximize a big quadratic function, you 1139 01:05:44,339 --> 01:05:48,150 just take the derivatives of the functions with respect to the action AT, set the derivative equal 1140 01:05:48,150 --> 01:05:52,039 to zero, and then you've maximized the right-hand side, with 1141 01:05:52,039 --> 01:05:55,559 respect to the action, 1142 01:05:55,559 --> 01:05:59,709 AT. It turns out - I'm just going to write this expression down for completeness. You 1143 01:05:59,709 --> 01:06:02,209 can derive it yourself at any time. It turns 1144 01:06:02,209 --> 01:06:05,239 out if you actually maximize that thing on the right- hand side as a 1145 01:06:05,239 --> 01:06:07,299 function of the actions, AT, 1146 01:06:07,299 --> 01:06:12,469 you find that [inaudible] AT is going to be 1147 01:06:12,469 --> 01:06:14,429 that 1148 01:06:14,429 --> 01:06:21,429 times 1149 01:06:23,319 --> 01:06:25,930 ST. 1150 01:06:25,930 --> 01:06:29,389 Don't 1151 01:06:29,389 --> 01:06:34,200 worry about this expression. You can get it from [inaudible] and derive it yourself. 1152 01:06:34,200 --> 01:06:37,839 But the key thing to note is that the optimal action, AT, is going to be some big 1153 01:06:37,839 --> 01:06:38,949 matrix. 1154 01:06:38,949 --> 01:06:40,409 We're going to call this thing 1155 01:06:40,409 --> 01:06:43,299 LT 1156 01:06:43,299 --> 01:06:47,259 times 1157 01:06:47,259 --> 01:06:49,009 1158 01:06:49,009 --> 01:06:51,459 ST. In other words, the optimal action to take in this 1159 01:06:51,459 --> 01:06:53,329 given state is going to be 1160 01:06:53,329 --> 01:06:55,249 some linear function 1161 01:06:55,249 --> 01:06:59,289 of the state, ST. So 1162 01:06:59,289 --> 01:07:02,819 having done dynamic programming, you remember also when we worked out the 1163 01:07:02,819 --> 01:07:04,979 dynamic programming algorithm for finite 1164 01:07:04,979 --> 01:07:06,279 horizon NVPs, 1165 01:07:06,279 --> 01:07:07,999 we said that 1166 01:07:07,999 --> 01:07:12,109 the way you compute the optimal policy, pi star of T 1167 01:07:12,109 --> 01:07:13,699 of ST. 1168 01:07:13,699 --> 01:07:17,819 This is always the [inaudible] of the same thing. [Inaudible] 1169 01:07:17,819 --> 01:07:22,799 of actions AT of the same thing. 1170 01:07:22,799 --> 01:07:25,139 STAT plus 1171 01:07:25,139 --> 01:07:32,139 your expected value of [inaudible] PSTAT, P-star, T 1172 01:07:35,859 --> 01:07:39,179 plus one, ST plus one. This thing on the right-hand side is always the same thing as 1173 01:07:39,179 --> 01:07:41,839 the thing we maximized 1174 01:07:41,839 --> 01:07:42,969 1175 01:07:42,969 --> 01:07:46,479 [inaudible]. So what this means is that when I said this a value of A to the maximize 1176 01:07:46,479 --> 01:07:50,379 of this. So what this means is that the optimal action to take from the state of ST 1177 01:07:50,379 --> 01:07:53,079 is actually equal to 1178 01:07:53,079 --> 01:07:55,869 LT 1179 01:07:55,869 --> 01:08:01,689 times ST. 1180 01:08:01,689 --> 01:08:03,859 What 1181 01:08:03,859 --> 01:08:05,919 was shown is that 1182 01:08:05,919 --> 01:08:08,129 when you're in some state, ST, the 1183 01:08:08,129 --> 01:08:13,599 optimal action for that state is going to be some matrix, LT, which can compute, 1184 01:08:13,599 --> 01:08:14,229 times 1185 01:08:14,229 --> 01:08:16,859 the state, 1186 01:08:16,859 --> 01:08:18,079 ST. 1187 01:08:18,079 --> 01:08:23,549 In other words, the optimal action is actually a linear function of the state. 1188 01:08:23,549 --> 01:08:26,839 I'm just going to point out, this is not a function of approximation here, right. 1189 01:08:26,839 --> 01:08:31,739 What we did not do, we did not say, 1190 01:08:31,739 --> 01:08:35,389 let's find the optimal linear policy. We didn't say, let's look at the optimal 1191 01:08:35,389 --> 01:08:38,359 policy, and then we'll fit this straight line to the optimal policy. This is not 1192 01:08:38,359 --> 01:08:42,089 about approximating the optimal policy with a straight line. 1193 01:08:42,089 --> 01:08:46,549 This derivation is saying that the optimal policy is a straight line. The 1194 01:08:46,549 --> 01:08:53,549 optimal action is a linear function of the current 1195 01:08:54,239 --> 01:08:57,359 state. 1196 01:08:57,359 --> 01:09:01,319 Moreover, when you've worked out this is a value for AT 1197 01:09:01,319 --> 01:09:02,310 1198 01:09:02,310 --> 01:09:05,370 that maximizes this thing on 1199 01:09:05,370 --> 01:09:10,009 the right-hand side. So you take this and plug it back in to do the dynamic programming recursion. 1200 01:09:10,009 --> 01:09:17,009 What you find is that - 1201 01:09:19,719 --> 01:09:24,189 so you take AT and plug it back in to do the maximization. It will 1202 01:09:24,189 --> 01:09:26,319 actually get 1203 01:09:26,319 --> 01:09:28,279 you this formula, 1204 01:09:28,279 --> 01:09:30,560 so V star 1205 01:09:30,560 --> 01:09:35,049 TST. So you find that 1206 01:09:35,049 --> 01:09:38,770 it will indeed be a quadratic function like this 1207 01:09:38,770 --> 01:09:41,799 of the following form where 1208 01:09:41,799 --> 01:09:45,540 - and I just write out the equations for the sake of 1209 01:09:45,540 --> 01:09:52,540 completeness. Don't worry too much about their forms. You can 1210 01:09:57,369 --> 01:10:03,630 derive this yourself. 1211 01:10:03,630 --> 01:10:10,630 1212 01:10:15,929 --> 01:10:22,929 1213 01:10:25,039 --> 01:10:28,789 1214 01:10:28,789 --> 01:10:35,789 1215 01:10:48,130 --> 01:10:52,190 So just to summarize, don't worry too much about the forms of these 1216 01:10:52,190 --> 01:10:54,429 equations. 1217 01:10:54,429 --> 01:10:57,899 What we've done is written down the recursion to the expressor 1218 01:10:57,899 --> 01:10:59,820 phi T and psi T 1219 01:10:59,820 --> 01:11:02,329 as a function of phi T plus one 1220 01:11:02,329 --> 01:11:04,129 and psi T plus one. 1221 01:11:04,129 --> 01:11:05,090 So 1222 01:11:05,090 --> 01:11:08,340 this allows you to compute the optimal value function 1223 01:11:08,340 --> 01:11:13,579 for when the clock is at time lowercase T, as a function of the optimal value 1224 01:11:13,579 --> 01:11:17,699 function for when the clock is at time T plus one. 1225 01:11:17,699 --> 01:11:20,019 1226 01:11:20,019 --> 01:11:25,480 So 1227 01:11:25,480 --> 01:11:28,260 to summarize, 1228 01:11:28,260 --> 01:11:32,659 GSELQG here's a finite horizon of 1229 01:11:32,659 --> 01:11:35,920 - actually, just to give this equation a name as well. This 1230 01:11:35,920 --> 01:11:40,370 recursion, in terms of the phi Ts, this is called the 1231 01:11:40,370 --> 01:11:44,909 discrete 1232 01:11:44,909 --> 01:11:51,909 time Bacardi equation. [Inaudible] 1233 01:11:54,099 --> 01:11:57,609 recursion that gives you phi 1234 01:11:57,609 --> 01:12:02,420 T in terms of phi T plus one. 1235 01:12:02,420 --> 01:12:05,130 So 1236 01:12:05,130 --> 01:12:10,969 to summarize, 1237 01:12:10,969 --> 01:12:12,269 1238 01:12:12,269 --> 01:12:17,499 our algorithm for finding the exact solution to finite horizon LQR 1239 01:12:17,499 --> 01:12:22,949 problems is as follows. We initialize phi T 1240 01:12:22,949 --> 01:12:24,340 to 1241 01:12:24,340 --> 01:12:26,840 be equal to minus 1242 01:12:26,840 --> 01:12:29,020 1243 01:12:29,020 --> 01:12:30,449 1244 01:12:30,449 --> 01:12:32,699 UT 1245 01:12:32,699 --> 01:12:36,559 and psi T to be equal to zero. 1246 01:12:36,559 --> 01:12:38,709 Then 1247 01:12:38,709 --> 01:12:42,679 recursively, 1248 01:12:42,679 --> 01:12:45,080 calculate 1249 01:12:45,080 --> 01:12:47,179 phi T 1250 01:12:47,179 --> 01:12:49,260 and psi 1251 01:12:49,260 --> 01:12:50,309 T 1252 01:12:50,309 --> 01:12:53,820 as a function of phi T plus one 1253 01:12:53,820 --> 01:12:56,360 and psi T plus 1254 01:12:56,360 --> 01:13:00,429 one with the discrete time 1255 01:13:00,429 --> 01:13:01,769 - 1256 01:13:01,769 --> 01:13:06,030 actually, excuse me. So 1257 01:13:06,030 --> 01:13:08,590 recursively calculate phi T and psi 1258 01:13:08,590 --> 01:13:11,349 T as a function of phi T plus one and psi T plus one, as I showed, 1259 01:13:11,349 --> 01:13:14,730 using the discrete time Bacardi equation. 1260 01:13:14,730 --> 01:13:17,940 So you do this for 1261 01:13:17,940 --> 01:13:19,199 T equals 1262 01:13:19,199 --> 01:13:22,429 T minus one, T minus two 1263 01:13:22,429 --> 01:13:28,579 and so on, down to time zero. Then you 1264 01:13:28,579 --> 01:13:32,519 compute 1265 01:13:32,519 --> 01:13:37,429 LT as a function of - actually, is it phi T or phi T 1266 01:13:37,429 --> 01:13:39,599 plus one? Phi T 1267 01:13:39,599 --> 01:13:41,929 plus one, I think. 1268 01:13:41,929 --> 01:13:46,260 As a function of phi T plus one and psi T plus one. 1269 01:13:46,260 --> 01:13:50,199 This is actually a function of only phi T plus one. You don't really need psi T 1270 01:13:50,199 --> 01:13:53,110 plus one. 1271 01:13:53,110 --> 01:13:59,819 Now you have your optimal policy. 1272 01:13:59,819 --> 01:14:03,099 So having computed the LTs, 1273 01:14:03,099 --> 01:14:05,939 you now have the 1274 01:14:05,939 --> 01:14:08,149 optimal action to take in the state 1275 01:14:08,149 --> 01:14:15,149 ST, just given by 1276 01:14:24,530 --> 01:14:29,419 this 1277 01:14:29,419 --> 01:14:34,979 linear equation. 1278 01:14:34,979 --> 01:14:38,499 How much time do I have left? Okay. Let me just say one last thing about this before I 1279 01:14:38,499 --> 01:14:44,479 close. 1280 01:14:44,479 --> 01:14:48,449 Maybe I'll do it next week. I think I'll do it next session 1281 01:14:48,449 --> 01:14:52,179 instead. So it actually turns out there's one cool property about this that's kind of that is 1282 01:14:52,179 --> 01:14:54,750 kind of subtle, but you'll find it out in the next lecture. 1283 01:14:54,750 --> 01:15:01,750 Are there question about this before we close for today, then? So 1284 01:15:04,829 --> 01:15:07,309 the very cool thing about 1285 01:15:07,309 --> 01:15:12,579 the solution of discrete time LQR problems - finite horizon LQR 1286 01:15:12,579 --> 01:15:17,280 problems is that this is a problem in an infinite state, with a continuous state. 1287 01:15:17,280 --> 01:15:21,059 But nonetheless, under the assumptions we made, you can prove that the 1288 01:15:21,059 --> 01:15:24,749 value function is a quadratic function of the state. 1289 01:15:24,749 --> 01:15:29,089 Therefore, just by computing these matrixes phi T and the real numbers 1290 01:15:29,089 --> 01:15:33,010 psi T, you can actually exactly represent the value function, even for 1291 01:15:33,010 --> 01:15:37,489 these infinitely large state spaces, even for continuous state spaces. 1292 01:15:37,489 --> 01:15:42,599 So the computation of these algorithms scales only like the cube, 1293 01:15:42,599 --> 01:15:46,499 scales only as a polynomial in terms of the number of state variables 1294 01:15:46,499 --> 01:15:49,959 whereas in [inaudible] dimensionality problems, with [inaudible], we 1295 01:15:49,959 --> 01:15:53,099 had algorithms of a scale exponentially dimensional problem. 1296 01:15:53,099 --> 01:15:55,420 Whereas LQR scales only are like the cube 1297 01:15:55,420 --> 01:16:00,699 of the dimension of the problem. So this easily 1298 01:16:00,699 --> 01:16:03,960 applies to problems with even very large state spaces. So we actually often apply 1299 01:16:03,960 --> 01:16:06,049 variations of this algorithm 1300 01:16:06,049 --> 01:16:09,159 to some subset, to some particular subset for the things we do on our 1301 01:16:09,159 --> 01:16:13,010 helicopter, which has high dimensional state spaces, with twelve or higher 1302 01:16:13,010 --> 01:16:13,969 dimensions. 1303 01:16:13,969 --> 01:16:15,989 This has worked very well for that. 1304 01:16:15,989 --> 01:16:17,229 So 1305 01:16:17,229 --> 01:16:21,280 it turns out there are even more things you can do with this, and I'll continue with 1306 01:16:21,280 --> 01:16:23,519 that in the next lecture. So let's close for today, then.