This subtitle is provided from lecsub.jimdo.com under Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. So I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and so we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in the Skilling Auditorium, in Skilling 193, at 3:45. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual also.

Okay, welcome back. What I want to do today is continue our discussion on reinforcement learning in MDPs. It's quite a lot for me to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. So I'll talk just very briefly about discretization. I'll spend a lot of time talking about models, or simulators, of MDPs, and then talk about one algorithm called fitted value iteration, and two functions which build on that, and then hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap: in the previous lecture, I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation. I said that an MDP, or Markov decision process, was a 5-tuple (S, A, {P_sa}, γ, R), comprising those things, and the running example I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP that I was using, it had 11 states, so that's what S was. The actions were compass directions: north, south, east and west.
The state transition probabilities P_sa capture the chance of transitioning to each state when you take a given action in a given state, and so in our example that captured the stochastic dynamics of our robot wandering around the grid, and we said if you take the action north, you have a 0.8 chance of actually going north, a 0.1 chance of veering off to the left, and a 0.1 chance of veering off to the right; so that was the model of the robot's noisy dynamics with a [inaudible]. And the reward function was +1 or -1 at the absorbing states and -0.02 elsewhere. This is an example of an MDP, and that's what these five things were. Oh, and I used a discount factor γ, usually a number slightly less than one, so that's the 0.99.

And so our goal was to find the policy, the control policy, and that's π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined value functions V^π(s) to be equal to this. We said that the value of a policy π from a state s was given by the expected value of the sum of discounted rewards, conditioned on your executing the policy π and your starting off, your initial state, being the state s.

And so our strategy for finding the policy was sort of comprised of two steps. So the goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and I said last time that one strategy for finding the optimal policy is to first compute the optimal value function, which I denoted V*(s) and which is defined like that. It's the maximum value that any policy can obtain, and for example, the optimal value function for that MDP looks like this.
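In symbols, the quantities just referred to are roughly the following (standard MDP notation; s_0, s_1, s_2, ... is the state sequence you get by starting in s_0 = s and executing π):

```latex
V^{\pi}(s) = \mathbb{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \,\middle|\, s_0 = s,\ \pi \right],
\qquad
V^{*}(s) = \max_{\pi} V^{\pi}(s),
\qquad
\pi^{*}(s) = \arg\max_{a \in A} \sum_{s'} P_{sa}(s')\, V^{*}(s').
```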
So in other words, starting from any of these states, what's the expected value of the sum of discounted rewards you get; so this is V*. We also said that once you've found V*, you can compute the optimal policy using this.

And so, once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*(s), the optimal sum of discounted rewards you can get starting from state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max over all the actions you could take of your expected future sum of discounted rewards, your future payoff starting from the state s', which is where you might transition to after one step; that is, V*(s) = R(s) + γ max_a Σ_{s'} P_sa(s') V*(s').

And so this gave us a value iteration algorithm, which was essentially V.I. (I'm abbreviating value iteration as V.I.), so in the value iteration algorithm, in V.I., you just take Bellman's equations and you repeatedly do this. So initialize some guess of the value function, initializing it to zero as the starting guess, and then repeatedly perform this update, V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s'), for all states, and I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), from which you can compute the optimal policy π*.

Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So let's see: given a random initial policy, we'll solve for V^π; we'll solve for the value function for that specific policy. So this means, for every state, compute the expected sum of discounted rewards you get if you execute the policy π
from that state. And then the other step of policy iteration is: having found the value function for your policy, you then update the policy, pretending that you've already found the optimal value function V*, and then you repeatedly perform these two steps, where you solve for the value function for your current policy and then pretend that that's actually the optimal value function and solve for the policy given the value function; and you repeatedly update the value function, or update the policy using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and this will cause π to converge to π*, the optimal policy.

So that was based on our last lecture on MDPs, which introduced a lot of new notation and symbols, and that just summarizes all of it again. What I'm about to do now, what I'm about to do for the rest of today's lecture, is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to go on, please. Yeah.

Student: [Inaudible] how those two algorithms are very different?

Instructor (Andrew Ng): I see, right, so yeah, do you see that they're different? Okay, how it's different. Let's see. So, well, here's one difference. I didn't say this because we won't really use it today. So value iteration and policy iteration are different algorithms. In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy, and so you're given some fixed policy π, meaning some function mapping from the states to the actions. So I give you some policy, and whatever; that's just some policy, it's not a great policy. And in that step that I circled, we have to find V^π
of s, which means that for every state you need to compute your expected sum of discounted rewards, if you execute this specific policy, starting off the MDP in that state s.

So I showed this last time; I won't go into the details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There was a form of Bellman's equations for V^π, and it turned out that if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function for a fixed policy by solving a system of linear equations with 11 variables and 11 constraints. And so that's policy iteration; whereas in value iteration, going back to this board, in value iteration you sort of repeatedly perform this update where you update the value of a state as the [inaudible]. So I hope that makes sense, that these algorithms are different.

Student: [Inaudible] on the absorbing states, so is the assumption that we can never get out of those states?

Instructor (Andrew Ng): Yes. There's always things that you, where you solve for this [inaudible], for example, and make the numbers come up nicely, but I don't wanna spend too much time on them. But yeah, so the assumption is that once you enter an absorbing state, then the world ends, or there are no more rewards after that. And another way to think of the absorbing states, which is sort of mathematically equivalent, is that you can think of the absorbing states as transitioning with probability 1 to some 12th state, and then once you're in that 12th state, you always remain in that 12th state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the 12th state is this zero-cost absorbing state that you get stuck in forever.

Other questions? Yeah, please go.

Student: Where did the Bellman's equations [inaudible] to optimal value [inaudible]?
Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equations, this equation that I'm pointing to, I sort of tried to give a justification for it last time. I'll say it in one sentence: the expected total payoff I expect to get from the state s is equal to my immediate reward, which is the reward I get for starting in the state. Let's see. If I start in the state, I'm gonna get some first reward, and then I can transition to some other state, and then from that other state I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces. It says the value of a state is equal to the reward you get right away plus the rest; really, V*(s) is equal to R(s) + γ times the rewards you get in the future, and that second piece, the future rewards, turns out to be equal to that second term, the max over a of Σ_{s'} P_sa(s') V*(s').

I spent more time justifying this in the previous lecture, although, yeah, hopefully for the purposes of this lecture, if you're not sure where this came from, if you don't remember the justification of it, why don't you just maybe take my word for it that this equation holds true, since I use it a little bit later as well; and the lecture notes explain a little further the justification for why this equation holds true. But for now, yeah, just take my word for it that this holds true, 'cause we'll use it a little bit later today as well.

Student: [Inaudible] and would it be, in sort of, turned back into [inaudible]?

Instructor (Andrew Ng): Actually, [inaudible], right, the question is: in policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration? And the answer is sort of no. Let's see.
It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but yeah, it actually turns out that, oh, no, the algorithms are not equivalent. It's just that in policy iteration there is a step where you're solving for the value function for the policy, solving for V^π; usually you can do this, for instance, by solving a linear system of equations. Value iteration is a different algorithm, yes. I hope it makes sense that, at least cosmetically, it's different.

Student: [Inaudible] you have [inaudible] representing π implicitly, then you won't have to solve that [inaudible] equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see, solving for V^π like this works only if you have a fixed policy; so once you change the value function, if π changes as well, then it's sort of hard to solve this. Yeah, so later on we'll actually talk about some examples of when π is implicitly represented, but at least for now, I think, yeah, maybe there's a way to redefine something so it maps onto value iteration, but that's not usually done. These are viewed as different algorithms.

Okay, cool, so all good questions. Let me move on and talk about how to generalize these ideas to continuous states.

Everything we've done so far has been for discrete states, or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function V(s), could then be represented using an array of 11 numbers, 'cause if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.
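As a concrete illustration of the two algorithms just recapped on a finite-state MDP like this 11-state grid, here is a minimal sketch (my own illustration, not code from the lecture); it assumes the transition matrices P[a] and the reward vector R are given as NumPy arrays, and the policy-evaluation step solves the 11-equations-in-11-unknowns linear system mentioned above.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=1000):
    """Discrete value iteration.
    P[a] is an (n x n) transition matrix for action a; R is a length-n reward vector.
    Returns V, an array of n numbers approximating V*."""
    n = len(R)
    V = np.zeros(n)                                          # initialize V(s) := 0 for all s
    for _ in range(iters):
        # Bellman backup: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
        V = R + gamma * np.max([P[a] @ V for a in range(len(P))], axis=0)
    return V

def evaluate_policy(P, R, gamma, pi):
    """Policy evaluation: solve (I - gamma * P_pi) V = R exactly, i.e. the
    linear system with one equation per state (11 of them in the grid example)."""
    n = len(R)
    P_pi = np.array([P[pi[s]][s] for s in range(n)])          # row s is P_{s, pi(s)}(.)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def greedy_policy(P, V):
    """pi(s) := argmax_a sum_{s'} P_sa(s') V(s'), the policy-improvement step."""
    return np.argmax([P[a] @ V for a in range(len(P))], axis=0)
```

Policy iteration would alternate `evaluate_policy` and `greedy_policy` until the policy stops changing, whereas value iteration just repeats the backup in `value_iteration`.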
What I want to do for most of today is talk about continuous states. So for example, if you want to control any number of real systems, for example if you want to control a car, a car's position is given by x, y, θ: its position and orientation. And if you want to model the velocity as well, then also ẋ, ẏ, θ̇; so it depends on whether you want to model just the kinematics, so just position, or whether you want to model the dynamics, meaning the velocity as well.

Earlier I showed you a video of a helicopter that was flying, using a reinforcement learning algorithm. For the helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state will be given by the x, y, z position, and by φ, θ, ψ, which are sometimes used to denote the roll, pitch and yaw of the helicopter, so just orientation; and if you want to control a helicopter, you pretty much have to model velocity as well, which means both linear velocity as well as angular velocity, and so this would be a 12-dimensional state.

If you want an example that is kind of fun but unusual, and I'm just gonna use this as an example, and actually use this example a bit in today's lecture, it is the inverted pendulum problem, which is sort of a long-running classic in reinforcement learning, in which you imagine that you have a little cart that's on a rail. The rail ends at some point, and you imagine that you have a pole attached to the cart, and this is a free hinge, so the pole here can rotate freely, and your goal is to control the cart and to move it back and forth on this rail so as to keep the pole balanced.
Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here?

Student: Back in the corner.

Instructor (Andrew Ng): Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine this is the cart just moving back and forth to try to keep the pole balanced; so you can actually use a reinforcement learning algorithm to do that. This is actually one of the longstanding classic problems that people [inaudible] implement and play with using reinforcement learning algorithms. And so for this, the states would be x and θ, so x would be the position of the cart and θ would be the orientation of the pole, and also the linear velocity and the angular velocity of the pole, so (x, ẋ, θ, θ̇). I'll actually use this example a couple of times.

So, returning to continuous state spaces: how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or something like the inverted pendulum? One thing you can do, and this is maybe the most straightforward thing, is the following. If you have, say, a two-dimensional continuous state space, where s1 and s2 are my state variables (and in all the examples these are, I guess, between 4-dimensional and 12-dimensional; I'll just draw 2-D here), the most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. And I use s̄ to denote the discretized, or discrete, states, and so you can approximate this continuous state problem with a finite or discrete set of states, and then you can use policy iteration or value iteration to solve for V(s̄).
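As a rough sketch of what that discretization looks like in code (again my own illustration, with made-up bounds and bucket counts): chop each state variable into k buckets, and let a table indexed by the grid cell play the role of V(s̄) and π(s̄).

```python
import numpy as np

# Hypothetical bounds for a 2-D state (s1, s2) and number of buckets per dimension.
S_MIN = np.array([-1.0, -1.0])
S_MAX = np.array([ 1.0,  1.0])
K = 10                                       # k buckets per state variable -> k^2 cells

def discretize(s):
    """Map a continuous state s to the index of its grid cell s_bar."""
    frac = (np.asarray(s) - S_MIN) / (S_MAX - S_MIN)      # rescale to [0, 1]
    idx = np.clip((frac * K).astype(int), 0, K - 1)       # bucket index per dimension
    return tuple(idx)

# V_bar[s_bar] and pi_bar[s_bar] would be filled in by running ordinary
# (discrete) value iteration or policy iteration over the k^2 cells.
V_bar = np.zeros((K, K))
pi_bar = np.zeros((K, K), dtype=int)

s = np.array([0.23, -0.71])            # the robot's actual continuous state
a = pi_bar[discretize(s)]              # execute the action the discrete policy assigns to that cell
```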
And if your robot is then in some state, given by that dot, you would then figure out what discretized state it is in; in this case, it's in this discretized grid cell, which is called s̄, and then you execute, you choose, the action given by the policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem.

Sometimes you can sort of make this work, but there are a couple of reasons why this does not work very well. One reason is the following, and for this picture, let's even temporarily put aside reinforcement learning; let's just think about doing regression for now. So suppose you have some input variable x, and suppose I have some data, and I want to fit a function, y as a function of x. Discretization is saying that I'm going to take my horizontal x-axis and chop it up into a number of intervals; sometimes I call these intervals buckets as well. We chop my horizontal x-axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of these buckets. And just look at this: this is clearly not a very good representation, right? Whereas when we talk about regression, you just choose some features of x and run linear regression or something, and you get a much better fit to the function. So there's the sense that discretization, this sort of piecewise constant function, just isn't a very good way of representing many things, and there's also the sense that there's no smoothing, there's no generalization, across the different buckets. And in fact, back in regression, I would never have chosen to do regression using this sort of discretization; it just really doesn't make sense.
And so in the same way, with the state here instead of x, and the value function instead of some hypothesis function of x, if you're trying to approximate the value function, then you can get discretization to work for many problems, but maybe this isn't the best representation for a value function.

The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. And this is just the observation that if the state space is in R^n, and if you discretize each state variable into k buckets, so if you discretize each variable into k discrete values, then you get on the order of k to the power of n discrete states. In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe like 100 to the power of 12, which is just huge, and it's not feasible. And so discretization doesn't scale well at all to problems with high-dimensional state spaces.

And this observation actually applies more generally than to just robotics and continuous state problems. For example, another fairly well-known application of reinforcement learning has been to factory automation. If you imagine that you have 20 machines sitting in a factory, and the machines lie in an assembly line and they all do something to a part on the assembly line and then route the part onto a different machine, you may want to use reinforcement learning algorithms to choose the order in which the different machines operate on the different things that are flowing through your assembly line, and maybe different machines can do different things.
So if you have N machines and each machine can be in k states, then if you do this sort of discretization, the total number of states would be k to the N as well: if you have N machines and each machine can be in k states, then again you get this huge number of states.

Another well-known example would be a board game; say you want to use reinforcement learning to play chess. Then if you have N pieces in your board game, you have N pieces on the chessboard, and if each piece can be in k positions, then this is again sort of the curse-of-dimensionality thing, where the number of discrete states you end up with grows exponentially with the number of pieces in your board game.

So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are [inaudible] that you need to fiddle around with the discretization and do things like non-uniform grids. So, for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1, because I think the value function is much more sensitive to the value of state variable s2 than to s1. And so as you get into higher dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on.

But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it works fine.
With 4-dimensional problems, you can probably get it to work, but it'd be slightly challenging, and sometimes, by fooling around and being clever, you can often push discretization up to, let's say, about 6-dimensional problems, but with some difficulty; and problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just rough folk wisdom on the order of magnitude of problems you might think about using discretization for.

But what I want to spend most of today talking about are [inaudible] methods that often work much better than discretization, and in which we will approximate V* directly, without resorting to these sorts of discretizations.

Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup. For today's lecture, I'm going to focus on the problem of continuous states, and just to keep things very simple in this lecture, I won't deal with continuous actions; so I'm gonna assume discrete actions A. And it turns out, this is actually a critical fact, that for many problems the state space is much larger than the space of actions. That just seems to have worked out that way for many problems. So, for example, for driving a car the state space is 6-dimensional, so x, y, θ, ẋ, ẏ, θ̇, whereas for your actions you still have two action dimensions: you have forward and backward motion, and steering the car. So you have 6-D states and 2-D actions, and so you can discretize the actions much more easily than you can discretize the states. The other example: for a helicopter, you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize the continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action.
Whether you accelerate your cart to the left or the right is a one-D action. And so for the rest of today, I'm gonna assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems, but for many problems, large action spaces are just less of a difficulty than large state spaces.

So I'm going to assume that we have a model, or simulator, of the MDP, and this is really an assumption on how the state transition probabilities are represented. I'm gonna assume, and I'm going to use the terms model and simulator pretty much synonymously, so specifically what I'm going to assume is that we have a black box, a piece of code, such that I can input any state, input an action, and it will output s', a sample from the state transition distribution P_sa. So this is really my assumption on the representation I have for the state transition probabilities: I'll assume I have a black box that takes as input the state and action and outputs the next state.

And so there are a few fairly common ways to get models of the different MDPs you may work with. One is that you might get a model from a physics simulator. So, for example, if you're interested in controlling that inverted pendulum, your action is a, which is the magnitude of the force you exert on the cart to push it left or right, and your state is (x, ẋ, θ, θ̇). I'm just gonna write it in that order.
And so I'm gonna write down a bunch of equations just for completeness, but most of what I'm gonna write below here is a bit gratuitous; the point is just that if you flip open a textbook on physics, a textbook on mechanics, you can work out the equations of motion of a physical device like this. So you find that ṡ, where the dot denotes the derivative of the state with respect to time, is given by some expression in terms of ẋ, θ, θ̇ and the action a, with terms involving the length of the pole and cosines of θ and so on; here a, the force you exert on the cart, is the action, l is the length of the pole, m is the total mass of the system, and so on. So all these equations, I'm just writing them down for completeness, but by flipping open a physics textbook you can work out these equations of motion yourself, and this then gives you a model which says that s_{t+1}, your state one time step later, will be equal to your previous state plus Δt times ṡ; so in your simulator, or in my model, this is what happens to the cart every, say, tenth of a second, so Δt would be a tenth of a second, and then s_{t+1} = s_t + Δt · ṡ_t.

And so that'd be one way to come up with a model of your MDP. And in this specific example, I've actually written down a deterministic model, and by deterministic I mean that given an action and a state, the next state is not random. So this would be an example of a deterministic model, where I can compute the next state exactly as a function of the previous state and the previous action; or, put another way, it's a deterministic model because all the probability mass is on a single state given the previous state and action. You can also make this a stochastic model.
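Here is a minimal sketch of what such a physics-based simulator looks like as a piece of code; the dynamics function below is a made-up placeholder standing in for whatever equations of motion you work out from the mechanics textbook, not the actual cart-pole equations:

```python
import numpy as np

DT = 0.1   # time step Delta-t, e.g. a tenth of a second

def s_dot(s, a):
    """Placeholder for the textbook equations of motion: returns the time
    derivative of the state (x, x_dot, theta, theta_dot) given the force a."""
    x, x_dot, theta, theta_dot = s
    # The real right-hand side would involve the pole length, the masses,
    # cos(theta), sin(theta), etc.; this is made-up dynamics for illustration only.
    return np.array([x_dot, a, theta_dot, -theta])

def simulate_step(s, a, noise_scale=0.0):
    """Deterministic model: s_{t+1} = s_t + Delta-t * s_dot(s_t, a_t).
    Passing noise_scale > 0 turns it into a stochastic model."""
    s_next = s + DT * s_dot(s, a)
    if noise_scale > 0:
        s_next = s_next + noise_scale * np.random.randn(*s.shape)
    return s_next
```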
A second way that is often used to obtain a model is to learn one. And so, just concretely, what you do is: you would imagine that you have a physical inverted pendulum system, that you physically own an inverted pendulum robot. What you would do is you would then initialize your inverted pendulum robot to some state and then execute some policy; it could be some random policy, or some policy that you think is pretty good, or you could even try controlling it yourself with a joystick or something. So you set the system off in some state s_0; then you take some action a_0, and again the action could be chosen by some policy, or chosen by you using a joystick trying to control your inverted pendulum, or whatever. The system would transition to some new state s_1, and then you take some new action a_1, and so on. Let's say you do this for T time steps, and sometimes I call this one trajectory; and you repeat this m times, so this is the first trial, the first trajectory, and then you do this again: initialize it in some state, and so on.

So you do this a bunch of times, and then you would run a learning algorithm to estimate s_{t+1} as a function of s_t and a_t. And for the sake of concreteness, you should just think of this as the inverted pendulum problem, so s_{t+1} is a 4-dimensional vector, s_t will be a 4-dimensional vector and a_t will be a real number, and so you might run linear regression 4 times, to predict each of the state variables as a function of each of these 5 real numbers, and so on.

Just for example, if you say that you want to estimate your next state s_{t+1} as a linear function of your previous state and your action, so s_{t+1} ≈ A·s_t + B·a_t, where A here will be a 4-by-4 matrix and B would be a 4-dimensional vector, then you would choose the values of A and B that minimize this.
757 00:42:07,489 --> 00:42:08,089 758 00:42:08,089 --> 00:42:15,089 759 00:42:18,229 --> 00:42:25,229 760 00:42:28,689 --> 00:42:31,169 761 00:42:31,169 --> 00:42:32,269 762 00:42:32,269 --> 00:42:35,220 So if you want your model to be that 763 00:42:35,220 --> 00:42:38,849 ST+1 is some linear function of the previous stated action, 764 00:42:38,849 --> 00:42:39,960 then you might 765 00:42:39,960 --> 00:42:41,950 pose this optimization objective 766 00:42:41,950 --> 00:42:45,130 and choose A and B to minimize the sum of squares error 767 00:42:45,130 --> 00:42:52,130 in your predictive value for ST+1 as the linear function of ST 768 00:42:52,489 --> 00:42:56,889 and AT. I 769 00:42:56,889 --> 00:42:58,119 should say that 770 00:42:58,119 --> 00:43:02,839 this is one specific example where you're using a linear function of the previous 771 00:43:02,839 --> 00:43:06,399 stated action to predict the next state. Of course, you can also use other algorithms 772 00:43:06,399 --> 00:43:08,469 like low [inaudible] weight to linear regression 773 00:43:08,469 --> 00:43:12,079 or linear regression with nonlinear features or kernel linear 774 00:43:12,079 --> 00:43:13,119 regression or whatever 775 00:43:13,119 --> 00:43:16,380 to predict the next state as a nonlinear function of the current state as well, so this 776 00:43:16,380 --> 00:43:23,380 is just [inaudible] linear problems. 777 00:43:25,469 --> 00:43:27,820 And it turns out that low [inaudible] weight to linear 778 00:43:27,820 --> 00:43:28,770 regression is 779 00:43:28,770 --> 00:43:31,990 for many robots turns out to be an effective method for this learning 780 00:43:31,990 --> 00:43:38,990 problem as well. 781 00:43:47,279 --> 00:43:51,919 And so having learned to model, having learned the parameters A 782 00:43:51,919 --> 00:43:52,680 and B, 783 00:43:52,680 --> 00:43:58,979 you then have a model where you say that ST+1 is AST 784 00:43:58,979 --> 00:44:05,359 plus BAT, and so that would be an example of a deterministic 785 00:44:05,359 --> 00:44:06,219 model 786 00:44:06,219 --> 00:44:07,949 787 00:44:07,949 --> 00:44:10,749 or having learned the parameters A and B, 788 00:44:10,749 --> 00:44:13,259 you might say that ST+1 is 789 00:44:13,259 --> 00:44:15,229 equal to AST + 790 00:44:15,229 --> 00:44:17,440 BAT 791 00:44:17,440 --> 00:44:24,039 792 00:44:24,039 --> 00:44:25,979 + 793 00:44:25,979 --> 00:44:28,249 ?T. 794 00:44:28,249 --> 00:44:29,919 And so these would be 795 00:44:29,919 --> 00:44:32,699 very reasonable ways to come up with either 796 00:44:32,699 --> 00:44:37,069 a deterministic or a stochastic model for your 797 00:44:37,069 --> 00:44:39,329 inverted pendulum MDP. 798 00:44:39,329 --> 00:44:41,519 And so 799 00:44:41,519 --> 00:44:43,549 just to summarize, 800 00:44:43,549 --> 00:44:48,239 what we have now is a model, meaning a piece of code, 801 00:44:48,239 --> 00:44:53,459 where you can input a state and an action 802 00:44:53,459 --> 00:44:55,079 and get an ST+1. 803 00:44:55,079 --> 00:44:57,969 And so if you have a stochastic model, 804 00:44:57,969 --> 00:45:02,419 then to influence this model, you would actually sample ?T from this [inaudible] 805 00:45:02,419 --> 00:45:03,039 distribution 806 00:45:03,039 --> 00:45:10,039 in order to generate ST+1. 
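Here is a minimal sketch of that model-fitting step (my own illustration; it assumes the trajectories have already been stacked into arrays, and it solves the least-squares problem above in one shot by regressing s_{t+1} on the stacked regressor [s_t, a_t]):

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Fit s_{t+1} ~= A s_t + B a_t by least squares.
    states:      (N, 4) array of s_t from all trajectories
    actions:     (N, 1) array of a_t
    next_states: (N, 4) array of s_{t+1}
    Returns A (4x4), B (4x1), and Sigma, the covariance of the residuals."""
    X = np.hstack([states, actions])                        # regressor [s_t, a_t], shape (N, 5)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)     # W is (5, 4); next_states ~= X @ W
    A, B = W[:4].T, W[4:].T                                 # A is (4, 4), B is (4, 1)
    residuals = next_states - X @ W
    Sigma = np.cov(residuals.T)                             # noise covariance for a stochastic model
    return A, B, Sigma

def simulate(A, B, Sigma, s, a, stochastic=True):
    """One step of the learned model: s_{t+1} = A s_t + B a_t (+ epsilon_t)."""
    s_next = A @ s + B @ np.atleast_1d(a)
    if stochastic:
        s_next = s_next + np.random.multivariate_normal(np.zeros(len(s)), Sigma)
    return s_next
```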
So, as a preview, I guess, of the next lecture: it actually turns out that in the specific case of linear dynamical systems, in the specific case where the next state is a linear function of the current state and action, there are very powerful algorithms you can use. I'm actually not gonna talk about that today; I'll talk about that in the next lecture rather than right now. But it turns out that for many problems, like the inverted pendulum, you'd use locally weighted linear regression or some other nonlinear learning algorithm, because many systems aren't really linear, and so you can build a nonlinear model.

So what I wanna do now is talk about, given a model, given a simulator for your MDP, how to come up with an algorithm to approximate the optimal value function V*. Before I move on, let me check if there are questions on this. Okay, cool.

So here's the idea. Back when we talked about linear regression, we said that given some input x in supervised learning, given the input features x, we may choose some features of x and then approximate the target variable as a linear function of various features of x; and so we'll do exactly the same thing to approximate the optimal value function. In particular, we'll choose some features φ(s) of a state s. And you could actually choose φ(s) = s; that would be one reasonable choice, if you want to approximate the value function as a linear function of the state. But you can also choose other things, so for example, for the inverted pendulum example, you may choose φ(s) to be equal to a vector of features that might be, say, 1, or x, or ẋ², maybe some cross terms, maybe x times θ̇, maybe θ̇², and so on.
So you choose some vector of features and then approximate the value function: the value of the state is equal to θ transpose times the features, V(s) = θᵀφ(s). And I should apologize in advance, I'm overloading notation here, which is unfortunate: I use θ both to denote the angle of the pole of the inverted pendulum, the angle θ, and also to denote the vector of parameters in my learning algorithm. So sorry about the overloaded notation.

Just like we did in linear regression, my goal is to come up with a linear combination of the features that gives me a good approximation to the value function. This is completely analogous to linear regression, where we estimated the response y as a linear function of the features of the input; that's what we have in linear regression.

Let me just write down value iteration again, and then I'll write down an approximation to value iteration. For discrete states, the idea behind value iteration was that V(s) gets updated as R(s) plus γ times the max over actions of the sum over s' of P_sa(s') V(s'). That was value iteration, and in the case of continuous states, the sum over states would be replaced by an integral over states. Let me just write this as R(s) plus γ times the max over a of an expectation: that sum over s' is really an expectation, with respect to the random state s' drawn from the state transition probabilities P_sa, of V(s'). So this sum over all states s', weighted by the probability of going to s', is really an expectation over the random state s' drawn from P_sa.
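Restating the two updates on the board in a cleaner form, this is just the standard value iteration backup and its continuous-state analogue, written in LaTeX:

```latex
% Discrete-state value iteration backup
V(s) := R(s) + \gamma \max_{a} \sum_{s'} P_{sa}(s')\, V(s')

% Continuous-state analogue: the sum over s' becomes an expectation (an integral over states)
V(s) := R(s) + \gamma \max_{a}\; \mathbb{E}_{s' \sim P_{sa}}\!\big[ V(s') \big]
```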
And so what I'll do now is write down an algorithm called fitted value iteration that is an approximation to this, but specifically for continuous states. I just wrote down the first two steps, and then I'll continue on the next board. So the first step of the algorithm is that we'll sample: choose some set of states at random. So sample s_1, s_2 through s_m randomly, choose a set of states randomly, and initialize my parameter vector θ to be equal to zero. This is analogous to value iteration, where I might initialize the value function to be the function that is zero everywhere. Then here's the main loop of the algorithm. Got quite a lot to write, actually. Let's see. And so that's the algorithm. Let me just adjust the writing; give me a minute to finish and then I'll step through this. Actually, if some of my handwriting is illegible, let me know.

So let me step through this and briefly explain the rationale. The heart of the algorithm is this. In the original value iteration algorithm, we would take the value for each state, V(s_i), and we would overwrite it with this expression here; in the original discrete value iteration algorithm, we would set V(s_i) to be equal to that. Now, in the continuous state case, we have an infinite, continuous set of states, and so you can't discretely set the value of each of these. So what we'll do instead is choose the parameters θ so that V(s_i) is as close as possible to this thing on the right-hand side, and this is what y_i turns out to be. So concretely, what I'm going to do is construct estimates of this term, and then choose the parameters of my function approximator.
I'm going to choose my parameters θ so that V(s_i) is as close as possible to those estimates; that's what the y_i are. Specifically, I'll choose the parameters θ to minimize the sum of squared differences between y_i and θᵀφ(s_i). This thing here is just V(s_i), because I'm approximating V(s_i) as a linear function of φ(s_i), and so I choose the parameters θ to minimize the sum of squared differences. So this last step is basically the approximation version of value iteration. Everything else above was just coming up with an approximation to this term, to this thing here, which I was calling y_i.

And so, consequently, for every state s_i we want to estimate what the thing on the right-hand side is, but there's an expectation there, an expectation over a continuous, maybe very high-dimensional, set of states, so I can't compute this expectation exactly. What I'll do instead is use my simulator to sample a set of states from this distribution P_{s_i a}, the state transition distribution over where I get to if I take the action a in the state s_i, and then I'll average over that sample of states to compute this expectation.

And so, stepping through the algorithm: for each state and for each action, I'm going to sample a set of states s'_1 through s'_k from that state transition distribution, still using the model, and then set q(a) equal to that average, so this is my estimate for R(s_i) + γ times that expected value for that specific action a. Then I'll take the maximum over actions a, and this gives me y_i. And finally, I'll run linear regression, which is that last step of the algorithm, to get θᵀφ(s_i), that is V(s_i), to be close to the y_i's.
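Putting the loop together, here is a rough Python sketch of fitted value iteration as just described. The helpers simulate(s, a) (the learned model), phi(s), R(s), and the finite list of actions are placeholders standing in for whatever your particular problem supplies, and the values of k and the iteration count are arbitrary.

```python
import numpy as np

def fitted_value_iteration(sample_states, actions, simulate, R, phi,
                           gamma=0.99, k=10, n_iters=50):
    """Sketch of fitted value iteration for a continuous-state MDP.

    sample_states: the m randomly chosen states s_1, ..., s_m
    actions:       a finite (discretized) set of actions
    simulate(s,a): model/simulator returning a sampled next state s'
    R(s):          reward function;  phi(s): feature vector
    Approximates V(s) ~= theta^T phi(s).
    """
    m = len(sample_states)
    Phi = np.array([phi(s) for s in sample_states])   # (m, n_features) design matrix
    theta = np.zeros(Phi.shape[1])                    # initialize theta := 0

    for _ in range(n_iters):
        y = np.zeros(m)
        for i, s in enumerate(sample_states):
            q = []
            for a in actions:
                # Sample k next states to approximate E_{s' ~ P_{sa}}[V(s')];
                # with a deterministic simulator, k = 1 is enough.
                v_next = [theta @ phi(simulate(s, a)) for _ in range(k)]
                q.append(R(s) + gamma * np.mean(v_next))
            y[i] = max(q)                             # y_i = max_a q(a)
        # Supervised step: choose theta so that theta^T phi(s_i) is close to y_i
        theta, _, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```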
And so this algorithm is called fitted value iteration, and it actually often works quite well for continuous problems, for problems with anywhere from 6- to 10- to 20-dimensional state spaces, if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Some of you didn't have your hands up. Are there questions, yeah?

Student: Is there a reward function in this setup?

Instructor (Andrew Ng): Oh, yes. An MDP comprises S, A, the state transition probabilities, γ, and R, and so for continuous state spaces, S would be something like R⁴ for the inverted pendulum. The actions are discretized, the state transition probabilities are specified by the model or the simulator, γ is just a real number like 0.99, and the reward function is usually a function that's given to you. So the reward function is some function of your 4-dimensional state. For example, just as a simple reward function, you might choose the reward to be -1 if the pole has fallen; it depends how you define it, but say you choose the reward to be -1 if the inverted pendulum falls over, where you define that as its angle being greater than 30° or so, and zero otherwise. So that would be an example of a reward function you could choose for the inverted pendulum, but yes, we're assuming a reward function is given to you, so that you can compute R(s_i) for any state (see the short sketch after this exchange).

Are there other questions? Actually, let me try asking a question. Everything I did here assumed that we have a stochastic simulator. It turns out I can simplify this algorithm if I have a deterministic simulator, where a deterministic simulator means that, given a state and action, my next state is always exactly determined.
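Picking up the reward example from a moment ago before continuing: the simple inverted pendulum reward just described could look like the following, where the 30° threshold and the (x, ẋ, θ, θ̇) state layout are only the illustrative values from the discussion.

```python
import numpy as np

def R(s):
    """Example reward: -1 if the pole has fallen over (angle more than
    30 degrees from vertical), and 0 otherwise."""
    x, x_dot, theta, theta_dot = s
    return -1.0 if abs(theta) > np.deg2rad(30.0) else 0.0
```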
So let me ask you: if I have a deterministic simulator, how would I change this algorithm? How would I simplify it?

Student: Lower the number of samples that you're drawing.

Instructor (Andrew Ng): Right, Justin's got it right. If I have a deterministic simulator, all my samples would be exactly the same, and so I can set k equal to 1; I don't need to draw k different samples. I really only need one sample if I have a deterministic simulator, so you can simplify this by setting k = 1 if you have a deterministic simulator. Yeah?

Student: I guess I'm really confused. We sort of turned this into something that looks like linear regression, the θ transpose times something that we're used to, and I don't really know what question to ask, but when we did this before we had discrete states and everything, and we were concerned with finding the optimal policy, and it seems like we haven't said the word policy in a while, so it's kind of confusing.

Instructor (Andrew Ng): Okay, yeah, I'll come back to policies, but maybe I should say a couple of words now, so let me try to get at some of what you're asking. Our strategy for finding an optimal policy has been to first find V*, or some approximation of it, and from that derive an approximation of π*. So far everything I've been doing has been focused on how to find V*.

I just want to say one more word about features. It turns out that for linear regression it's often not terribly difficult to choose a reasonable set of features, whereas choosing features for approximating the value function is often somewhat harder, because the value of a state is how good it is to start off in that state: what is my expected sum of discounted rewards if I start in a certain state?
And so what the features of the state have to measure is really how good it is to start in a certain state. For the inverted pendulum, states where the pole is vertical and the cart is centered on the track are maybe better, and so you can come up with features that measure the orientation of the pole and how close you are to the center of the track and so on, and those will be reasonable features to use to approximate V*. Although, in general, it is true that choosing features for value function approximation is often slightly trickier than choosing good features for linear regression.

Okay, and then to Justin's question: given V*, how do you go back to actually find a policy? In the discrete case, we have that π*(s) is equal to the argmax over a of the expected value of V*(s'), where s' is drawn from P_sa. Again, I used to write this as a sum over states; I'll just write it as an expectation. And so once you find the optimal value function V*, you can then find the optimal policy π* by computing that argmax.

Now, if you're in a continuous state MDP, then you can't actually do this in advance for every single state, because there's an infinite number of states, so you can't perform this computation in advance for every single state. What you do instead is wait until your system is actually in some specific state s, like your car being at some particular position and orientation, or your inverted pendulum being posed at some specific angle θ.
It's only when your system, be it a factory or a board game or a robot, is in some specific state s that you then go ahead and compute this argmax; it's only when you're in some state s that you compute this argmax, and then you execute that action a. As a result of your action, your robot transitions to some new state, and it is then given that specific new state that you compute the argmax again, using the specific state you're in.

There are a few ways to do it. One way is actually the same as the inner loop of the fitted value iteration algorithm: because this is an expectation over a large number of states, you'd sample a set of states from the simulator and then approximate the expectation using an average over your samples, just as in the inner loop of fitted value iteration. So you could do that; that's sometimes done. But it can also be a pain to have to sample a set of states to approximate those expectations every time you want to take an action in your MDP.

There are a couple of special cases where this can be done more cheaply. One special case is if you have a deterministic simulator. If it's a deterministic simulator, so in other words your simulator is just some function f, which could be linear or nonlinear, then the next state s_{t+1} is just some function of your previous state and action. If that's the case, then this expectation simplifies: the argmax becomes the argmax over a of V*(f(s, a)), because this is really saying s' = f(s, a). I'm switching back and forth between notation, I hope that's okay: s to denote the current state and s' the next state, versus s_t and s_{t+1} for the current and next state. Both of these are fairly standard notation, so don't mind my switching back and forth between them.
But if it's a deterministic simulator, you can then just compute what the next state s' would be for each action you might take from the current state, and then take the argmax over actions a; basically, choose the action that gets you to the highest-value state. And so that's one case where you can compute the argmax, where you can compute that expectation, without needing to sample and average over a set of samples.

Another very common case is if you have a stochastic simulator, but your simulator happens to take on the very specific form s_{t+1} = f(s_t, a_t) + ε_t, where ε_t is Gaussian noise. This is a very common way to build simulators, where you model the next state as a function of the current state and action plus some noise; one specific example would be the sort of linear dynamical system we talked about, with the next state a linear function of the current state and action plus Gaussian noise. In this case, you can approximate the argmax over a as follows: take the expectation you're trying to approximate, the expected value of V* of s', and replace it with V* evaluated at the expected value of s'. This is an approximation; the expected value of a function is usually not equal to the function of the expectation, but it is often a reasonable approximation, and so that would be another way to approximate that expectation. And so you choose the actions according to the same formula as I wrote just now. And so this would be a way of approximating this argmax, essentially ignoring the noise in the simulator.
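Here is a sketch of the two ways of computing that argmax online: the general version that samples from the simulator to approximate the expectation, and the shortcut that ignores the zero-mean Gaussian noise and evaluates V* at the mean next state f(s, a). The names f, simulate, phi, theta, and the action list are assumptions carried over from the earlier sketches.

```python
import numpy as np

def act_by_sampling(s, actions, simulate, theta, phi, k=100):
    """pi(s) ~= argmax_a (1/k) sum_j theta^T phi(s'_j), with s'_j ~ P_sa
    sampled from a stochastic simulator."""
    def q(a):
        return np.mean([theta @ phi(simulate(s, a)) for _ in range(k)])
    return max(actions, key=q)

def act_ignoring_noise(s, actions, f, theta, phi):
    """If s_{t+1} = f(s, a) + eps_t with zero-mean Gaussian noise (or the simulator
    is deterministic), approximate E[V*(s')] by V*(f(s, a)) = theta^T phi(f(s, a))."""
    return max(actions, key=lambda a: theta @ phi(f(s, a)))
```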
And this often works pretty well, just because many simulators turn out to be of the form of some linear or nonlinear function plus zero-mean Gaussian noise, and so you can ignore the zero-mean Gaussian noise and compute this quickly. And just to complete this: that quantity, V* of f(s, a), you would write as θᵀφ(s') where s' = f(s, a), and this V* you would compute using the parameters θ that you just learned with the fitted value iteration algorithm.

Questions about this?

Student: For a real-time application, is it possible to sample like that, for example for [inaudible]?

Instructor (Andrew Ng): Yes, the question is whether, for real-time applications, it's possible to sample k states and use them to approximate the expectation. Computers today are actually amazingly fast; I'm often surprised by how much you can do in real time. Are we actually flying our helicopter using an algorithm different from this? I can't say for certain here, but my intuition is that you could actually do this with a helicopter. You would control a helicopter at somewhere between 10 Hz and 50 Hz, so you need to do this 10 to 50 times a second, and that's actually plenty of time to sample 1,000 states and compute this expectation. Helicopters are difficult because they're mission critical: if you do something wrong, you can do serious damage. And so, maybe not for good reasons, we've tended to avoid tossing coins while we're in the air; the idea of letting our actions come out of some random process is slightly scary, and we just tend not to do that.
I should say that's probably not a great reason, because when you average a large number of samples it works out fine, but as a maybe overly conservative design choice we tend not to run anything randomized onboard, which is probably being overly conservative. It's a choice we made because other things felt slightly safer. I think you can actually often do this, so long as the model can be evaluated fast enough that you can sample 100 or 1,000 state transitions and do that at 10 Hz. That said, this isn't always attainable, which is why we often use the other approximations that don't require drawing a large sample. Anything else? No? Okay, cool. So now you know one algorithm for reinforcement learning on continuous state spaces. Next time we'll pick up with some more ideas on even more powerful algorithms for solving MDPs with continuous state spaces. Thanks, let's close for today.