This presentation is delivered by the Stanford Center for Professional Development.

All right, so welcome back. What I want to do today is start a new chapter, a new discussion on machine learning, and in particular I want to talk about a different type of learning problem called reinforcement learning. So that's Markov Decision Processes, value functions, value iteration, and policy iteration. Both of these last two items are algorithms for solving reinforcement learning problems. As you can see, we're also taping in a different room today, so the background is a bit different.

So just to put this in context, the first of the four major topics we had in this class was supervised learning, and in supervised learning, we had a training set in which we were given sort of the right answer for every training example, and it was then just the job of the learning algorithm to replicate more of the right answers. Then there was learning theory, and then we talked about unsupervised learning, and in unsupervised learning, we had just a bunch of unlabeled data, just the x's, and it was the job of the learning algorithm to discover so-called structure in the data, and we saw several algorithms like clustering, K-means, mixtures of Gaussians, PCA, ICA, and so on.

Today, I want to talk about a different class of learning algorithms that's sort of in between supervised and unsupervised. This will be a class of problems where there is some level of supervision, but also much less supervision than what we saw in supervised learning. And this is the problem and formalism called reinforcement learning.

So let me switch to the slides. As a motivating example, here's an example of the sorts of things we do with reinforcement learning. Here's a picture of some of this — I talked about this in Lecture 1 as well — but here's a picture of the autonomous helicopter we have at Stanford. So how would you write a program to make a helicopter like this fly by itself? I'll show you a fun video.
This is actually, I think, the same video that I showed in class in the first lecture, but here's a video taken on the football field at Stanford of using a machine learning algorithm to fly the helicopter. So let's just play the video. You can zoom in the camera and see the trees and the sky.

So in terms of autonomous helicopter flight, this was done by some of my students and me. In terms of autonomous helicopter flight, this is one of the most difficult aerobatic maneuvers flown, and it's actually very hard to write a program to make a helicopter do this. The way this was done was with what's called a reinforcement learning algorithm.

So just to make this more concrete, the learning problem in helicopter flight is: ten times per second, say, your sensors on the helicopter give you a very accurate estimate of the position and orientation of the helicopter, so you know where the helicopter is pretty accurately at all points in time. And your job is to take this input, the position and orientation, and to output a set of numbers that correspond to where to move the control sticks to control the helicopter — to make it fly right side up, fly upside down, or do whatever maneuver you want.

And this is different from supervised learning, because usually we actually don't know what the right control action is. More specifically, if the helicopter is in a certain position and orientation, it's actually very hard to say that when the helicopter is doing this, you should move the control sticks to exactly these positions. So it's very hard to apply supervised learning algorithms to this problem, because we can't come up with a training set where the input is the position and the output is all the right control actions — it's really hard to come up with a training set like that. Instead, in reinforcement learning, we'll give the learning algorithm a different type of feedback, called a reward signal, which will tell the helicopter when it's doing well and when it's doing poorly.
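Just to make the shape of that problem concrete, here is a minimal sketch of the sense-act loop described above, in Python. Everything in it is hypothetical: `read_position_and_orientation`, `choose_controls`, and `send_to_control_sticks` are placeholder names standing in for the helicopter's sensor interface, the learned controller, and the actuator interface; they are not part of any real API mentioned in the lecture.

```python
import time

def control_loop(read_position_and_orientation, choose_controls,
                 send_to_control_sticks, hz=10):
    """Run the sense-act loop sketched above, roughly `hz` times per second.

    All three arguments are hypothetical callables:
      - read_position_and_orientation(): returns the current state estimate
      - choose_controls(state): the learned mapping from state to control inputs
      - send_to_control_sticks(controls): pushes the chosen controls to the helicopter
    """
    period = 1.0 / hz
    while True:
        state = read_position_and_orientation()   # where the helicopter is right now
        controls = choose_controls(state)          # the part reinforcement learning must supply
        send_to_control_sticks(controls)
        time.sleep(period)
```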
So what we'll end up doing is coming up with something called a reward signal — and I'll formalize this later — which will be a measure of how well the helicopter is doing, and then it will be the job of the learning algorithm to take just this reward function as input and try to fly well.

Another good example of reinforcement learning is thinking about getting a program to play a game, to play a game of chess. At any stage in the game, we actually don't know what the optimal move is, so it's very hard to pose playing chess as a supervised learning problem, because we can't say the x's are the board positions and the y's are the optimal moves — we just don't know how to create any training examples of optimal moves in chess. But what we do know is, if you have a computer playing games of chess, we know when it's won a game and when it's lost a game. So what we do is we give it a reward signal: give it a positive reward when it wins a game of chess and give it a negative reward whenever it loses, and hopefully have it learn to win more and more games by itself over time.

So the way I'd like you to think about reinforcement learning is to think about training a dog. Every time your dog does something good, you sort of tell it, "Good dog," and every time it does something bad, you tell it, "Bad dog," and over time, your dog learns to do more and more good things. In the same way, when we try to fly a helicopter, every time the helicopter does something good, you say, "Good helicopter," and every time it crashes, you say, "Bad helicopter," and then over time, it learns to do the right things more and more often.

One of the reasons that reinforcement learning is much harder than supervised learning is that this is not a one-shot decision making problem. In supervised learning, if you have a classification problem — predicting whether someone has cancer or not — you make a prediction and then you're done, right? Your patient either has cancer or not, you're either right or wrong, they live or die, whatever. You make a decision and then you're done. In reinforcement learning, you have to keep taking actions over time, so it's called sequential decision making.
So concretely, suppose a program loses a game of chess on move number 60. Then it has actually made 60 moves before it got this negative reward of losing the game of chess, and the thing that makes it hard for the algorithm to learn from this is called the credit assignment problem.

Just to state that informally: if the program loses a game of chess on move 60, you're actually not quite sure, of all the moves it made, which ones were the right moves and which ones were the bad moves. Maybe it's because you blundered on move number 23, and everything else you did may have been perfect, but because you made a mistake on move 23 of your game of chess, you eventually end up losing on move 60. So, to define it very loosely, the credit assignment problem is: when you get a positive or negative reward, to figure out what you actually did right or did wrong to cause that reward, so you can do more of the right things and less of the wrong things. And this is one of the things that makes reinforcement learning hard.

In the same way, if the helicopter crashes, you may not know why. It may be something you did many minutes ago that caused the helicopter to crash. In fact, if you ever crash a car — and hopefully none of you ever get in a car accident — but when someone crashes a car, usually the thing they're doing right before they crash is stepping on the brakes to slow the car down before the impact. And usually stepping on the brakes does not cause the crash; rather, it makes the crash sort of hurt less. But a reinforcement learning algorithm sees the pattern that you step on the brakes and then you crash, and it's hard for it to figure out that it's not actually your stepping on the brakes that caused the crash, but something you did long before that.
So let me go ahead and formalize the reinforcement learning problem more, and as a preface, let me say that these algorithms are applied to a broad range of problems, but because robotics videos are easy to show in lecture — I have a lot of them — throughout this lecture I'll use a bunch of robotics examples. Later, we'll talk about applications of these ideas to broader ranges of problems as well. But the basic problem I'm facing is sequential decision making: we need to make many decisions, and our decisions perhaps have long term consequences.

So let's formalize the reinforcement learning problem. Reinforcement learning problems model the world using something called an MDP, or Markov Decision Process, formalism. And let's see, an MDP is a five-tuple — I don't have enough space — well, comprising five things: (S, A, {Psa}, γ, R). So let me say what these are. Actually, could you please raise the screen? I won't need the laptop anymore today. [Inaudible] more space. Yep, great. Thanks.

So an MDP comprises a five-tuple. The first of these elements, S, is a set of states, and so for the helicopter example, the set of states would be the possible positions and orientations of the helicopter. A is a set of actions; again, for the helicopter example, this would be the set of all possible positions that we could put our control sticks into. The Psa are the state transition distributions. So for each state and each action, Psa is a probability distribution: the sum over s' of Psa(s') equals 1, and Psa(s') is greater than or equal to zero. And the state transition distributions — or state transition probabilities — work as follows: P subscript (s, a) gives me the probability distribution over what state I will transition to, what state I wind up in, if I take an action a in a state s.
So this is the probability distribution over the states s' that I get to when I take the action a in the state s. Now, I'll come back to this in a second. Gamma is a number called the discount factor — don't worry about this yet, I'll say what this is in a second. It's usually a number strictly less than 1 and greater than or equal to zero. And R is our reward function, so the reward function maps from the set of states to the real numbers, and it can be positive or negative.

So just to make these elements concrete, let me give a specific example of an MDP. Rather than talking about something as complicated as helicopters, I'm going to use a much smaller MDP as the running example for the rest of today's lecture; we'll look at much more complicated MDPs in subsequent lectures. This is an example that I adapted from the textbook by Stuart Russell and Peter Norvig called Artificial Intelligence: A Modern Approach (Second Edition). And this is a small MDP that models a robot navigation task, in which you imagine you have a robot that lives in this grid world, where the shaded-in cell is an obstacle, so the robot can't go onto that cell.

And so, let's see. I would really like the robot to get to this upper-right corner cell, let's say, so I'm going to associate that cell with a +1 reward, and I'd really like it to avoid this other grid cell, so I'm going to associate that grid cell with a -1 reward. So let's actually iterate through the five elements of the MDP and see what they are for this problem.

The robot can be in any of these eleven positions, and so I have an MDP with 11 states: the set capital S corresponds to the 11 places it could be in. And let's say my robot, in this highly simplified example, can try to move in each of the compass directions, so in this MDP I'll have four actions, corresponding to moving in each of the North, South, East, West compass directions.
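As a running sketch in Python (my own encoding, not anything written on the board), the 11 states can be represented as (x, y) cells on the 4-by-3 grid with the obstacle removed, and the four actions as compass directions. The lecture only says the obstacle is the shaded-in cell; placing it at (2,2), as in the Russell and Norvig version of this example, is an assumption.

```python
# The 4x3 grid world: states are (x, y) cells, with x in 1..4 and y in 1..3.
# The shaded obstacle cell is assumed to be (2, 2), as in the Russell & Norvig example.
OBSTACLE = (2, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != OBSTACLE]
ACTIONS = ['N', 'S', 'E', 'W']   # the four compass-direction actions

assert len(STATES) == 11         # eleven states, as in the lecture
```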
Let's say that my robot's dynamics are noisy. If you've worked in robotics before, you know that if you command a robot to go north, then because of wheel slip or imprecision in the actuation or whatever, there's a small chance that your robot will end up slightly off to one side. So you command your robot to move forward one meter, and usually it will move forward somewhere between, like, 95 centimeters and 105 centimeters.

So in this highly simplified grid world, I'm going to model the stochastic dynamics of my robot as follows. I'm going to say that if you command the robot to go north, there's actually a 10 percent chance that it will accidentally veer off to the left and a 10 percent chance that it will veer off to the right, and only a 0.8 chance that it will manage to go in the direction you commanded. This is sort of a crude model of wheel slip on the robot. And if the robot bounces off a wall, then it just stays where it is and nothing happens.

So let's see. Concretely, we would write this down using the state transition probabilities. For example, let's take the state — let me call it the (3,1) state — and let's say you command the robot to go north. To specify these noisy dynamics of the robot, you would write down state transition probabilities as follows. You say that if you're in the state (3,1) and you take the action north, your chance of getting to (3,2) is 0.8: P_{(3,1),N}((3,2)) = 0.8. If you're in the state (3,1) and you take the action north, the chance of getting to (4,1) is 0.1: P_{(3,1),N}((4,1)) = 0.1. And so on. And this last line says that if you're in the state (3,1) and you take the action north, the chance of you getting to the state (3,3) is zero — your chance of transitioning in one time step to the state (3,3) is equal to zero: P_{(3,1),N}((3,3)) = 0. So these are the state transition probabilities for my MDP.
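Here is a minimal sketch of those noisy dynamics, building on the `STATES`, `ACTIONS`, and `OBSTACLE` names from the sketch above. The "veer left / veer right" convention (left and right relative to the commanded direction) is my reading of the lecture, and the helper names are mine.

```python
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
LEFT_OF  = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
RIGHT_OF = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

def step(state, direction):
    """Deterministic move; if you'd hit a wall or the obstacle, you stay where you are."""
    dx, dy = MOVES[direction]
    x, y = state[0] + dx, state[1] + dy
    if 1 <= x <= 4 and 1 <= y <= 3 and (x, y) != OBSTACLE:
        return (x, y)
    return state

def transition_probs(state, action):
    """P_{sa}: distribution over next states s' -- 0.8 as commanded, 0.1 veer left, 0.1 veer right."""
    dist = {}
    for direction, p in [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]:
        s_next = step(state, direction)
        dist[s_next] = dist.get(s_next, 0.0) + p
    return dist

# The example from the board: commanding North from (3,1)
# gives {(3,2): 0.8, (4,1): 0.1, (2,1): 0.1}, and P_{(3,1),N}((3,3)) = 0.
print(transition_probs((3, 1), 'N'))
```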
The last two elements of my five-tuple are gamma and the reward function. Let's not worry about gamma for now. My reward function would be as follows. I really want the robot to get to the (4,3) state — I'm indexing the states using the numbers I wrote at the sides of the grid — so my reward for the (4,3) state is +1, and my reward for the (4,2) state is -1. And, as is fairly common practice in navigation tasks, for all the other states — the non-terminal states — I'm going to associate a small negative reward, -0.02, and you can think of this as a small negative reward that charges my robot for its battery or fuel consumption for wandering around. A small negative reward like this, which charges the robot for running around randomly, tends to cause the system to compute solutions that don't waste time and that make their way to the goal as quickly as possible, because the robot is charged for fuel consumption.

Okay. So, well, let me just mention that there's actually one other complication that I'm going to sort of not worry about. In this specific example, I'm going to assume that when the robot gets to the +1 or the -1 reward, then the world ends. So you get the +1 and then that's it — the world ends; there are no more rewards, positive or negative, after that. There are various ways to model that. One way to think about it is to imagine there's actually a 12th state, something called a zero-cost absorbing state, so that whenever you get to the +1 or the -1, you then transition with probability one to this 12th state, and you stay in this 12th state forever with no more rewards. I just mention that: when you get to the +1 or the -1, think of the problem as finishing. The reason I do that is because it makes some of the numbers come out nicer and easier to understand. So sometimes you hear the term zero-cost absorbing state — it's another state such that when you enter it, there are no more rewards, and you always stay in that state forever. All right.
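Continuing the same sketch, the reward function and discount factor might be encoded as follows. The -0.02 per-step reward is the figure the lecture mentions a little later; treating (4,3) and (4,2) as terminal (equivalently, as feeding into the zero-cost absorbing state) is handled here with a simple `TERMINAL` set.

```python
GAMMA = 0.99                      # discount factor; "a number like 0.99", as in the lecture
TERMINAL = {(4, 3), (4, 2)}       # once here, the world ends (zero-cost absorbing state)

def reward(state):
    """R: S -> real numbers."""
    if state == (4, 3):
        return +1.0               # the goal cell
    if state == (4, 2):
        return -1.0               # the cell to avoid
    return -0.02                  # small charge for fuel/battery on every other step
```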
So let's just see how an MDP works, and it works as follows. At time 0, your robot starts off at some state s0, and depending on where you are, you get to choose an action a0 — to decide to go North, South, East, or West. Depending on your choice, you get to some state s1, which is going to be randomly drawn from the state transition distribution indexed by the state s0 and the action a0 you just chose. So the next state you get to depends — in a probabilistic way — on the previous state and the action you just took. After you get to the state s1, you get to choose a new action a1, and then as a result of that, you get to some new state s2, drawn sort of randomly from the state transition distributions, and so on. Okay?

So after your robot does this for a while, it will have visited some sequence of states s0, s1, s2, and so on, and to evaluate how well we did, we'll take the reward function and we'll apply it to the sequence of states, and add up the sum of rewards that your robot obtained on the sequence of states it visited. You start at s0, take an action, get to s1, take an action, get to s2, and so on. So you apply the reward function to every state in the sequence, and this is the sum of rewards you obtain. Let me modify this one more bit: we'll multiply the second term by gamma, the next term by gamma squared, the next term by gamma cubed, and so on:

R(s0) + γ·R(s1) + γ²·R(s2) + ⋯

And this is called — I'm going to call this — the total payoff for the sequence of states s0, s1, s2, and so on that your robot visited.
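In code, the total payoff of a visited state sequence is just this discounted sum; here is a minimal sketch using the `reward` and `GAMMA` names defined in the earlier sketch, with an arbitrary example trajectory of my own choosing.

```python
def total_payoff(state_sequence):
    """R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for the states the robot visited."""
    return sum((GAMMA ** t) * reward(s) for t, s in enumerate(state_sequence))

# e.g. a short trajectory that happens to end in the +1 state:
print(total_payoff([(3, 1), (3, 2), (3, 3), (4, 3)]))
```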
And let me also say what gamma is. Gamma, the discount factor, is a number that's usually less than one; you usually think of gamma as a number like 0.99. So the effect of gamma is that the reward you obtain at time 1 is given a slightly smaller weight than the reward you get at time zero, and the reward you get at time 2 is even a little bit smaller than the reward you get at the previous time step, and so on.

Let's see. And so if this is an economic application — if you're doing, say, stock market trading with this kind of algorithm, or some other economic application — then your rewards are dollars earned and lost, and this discount factor gamma has a very natural interpretation as the time value of money, because a dollar today is worth slightly more than a dollar tomorrow, since a dollar in the bank can earn a little bit of interest. And conversely, having to pay out a dollar tomorrow is better than having to pay out a dollar today. So in other words, the effect of this discount factor gamma is to weight wins or losses in the distant future less than wins or losses in the immediate future.

And so the goal of the reinforcement learning algorithm is to choose actions over time — to choose actions a0, a1, and so on — to try to maximize the expected value of this total payoff:

E[R(s0) + γ·R(s1) + γ²·R(s2) + ⋯]

More concretely, what we will try to do is have our reinforcement learning algorithms compute a policy, which I denote by lowercase pi (π). And all a policy is — the definition of a policy — is a function mapping from the states to the actions. So a policy tells us, for every state, what action it recommends we take in that state.
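In the running sketch, a policy is nothing more than a table from states to actions. The particular choice below is arbitrary, purely to illustrate the data structure — it is not the optimal policy drawn on the board.

```python
# A policy pi: S -> A, here just a dictionary over the STATES from the earlier sketch.
pi = {s: 'N' for s in STATES}     # "always try to go North" ...
pi[(1, 3)] = 'E'                  # ... except head East along the top row toward +1
pi[(2, 3)] = 'E'
pi[(3, 3)] = 'E'

print(pi[(3, 1)])                 # the action this policy recommends in state (3,1)
```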
So concretely, here is an example of a policy. And this actually turns out to be the optimal policy for this MDP — I'll tell you later how I computed it. So this is an example of a policy: a policy is just a mapping from the states to the actions, and so this policy tells me that when you're in this state, you should take the left action, and so on. And this particular policy I drew out happens to be an optimal policy, in the sense that when you execute this policy, it will maximize your expected value of the total payoff — it will maximize your expected total sum of discounted rewards.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, so, can a policy be over multiple states — can it be a function of not only the current state, but the state I was in previously as well? The answer is yes. Sometimes people call those strategies instead of policies, but usually we're going to use policies. It actually turns out that for an MDP, allowing policies that depend on previous states will not allow you to do any better, at least in the limited context we're talking about. So in other words, there exists a policy that only ever looks at the current state and that maximizes my expected total payoff. This statement won't be true for some of the richer models we'll talk about later, but for now, it suffices to just look at the current states and actions.

And sometimes I'll use the phrase "execute a policy" to mean that I'm going to take actions according to the policy. So to execute the policy π means that whenever I'm in some state s, I'm going to take the action that the policy π outputs when given the current state.

All right. So it turns out that one of the things MDPs are very good at is making subtle tradeoffs. All right, let's look at this state: the optimal policy in this state is to go left. This probably wasn't very obvious — why is it that the optimal action is to go left and take a longer path there?
The alternative would be to go north and try to find a much shorter path to the +1 state. But when you're in this state over here — this, I guess, (3,2) state — when you go north from that state, there's a 0.1 chance you accidentally veer off to the right, to the -1 state. And so there are subtle tradeoffs: is it better to take the longer, safer route — though the discount factor tends to discourage that, and the 0.02 charge per step tends to discourage that — or is it better to take the shorter, riskier route? It wasn't obvious to me until I computed it, but this is one of those things that the MDP machinery is very good at: making these subtle tradeoffs, and making them optimally.

So what I want to do next is make a few more definitions, and that will lead us to our first algorithm for computing optimal policies in MDPs — so, finding optimal ways to act in MDPs. Before I move on, let's check for any questions about the MDP formalism.

Okay, cool. So let's now talk about how we actually go about computing an optimal policy like that, and to get there, I need to define a few things. Just as a preview of the next moves I'm going to take: I'm going to define something called Vπ, then I'm going to define V*, and then I'm going to define π*, the optimal policy. And as I define these things, keep in mind what is a definition and what is a consequence of a definition. In particular, I won't be defining π* to be the optimal policy; I'll define π* by a different equation, and it will be a consequence of my definition that π* is the optimal policy.

The first thing I want to define is Vπ. So for any given policy π, I'm going to define the value function Vπ, and sometimes I call this the value function for π.
So I want to find the value function for Vp, the 521 00:29:21,330 --> 00:29:27,660 function mapping from the state's known numbers, such that 522 00:29:27,660 --> 00:29:34,660 Vp(s) is the expected payoff - 523 00:29:36,910 --> 00:29:40,950 is the expected total payoff if 524 00:29:40,950 --> 00:29:44,860 you started 525 00:29:44,860 --> 00:29:48,809 in the state s and 526 00:29:48,809 --> 00:29:52,710 execute p. 527 00:29:52,710 --> 00:29:55,430 So in other words, 528 00:29:55,430 --> 00:30:01,000 Vp(s) is equal to the expected value of this here, sum of this 529 00:30:01,000 --> 00:30:04,580 counted rewards, the total payoff, 530 00:30:04,580 --> 00:30:06,429 given that 531 00:30:06,429 --> 00:30:10,440 you execute the policy p and 532 00:30:10,440 --> 00:30:13,880 the first state in the sequence is zero, is that 533 00:30:13,880 --> 00:30:15,470 state s. 534 00:30:15,470 --> 00:30:19,600 I say this is slightly sloppy probabilistic notation, so p isn't really in 535 00:30:19,600 --> 00:30:22,660 around the variable, so maybe I shouldn't actually be conditioning 536 00:30:22,660 --> 00:30:25,180 on p. This is sort of 537 00:30:25,180 --> 00:30:31,650 moderately standard notation horror, so we use the steady sloppy policy notation. 538 00:30:31,650 --> 00:30:33,160 So 539 00:30:33,160 --> 00:30:40,160 as a concrete example, 540 00:30:43,110 --> 00:30:45,130 541 00:30:45,130 --> 00:30:52,130 542 00:30:54,549 --> 00:30:56,580 here's a policy and 543 00:30:56,580 --> 00:31:03,580 this 544 00:31:13,830 --> 00:31:17,340 is not a great policy. This is just some policy p. It's actually a pretty bad 545 00:31:17,340 --> 00:31:18,260 policy that 546 00:31:18,260 --> 00:31:23,160 for many states, seems to be heading to the 1 rather than the +1. 547 00:31:23,160 --> 00:31:26,200 And 548 00:31:26,200 --> 00:31:30,640 so the value function is the function mapping from the states of known numbers, 549 00:31:30,640 --> 00:31:37,440 so it associates each state with a number 550 00:31:37,440 --> 00:31:44,440 551 00:31:58,670 --> 00:32:00,860 and in this case, this is Vp. So that's the value function 552 00:32:00,860 --> 00:32:03,230 for this policy. 553 00:32:03,230 --> 00:32:07,650 And so you notice, for instance, that for all the states in the 554 00:32:07,650 --> 00:32:09,130 bottom two rows, 555 00:32:09,130 --> 00:32:13,160 I guess, this is a really bad policy that has a high chance to take you to the 556 00:32:13,160 --> 00:32:14,270 -1 state 557 00:32:14,270 --> 00:32:16,059 and so all the 558 00:32:16,059 --> 00:32:17,500 559 00:32:17,500 --> 00:32:21,440 values for those states in the bottom two rows are negative. 560 00:32:21,440 --> 00:32:25,120 So in this expectation, your total payoff would be negative. And if you 561 00:32:25,120 --> 00:32:27,359 execute this rather bad policy, you 562 00:32:27,359 --> 00:32:29,430 start in any of the states in the bottom row, and 563 00:32:29,430 --> 00:32:33,830 if you start in the top row, the total payoff would be positive. This is not 564 00:32:33,830 --> 00:32:35,639 a terribly bad policy if it 565 00:32:35,639 --> 00:32:39,460 stays in the topmost row. 566 00:32:39,460 --> 00:32:40,880 And so 567 00:32:40,880 --> 00:32:43,320 given any policy, you can write down 568 00:32:43,320 --> 00:32:45,280 a value function 569 00:32:45,280 --> 00:32:50,470 for that policy. 570 00:32:50,470 --> 00:32:52,679 If 571 00:32:52,679 --> 00:32:59,679 some of you are still writing, I'll leave that up for a second while I clean another couple of boards. 
572 00:33:10,500 --> 00:33:12,540 Okay. 573 00:33:12,540 --> 00:33:16,470 So 574 00:33:16,470 --> 00:33:21,990 the circumstances of the following, 575 00:33:21,990 --> 00:33:25,170 Vp(s) is equal to - well, 576 00:33:25,170 --> 00:33:28,020 the expected value of R if 577 00:33:28,020 --> 00:33:31,440 s is zero, which is the reward you get right away 578 00:33:31,440 --> 00:33:35,250 for just being in the initial state s, plus - let 579 00:33:35,250 --> 00:33:38,160 me write this like this - I'm going to write gamma 580 00:33:38,160 --> 00:33:39,820 and then R if s1 plus gamma, R 581 00:33:39,820 --> 00:33:43,270 of s2 plus dot, dot, 582 00:33:43,270 --> 00:33:44,740 dot, 583 00:33:44,740 --> 00:33:48,500 what 584 00:33:48,500 --> 00:33:55,500 condition of p. 585 00:34:07,600 --> 00:34:08,520 Okay? So just 586 00:34:08,520 --> 00:34:10,999 de-parenthesize these. 587 00:34:10,999 --> 00:34:14,399 This first term here, this is sometimes called the immediate reward. This is 588 00:34:14,399 --> 00:34:16,399 the reward you get right away 589 00:34:16,399 --> 00:34:18,710 just for starting in the state at zero. 590 00:34:18,710 --> 00:34:22,089 And then the second term here, these are sometimes called the future reward, which 591 00:34:22,089 --> 00:34:23,499 is the rewards you get 592 00:34:23,499 --> 00:34:26,459 sort of one time step into the future 593 00:34:26,459 --> 00:34:28,219 and what I want you 594 00:34:28,219 --> 00:34:31,739 to note is what this term is. That 595 00:34:31,739 --> 00:34:32,719 term there 596 00:34:32,719 --> 00:34:35,199 is really just 597 00:34:35,199 --> 00:34:39,019 the value function 598 00:34:39,019 --> 00:34:40,579 for the state s1 599 00:34:40,579 --> 00:34:43,449 because this term here in parentheses, this is really - 600 00:34:43,449 --> 00:34:46,499 suppose I was to start in the state s1 or 601 00:34:46,499 --> 00:34:49,950 this is the sum of the counter rewards I would get if I were to start in the state s1. 602 00:34:49,950 --> 00:34:53,589 So my immediate reward for starting in s1 would be R (s1), then 603 00:34:53,589 --> 00:34:55,599 plus gamma times 604 00:34:55,599 --> 00:34:59,029 additional future rewards in the future. 605 00:34:59,029 --> 00:35:01,639 And so 606 00:35:01,639 --> 00:35:04,199 it turns out you can 607 00:35:04,199 --> 00:35:09,540 write VRp recursively in terms of 608 00:35:09,540 --> 00:35:10,629 609 00:35:10,629 --> 00:35:12,220 itself. 610 00:35:12,220 --> 00:35:16,049 And presume that VRp is equal to the 611 00:35:16,049 --> 00:35:18,639 immediate reward plus gamma times - 612 00:35:18,639 --> 00:35:20,219 actually, 613 00:35:20,219 --> 00:35:21,809 let 614 00:35:21,809 --> 00:35:24,690 me write - it would 615 00:35:24,690 --> 00:35:30,099 just be mapped as notation - s0 gets mapped to s and s1 gets mapped to s prime and 616 00:35:30,099 --> 00:35:31,539 it 617 00:35:31,539 --> 00:35:33,730 just makes it all right. So value function 618 00:35:33,730 --> 00:35:36,829 for p to stay zero is the immediate reward 619 00:35:36,829 --> 00:35:39,780 plus this current factor gamma times 620 00:35:39,780 --> 00:35:41,549 - 621 00:35:41,549 --> 00:35:44,839 and now you have Vp of s1. Right here is Vp of s 622 00:35:44,839 --> 00:35:46,529 prime. 
623 00:35:46,529 --> 00:35:49,140 But s prime is a random variable 624 00:35:49,140 --> 00:35:50,959 because the next state you get to 625 00:35:50,959 --> 00:35:53,049 after one time set is random 626 00:35:53,049 --> 00:35:57,569 and so in particular, taking expectations, this is the sum of all 627 00:35:57,569 --> 00:36:01,349 states s prime 628 00:36:01,349 --> 00:36:02,440 of 629 00:36:02,440 --> 00:36:08,009 your probability of getting to that state 630 00:36:08,009 --> 00:36:10,009 times 631 00:36:10,009 --> 00:36:17,009 that. 632 00:36:17,439 --> 00:36:19,319 And 633 00:36:19,319 --> 00:36:22,169 just to be clear on this notation, right, this is 634 00:36:22,169 --> 00:36:27,249 P subscript (s, a) of s prime is the chance of you getting to the state s prime when you take the 635 00:36:27,249 --> 00:36:29,409 action a in state s. 636 00:36:29,409 --> 00:36:33,160 And in this case, we're executing the policy p 637 00:36:33,160 --> 00:36:36,029 and so this is P(s) p(s) 638 00:36:36,029 --> 00:36:37,740 because 639 00:36:37,740 --> 00:36:42,609 the action we're going to take in state s is the action p reverse. 640 00:36:42,609 --> 00:36:43,899 So this is 641 00:36:43,899 --> 00:36:48,689 - in other words, this Ps 642 00:36:48,689 --> 00:36:49,249 p(s), 643 00:36:49,249 --> 00:36:53,419 this distribution overstates s prime that you would transitioned to, the 644 00:36:53,419 --> 00:36:54,539 one time step, 645 00:36:54,539 --> 00:36:57,469 when you take the action p(s) 646 00:36:57,469 --> 00:37:03,209 in the state s. 647 00:37:03,209 --> 00:37:10,209 So just to give this a name, this equation is called Bellman's equations and is one of the 648 00:37:13,059 --> 00:37:20,059 central equations that we'll use over and over 649 00:37:22,359 --> 00:37:24,709 when we 650 00:37:24,709 --> 00:37:28,689 solve MDPs. Raise your hand if this equation makes 651 00:37:28,689 --> 00:37:29,919 sense. Some of 652 00:37:29,919 --> 00:37:36,919 you didn't raise your hands. Do you have questions? 653 00:37:40,969 --> 00:37:42,739 So let's try to say 654 00:37:42,739 --> 00:37:44,779 this again. 655 00:37:44,779 --> 00:37:50,630 Actually, which of the symbols don't make sense for those of you that didn't raise your hands? 656 00:37:50,630 --> 00:37:56,640 You're regretting not raising your hand now, aren't you? Let's try saying 657 00:37:56,640 --> 00:38:00,249 this one more time and maybe it will come clear later. 658 00:38:00,249 --> 00:38:01,139 So 659 00:38:01,139 --> 00:38:03,089 what 660 00:38:03,089 --> 00:38:05,889 is it? So this equation is sort of 661 00:38:05,889 --> 00:38:09,339 like my value at the current state is equal to 662 00:38:09,339 --> 00:38:12,400 R(s) plus 663 00:38:12,400 --> 00:38:14,149 gamma times - 664 00:38:14,149 --> 00:38:18,780 and depending on what state I get to next, 665 00:38:18,780 --> 00:38:21,019 my expected total payoff 666 00:38:21,019 --> 00:38:22,950 from the state 667 00:38:22,950 --> 00:38:25,589 s prime is Vp of s prime, 668 00:38:25,589 --> 00:38:29,799 whereas prime is the state I get to after one time step. So incurring one state as 669 00:38:29,799 --> 00:38:33,379 I'm going to take some action and I get to some state that's prime and this equation 670 00:38:33,379 --> 00:38:37,669 is sort of my expected total payoff for executing the policy p from the 671 00:38:37,669 --> 00:38:40,969 state s. 
But 672 00:38:40,969 --> 00:38:45,369 s prime is random because the next state I get to is random and well, 673 00:38:45,369 --> 00:38:46,819 we'll use the 674 00:38:46,819 --> 00:38:51,299 next board. The chance I get to some specific state as prime 675 00:38:51,299 --> 00:38:55,869 is given by we have a P subscript (s, a), s prime, 676 00:38:55,869 --> 00:38:57,630 where - 677 00:38:57,630 --> 00:39:00,349 because these are just [inaudible] and probabilities - 678 00:39:00,349 --> 00:39:04,189 where the action a I chose is 679 00:39:04,189 --> 00:39:07,489 given by p(s) because I'm executing the action a 680 00:39:07,489 --> 00:39:10,140 in the current state s. 681 00:39:10,140 --> 00:39:14,990 And so when you plug this back in, you get P subscript s R(s) 682 00:39:14,990 --> 00:39:18,569 as prime, just gives me the distribution over the states in making the transition 683 00:39:18,569 --> 00:39:20,209 to it in one step, 684 00:39:20,209 --> 00:39:20,809 and hence, 685 00:39:20,809 --> 00:39:27,809 that just needs Bellman's equations. So 686 00:39:28,400 --> 00:39:31,640 since Bellman's equations gives you a 687 00:39:31,640 --> 00:39:32,509 way 688 00:39:32,509 --> 00:39:34,709 to solve for the 689 00:39:34,709 --> 00:39:37,249 value function for policy in closed form. 690 00:39:37,249 --> 00:39:41,439 So again, the problem is suppose I'm given a fixed policy. How do I solve for Vp? How 691 00:39:41,439 --> 00:39:44,399 do I solve for the - 692 00:39:44,399 --> 00:39:50,170 so given fixed policy, how do I solve for this equation? It turns 693 00:39:50,170 --> 00:39:53,479 out, Bellman's equation gives you a way of doing this, 694 00:39:53,479 --> 00:39:55,749 so coming back to this board, 695 00:39:55,749 --> 00:39:57,649 it turns out to be 696 00:39:57,649 --> 00:40:01,769 - sort of coming back to the previous boards, 697 00:40:01,769 --> 00:40:05,069 around - could you 698 00:40:05,069 --> 00:40:08,139 move the camera to point to this board? Okay, cool. 699 00:40:08,139 --> 00:40:10,669 So going back to this board, we work 700 00:40:10,669 --> 00:40:17,669 the problem's equation, this equation, right, 701 00:40:18,499 --> 00:40:21,420 let's say I have a fixed policy p and I want to solve for the 702 00:40:21,420 --> 00:40:23,769 value function for the policy p. 703 00:40:23,769 --> 00:40:28,410 Then what this equation is just imposes a set of linear constraints on the value 704 00:40:28,410 --> 00:40:30,060 function. So in particular, 705 00:40:30,060 --> 00:40:34,530 this says that the value for a given state is equal to some constant, and 706 00:40:34,530 --> 00:40:36,189 then some linear function 707 00:40:36,189 --> 00:40:38,589 of other values. 708 00:40:38,589 --> 00:40:44,739 And so you can write down one such equation for every state in your MDP 709 00:40:44,739 --> 00:40:46,799 and this imposes a set of 710 00:40:46,799 --> 00:40:50,089 linear constraints on what the value function could be. 711 00:40:50,089 --> 00:40:53,619 And then it turns out that by solving the resulting linear system equations, you 712 00:40:53,619 --> 00:40:55,979 can then solve for the value function 713 00:40:55,979 --> 00:40:57,760 Vp(s). 714 00:40:57,760 --> 00:41:00,969 There's a high level description. Let me now make this concrete. 715 00:41:00,969 --> 00:41:07,969 So specifically, 716 00:41:11,449 --> 00:41:16,019 let me take the free one state, that state we're 717 00:41:16,019 --> 00:41:18,759 using as an example. 
So Bellman's equation tells me the value of π for the (3,1) state — oh, and let's say I have a specific policy such that π of (3,1) is the North action, which is not the optimal action. For this policy, Bellman's equation tells me that Vπ((3,1)) is equal to R((3,1)), plus gamma times: with chance 0.8 I get to the (3,2) state, with chance 0.1 I get to the (4,1) state, and with chance 0.1 I get to the (2,1) state. That is,

Vπ((3,1)) = R((3,1)) + γ·[ 0.8·Vπ((3,2)) + 0.1·Vπ((4,1)) + 0.1·Vπ((2,1)) ]

And so what I've done is I've written down Bellman's equation for the (3,1) state. I hope you know what that means — in my little MDP, I'm indexing the states by the numbers 1, 2, 3, 4 that I wrote along the sides of the grid, so the state over there where I drew the circle is the (3,1) state.

So for every one of my 11 states in the MDP, I can write down an equation like this — this one stands just for one state. And you notice that if I'm trying to solve for the values, so these Vπ values are the unknowns, then I will have 11 variables, because I'm trying to solve for the value function for each of my 11 states, and I will have 11 constraints, because I can write down 11 equations of this form — one such equation for each one of my states. So if you write down this sort of equation for every one of your states, you have a set of linear equations with 11 unknowns — 11 equations with 11 unknowns — and you can solve that linear system of equations to get an explicit solution for Vπ. So if you have n states, you end up with n equations and n unknowns, and you solve that to get the values for all of your states.

Okay, cool. So actually, could you just raise your hand if this made sense? Cool.
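Writing one Bellman equation per state and stacking them gives the linear system (I - γ·Pπ)·V = R, which the following sketch solves with NumPy for the fixed policy `pi` from the earlier sketches (it relies on the `STATES`, `transition_probs`, `reward`, `GAMMA`, and `TERMINAL` names defined there). Treating the two terminal states as having no successors is one way of encoding the zero-cost absorbing state from earlier.

```python
import numpy as np

def evaluate_policy(policy):
    """Solve the n-equations / n-unknowns linear system for V^pi exactly."""
    n = len(STATES)
    index = {s: i for i, s in enumerate(STATES)}
    P = np.zeros((n, n))                   # P[i, j] = P_{s_i, pi(s_i)}(s_j)
    R = np.array([reward(s) for s in STATES])
    for s in STATES:
        if s in TERMINAL:
            continue                       # absorbing: no future rewards after +1 / -1
        for s_next, p in transition_probs(s, policy[s]).items():
            P[index[s], index[s_next]] += p
    # V = R + gamma * P V   =>   (I - gamma * P) V = R
    V = np.linalg.solve(np.eye(n) - GAMMA * P, R)
    return {s: V[index[s]] for s in STATES}

V_pi = evaluate_policy(pi)
print(V_pi[(3, 1)])
```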
All right, so that was the value function for a specific policy and how to solve for it. Let me define one more thing. The optimal value function is defined as V*(s) equals the max over all policies π of V^π(s). In other words, for any given state s, the optimal value function says: suppose I take the max over all possible policies π; what is the best possible expected sum of discounted rewards I can get? Or, what is my optimal expected total payoff for starting in state s, taking the max over all possible control policies π?

So it turns out that there's a version of Bellman's equations for V* as well - this is called Bellman's equations for V*, rather than for V^π - and let me just write that down. It says that the optimal payoff you can get from the state s is equal to the following. Starting off in the state s, you're going to get your immediate reward R(s), and then, depending on what action a you take, your expected future payoff is given by the second term. If I take an action a in some state s, then with probability P_sa(s') I transition to the state s', and when I get to the state s', I expect my total payoff from there to be V*(s'), because from then on I'm acting optimally starting from s'. So the only thing in this equation left to fill in is which action a to use, and in order to actually obtain the maximum, the optimal expected total payoff, what you should choose here is the max over actions a - choose the action a that maximizes the expected value of your future payoff. So it just makes sense. This is the version of Bellman's equations for V* rather than V^π, and I'll just say it again: it says my optimal expected total payoff is my immediate reward plus, for the best action I can choose, the max over all actions a of my expected future payoff.
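Reconstructing the board work from the spoken description, the two statements are:

```latex
V^*(s) \;=\; \max_{\pi} V^{\pi}(s)
\qquad\text{and}\qquad
V^*(s) \;=\; R(s) \;+\; \max_{a \in A}\; \gamma \sum_{s'} P_{sa}(s')\, V^*(s')
```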
And this also leads to my definition of π*. Let's say I'm in some state s and I want to know what action to choose. Well, if I'm in some state s, I'm going to get the immediate reward R(s) anyway, so the best action for me to choose is whatever action enables me to maximize the second term. If my robot is in some state s and it wants to know what action to choose, I want to choose the action that maximizes my expected total payoff, and so π*(s) is defined as the arg max over actions a of that same sum, the expected optimal value of the next state. I could also put the γ in there, but γ is just a positive constant - γ is almost always positive - so I just drop it, because it's just a constant scale factor and doesn't affect the arg max. And so the consequence of this definition is that π* is actually the optimal policy, because π* maximizes my expected total payoff. Cool. Any questions at this point? Cool.
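As a small sketch (again with the assumed array conventions, not lecture code), once V* and the transition probabilities are stored as arrays, reading off π* is just one arg max per state:

```python
import numpy as np

def greedy_policy(P, V):
    """pi*(s) = argmax_a sum_{s'} P_sa(s') V(s').

    P : array (n_states, n_actions, n_states); V : array (n_states,).
    gamma is omitted, as in the lecture: a positive constant doesn't change the argmax.
    """
    # Q[s, a] = expected value of the next state if action a is taken in state s
    Q = P @ V                      # shape (n_states, n_actions)
    return np.argmax(Q, axis=1)
```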
So what I'd like to do now is talk about how an algorithm can actually compute π*, the optimal policy. I should write down a little bit more before I do that, but notice that if I can compute V*, the optimal value function, then I can plug it into this equation and I'll be done - if I can compute V*, I can compute the optimal policy. So my strategy for computing the optimal policy will be to compute V* and then plug it into this equation, and that will give me the optimal policy π*. So my next goal will really be to compute V*. But the definition of V* here doesn't lead to a nice algorithm for computing it, because - let's see - I know how to compute V^π for any given policy π by solving that linear system of equations, but there's an exponentially large number of policies. You have 11 states and 4 actions, so the number of policies is 4 to the power of 11. That is a huge space of possible policies, and so I can't actually enumerate all the policies and then take a max over them. I was going to write down some other things first, just to ground the notation, but let's skip that and just talk about the algorithm: we'll come up with an algorithm for computing V*, the optimal value function, and then we'll plug that in here, and that will give us the optimal policy π*.

So this is an algorithm called value iteration, and it makes use of Bellman's equations for the optimal value function to compute V*. Here's the algorithm: initialize V(s) to zero for every state, and then repeatedly perform the update below for every state - that's the entirety of the algorithm; you just repeatedly do this step. Just to be concrete, in my MDP of 11 states, the first step, initialize V(s) := 0, means that in a computer implementation I create an array of 11 elements and set all of them to zero. You can actually initialize it to anything; it doesn't really matter. And now what I'm going to do is take Bellman's equations and keep computing the right-hand side and overwriting the left-hand side - essentially iteratively trying to make Bellman's equations hold true for the numbers V(s) stored along the way. So V(s) here is that array of 11 elements, and I'm going to repeatedly compute the right-hand side and copy it onto V(s). And it turns out that when you do this, it will make V(s) converge to V*(s), which is maybe no surprise, because we know V* satisfies Bellman's equations. Some of these ideas get developed more in the problem sets, so I won't prove the convergence of this algorithm here.
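Written out (reconstructing the board equation from the description of overwriting the left-hand side with the right-hand side of Bellman's equations), the update applied on each pass is:

```latex
V(s) \;:=\; R(s) \;+\; \max_{a \in A}\; \gamma \sum_{s'} P_{sa}(s')\, V(s')
\qquad \text{for every state } s .
```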
Now, some implementation details: it turns out there are two ways you can perform this update. When I say "for every state s, perform this update," one way you can do it is, for every state s, compute the right-hand side, and then simultaneously overwrite the left-hand side for every state s. If you do that, it's called a synchronous update - you update all the states s simultaneously. And if you do a synchronous update, it's sometimes written as follows: you have some value function at the t-th iteration of the algorithm, you compute some function of your entire value function, and then you set your value function to the new version, simultaneously updating all 11 values in your value function. So it's sometimes written V := B(V), where B here is called the Bellman backup operator. Synchronous value iteration takes the value function and applies the Bellman backup operator to it, and the Bellman backup operator just means computing the right-hand side of this update for all the states and overwriting your entire value function.

The other way of performing these updates is asynchronous updates, which is where you update the states one at a time. You go through the states in some fixed order, so you would update V(s) for state number 1, then update V(s) for state number 2, then state number 3, and so on. And when I'm updating V(s) for state number 5, if I end up using the values of states 1, 2, 3, and 4 on the right-hand side, then I use my recently updated values on the right-hand side. So as you update sequentially, when you're updating the fifth state, you're already using the new values for states 1, 2, 3, and 4. That's called an asynchronous update. Either version will cause V(s) to converge to V*(s). Asynchronous updates tend to be just a tiny bit faster, and it turns out the analysis of value iteration with synchronous updates is easier, but that's a minor point.
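Here is a minimal sketch of the synchronous version, under the same assumed array conventions as the earlier snippets (again, not code from the lecture). Each pass is one application of the Bellman backup operator to the whole value function.

```python
import numpy as np

def value_iteration(P, R, gamma, V0=None, n_iters=10000, tol=1e-8):
    """Synchronous value iteration: repeatedly set V := B(V).

    P : array (n_states, n_actions, n_states); R : array (n_states,).
    V0 is an optional starting value function (zeros by default).
    Returns an estimate of V*.
    """
    V = np.zeros(len(R)) if V0 is None else np.array(V0, dtype=float)
    for _ in range(n_iters):
        # Bellman backup: V_new(s) = R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
        V_new = R + gamma * np.max(P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # stop once the backups stop changing V
            return V_new
        V = V_new                              # overwrite all states simultaneously
    return V
```

Passing the result to the greedy_policy sketch above then gives a policy; an asynchronous variant would instead loop over states inside each sweep and overwrite V[s] in place.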
So when you run this algorithm on the MDP - and I forgot to say, all of these values were computed with γ equal to 0.99, and actually Roger Gross, who's, I guess, a master's student, helped me with computing some of these numbers - when you run value iteration on this MDP, the numbers you get for V* are as follows: 0.86, 0.90, and so on. Again, the numbers don't matter that much, but just take a look at them and make sure they intuitively make sense. And then when you plug those into the formula I wrote down earlier for computing π* - well, I drew this previously, but here's the optimal policy π*. And so, just to summarize, the process is: run value iteration to compute V*, which gives you this table of numbers, and then use the formula for π* to compute the optimal policy, which is this policy in this case.

Now, to be completely concrete, let's look at that (3,1) state again. Is it better to go west, to the left, or is it better to go north? Let me just illustrate why I'd rather go west than north, using the formula for π*(s). Let me just write this down - I wrote it down really quickly, so it's messy writing. The way I got these numbers is: suppose I'm in this (3,1) state. If I choose to go west, then with probability 0.8 I get to the state with value 0.75 in the table; with probability 0.1 I veer off and get to the 0.69 state; and with probability 0.1 I head south, bounce off the wall, and stay where I am.
So my expected future payoff for going west is 0.8 times 0.75, plus 0.1 times 0.69, plus 0.1 times 0.71 - the last 0.71 being the value of the state I'm in, if I bounce off the wall to the south and stay where I am - and that gives you 0.740. You can then repeat the same process to estimate your expected total payoff if you go north: with probability 0.8 you actually end up going north, so you get the 0.69 state; with probability 0.1 you end up over here, and with probability 0.1 you end up over there. That maps directly to the corresponding expression, and when you compute the expectation, you get 0.676. So your expected total payoff is higher if you go west than if you go north, and that's why the optimal action in this state is to go west.

So that was value iteration.
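A quick numerical check of that comparison. The two 0.1-probability neighbors for the north action aren't named explicitly in the lecture; the values 0.75 and 0.49 below are assumptions chosen to match the quoted 0.676.

```python
# Expected value of the next state from (3,1) under each action,
# using the V* entries read off the board in lecture.
west  = 0.8 * 0.75 + 0.1 * 0.69 + 0.1 * 0.71   # = 0.740
north = 0.8 * 0.69 + 0.1 * 0.75 + 0.1 * 0.49   # = 0.676 (neighbor values assumed)
print(west, north)                              # west wins, so pi*((3,1)) = west
```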
It turns out there are two standard algorithms for computing optimal policies in MDPs. Value iteration is one - as soon as you finish writing - and the other standard algorithm for computing optimal policies in MDPs is called policy iteration. Let me just write this down. In policy iteration, we initialize the policy π randomly - it doesn't matter; it can be the policy that always goes north, or the policy that takes actions at random, or whatever - and then we repeatedly do the following two steps, and that's the whole algorithm. In the first step, we solve: we take the current policy π and we solve Bellman's equations to obtain V^π. So remember, earlier I said that if you have a fixed policy π, then Bellman's equations define a system of linear equations with 11 unknowns and 11 linear constraints. You solve that linear system of equations to get the value function for your current policy π - and by this notation, I just mean let V be the value function for policy π. Then the second step is to update the policy. In other words, you pretend that your current guess V for the value function is indeed the optimal value function, and you let π(s) be equal to that arg max formula, so as to update your policy π. And it turns out that if you do this, then V will converge to V* and π will converge to π*, and so this is another way to find the optimal policy for an MDP.

In terms of tradeoffs, it turns out that - let's see - in policy iteration, the computationally expensive step is this first one: you need to solve a linear system of equations, with n equations and n unknowns if you have n states. And so if you have a problem with a reasonably small number of states - if you have a problem with, say, 11 states - you can solve the linear system of equations fairly efficiently, and so policy iteration tends to work extremely well for problems with smallish numbers of states, where you can actually solve those linear systems of equations efficiently. So if you have a thousand states, or anything less than that, you can solve a system of a thousand equations very efficiently, and policy iteration will often work fine. If you have an MDP with an enormous number of states - we'll actually often see MDPs with tens of thousands or hundreds of thousands or millions or tens of millions of states - if you have a problem with 10 million states and you try to apply policy iteration, then this step requires solving a linear system of 10 million equations, and that would be computationally expensive. And so for those really, really large MDPs, I tend to use value iteration.
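Putting the two steps together, here is a compact sketch of policy iteration. It reuses the hypothetical evaluate_policy and greedy_policy helpers sketched earlier and the same assumed array layout; it is not the lecture's own code.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=100):
    """Alternate policy evaluation (a linear solve) with greedy policy improvement."""
    n_states, n_actions, _ = P.shape
    pi = np.random.randint(n_actions, size=n_states)   # arbitrary initial policy
    V = np.zeros(n_states)
    for _ in range(max_iters):
        V = evaluate_policy(P, R, pi, gamma)    # step 1: solve Bellman's equations for V^pi
        new_pi = greedy_policy(P, V)            # step 2: act greedily with respect to V
        if np.array_equal(new_pi, pi):          # policy stopped changing
            break
        pi = new_pi
    return pi, V
```

The expensive line is the call to evaluate_policy, which hides the n-by-n linear solve discussed above.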
Let's see, any questions about this?

Student: So is this a convex function, so that a local optimization scheme would work?

Instructor (Andrew Ng): Ah, yes, that's a good question: is this a convex problem? It actually turns out that there is a way to pose the problem of solving for V* as a convex optimization problem, as a linear program - for instance, you can write down V* as the solution of a linear program, so it is a problem you can solve that way. And policy iteration does converge; we're not just stuck with local optima. But the proof of convergence of policy iteration uses somewhat different principles than convex optimization, at least the versions of the proof as far as I can see. You could probably relate this back to convex optimization, but that's not the underlying principle of why this converges. The proof is not that difficult, but it is also longer than I want to go over in this class. Yeah, that was a good point. Cool. Actually, any questions on any of these? Okay.

So we now have two algorithms for solving an MDP. Given the five-tuple - given the set of states, the set of actions, the state transition probabilities, the discount factor, and the reward function - you can now apply policy iteration or value iteration to compute the optimal policy for the MDP. The last thing I want to talk about is: what if you don't know the state transition probabilities? Sometimes you won't know the reward function R either, but let's leave that aside. So for example, let's say you're trying to fly a helicopter, and you don't really know in advance what state your helicopter will transition to when you take an action in a certain state, because helicopter dynamics are kind of noisy - you often don't really know what state you'll end up in. So the standard thing to do, or one standard thing to do, is to estimate the state transition probabilities from data. Let me just write this out. The MDP is a five-tuple, right? S, A, the transition probabilities P_sa, γ, and R. S and A you almost always know: the state space is sort of up to you to define - it's whatever system you're trying to control - and the set of actions is, again, just your set of actions.
Usually, we almost always know these. Gamma, the discount factor, is something you choose, depending on how much you want to trade off current versus future rewards. The reward function you usually know - there are some exceptional cases, but usually you're the one who came up with the reward function, so you usually know what it is. Sometimes you don't, but let's just leave that aside for now. The most common thing you'll actually have to learn is the state transition probabilities, so we'll just talk about how to learn those.

So when you don't know the state transition probabilities, the most common thing to do is just estimate them from data. What I mean is: imagine some robot - maybe it's a robot roaming around the hallway, like in that grid example - you would then have the robot just take actions in the MDP, and you would estimate your state transition probability P_sa(s') to be pretty much exactly what you'd expect it to be: the number of times you took action a in state s and got to s', divided by the number of times you took action a in state s. Okay? So the estimate is just: of all the times you took the action a in the state s, what fraction of the time did you actually get to the state s'? It's pretty much exactly what you'd expect it to be. Or, in case you've never actually tried action a in state s, so that this ratio turns out to be 0 over 0, you can then fall back on some default estimate, like the uniform distribution over all states, which is a reasonable default.
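A sketch of that maximum-likelihood estimate from logged transitions. The (s, a, s') triple format and the function name are assumptions for illustration, not something specified in the lecture.

```python
import numpy as np

def estimate_transitions(transitions, n_states, n_actions):
    """Estimate P_sa(s') from observed (state, action, next_state) index triples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)          # times (s, a) was ever tried
    # Count ratio where we have data; uniform default for the 0/0 cases.
    P = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    return P
```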
And by the way, it turns out that in reinforcement learning - in most of the earlier parts of this class, where we did supervised learning, I talked about, say, the logistic regression algorithm, and there's a fairly standard way to implement logistic regression, or SVMs, or factor analysis, or whatever. It turns out that in reinforcement learning there's more of a mix-and-match sense to it, I guess, so there are often different pieces of different algorithms you can choose to use. So in some of the algorithms I write down, there's more than one way to do it, and I'm giving specific examples, but if you're faced with an AI problem - some of you will be controlling robots - you may want to plug in value iteration here instead of policy iteration, or do something slightly different from one of the specific things I wrote down. That's actually fairly common; in reinforcement learning there are many ways to mix and match different algorithms, and this will come up again in the coming weeks' lectures.

So, just putting the things I said together, here would be an example of how you might estimate the state transition probabilities in an MDP and find a policy for it. You might repeatedly do the following. Let's see. First, take actions using some policy π to get experience in the MDP, meaning just execute the policy π and observe the state transitions. Second, based on the data you get, update your estimates of the state transition probabilities P_sa using the observations you just got. Third, solve Bellman's equations using value iteration - which I'm abbreviating VI, and by Bellman's equations I mean Bellman's equations for V*, not for V^π - to get an estimate of V*. And then, fourth, update your policy by setting π(s) equal to the arg max, as before.
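Sketching that loop in code, reusing the hypothetical helpers above; collect_transitions stands in for whatever simulator or robot run supplies the observed (s, a, s') triples, and is an assumption, not part of the lecture.

```python
import numpy as np

def model_based_rl(collect_transitions, R, n_states, n_actions, gamma, n_rounds=20):
    """Alternate: execute the policy, re-estimate the model, re-solve for V*."""
    pi = np.random.randint(n_actions, size=n_states)    # arbitrary initial policy
    V = np.zeros(n_states)
    data = []
    for _ in range(n_rounds):
        data.extend(collect_transitions(pi))                       # 1. get experience under pi
        P_hat = estimate_transitions(data, n_states, n_actions)    # 2. update P_sa estimates
        V = value_iteration(P_hat, R, gamma, V0=V)                 # 3. solve Bellman's equations
        pi = greedy_policy(P_hat, V)                               # 4. update the policy
    return pi
```

Passing the previous round's V in as the starting point is the warm start discussed next in the lecture.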
And now you have a new policy, so you can then go back and execute this policy in the MDP for a bit more to get some more observations of state transitions, use those to update your estimates of the state transition probabilities again, use value iteration or policy iteration to solve for the value function again, get a new policy, and so on. Okay? And it turns out that when you do this - I actually wrote down value iteration for a reason - in the third step of the algorithm, if you're using value iteration rather than policy iteration, then to initialize value iteration, if you use your solution from the previous time through this loop, that's a very good initialization, and value iteration will tend to converge much more quickly. That's because value iteration tries to solve for V(s) for every state s, and so if you initialize your value iteration algorithm using the values you had from the previous pass through this loop, that will often make it converge faster. But again, you can also just substitute policy iteration in here instead, or whatever, and this is a fairly typical example of how you would alternate between estimating the model and solving for a good policy, for a problem in which you did not know the state transition probabilities in advance. Cool. Questions about this? Cool. So that sure was exciting - that's our first two MDP algorithms in just one lecture. All right, let's close for today. Thanks.