This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, called REINFORCE and Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t? What if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that may be some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t equal to that matrix L_t times s_t. So L_t was this matrix of parameters that the dynamic programming algorithm for finite-horizon MDPs gives us in the LQR problem. And so we said, if only we knew what the state was, we'd choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate for what the state is given all the observations. In particular, I talked about the Kalman filter, with which we worked out our posterior distribution of what the state is given all the observations up to a certain time. So this is from last time.
So that, given the observations y_1 through y_t, our posterior distribution of the current state s_t was Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this thing, this s_{t|t}, which is going to be our best guess for what the state is currently. And then we choose actions using our estimate of what the state is, rather than using the true state, because we don't know the true state anymore in this POMDP.

So it turns out that this specific strategy actually allows you to choose optimal actions — allows you to choose actions as well as you possibly can, given that this is a POMDP and given that there are these noisy observations. It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.

Just to be concrete about the formalism of the POMDP — I should just write it here — a POMDP formally is a tuple like the MDP tuple, where the changes are that we add a set Y, which is the set of possible observations, and these O subscript s, which are the observation distributions. And so at each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O subscript s_t, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the states from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP.
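[The board work being recapped here isn't captured in the transcript; in the notation of the previous lecture, the strategy described is roughly the following, with the Kalman filter supplying the posterior and the LQR gain applied to its mean:]

$$ s_t \mid y_1, \dots, y_t \;\sim\; \mathcal{N}\big(s_{t|t},\, \Sigma_{t|t}\big), \qquad a_t = L_t\, s_{t|t}. $$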
In the more general case, that strategy — of designing a controller assuming full observability and then just estimating the state and plugging the two together — is often a very reasonable strategy for general POMDPs, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.

So what I want to do today is actually talk about a different class of reinforcement learning algorithms. These are called policy search algorithms. In particular, policy search algorithms can be applied equally well to MDPs — to fully observable Markov decision processes — or to these POMDPs, these partially observable MDPs. What I want to do now is actually just describe policy search algorithms applied to MDPs, applied to the fully observable case, and at the end I'll briefly describe how you can take policy search algorithms and apply them to POMDPs. In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard; but nonetheless, policy search algorithms turn out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.

So here's what we're going to do. In policy search, we're going to define some set, which I denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: this is the same as back when we were talking about supervised learning. The way we define the set capital Pi of policies and then search for a policy in this set capital Pi is analogous to supervised learning, where we defined a set script H of hypotheses and would search for a good hypothesis in this hypothesis class script H.
Policy search is sometimes also called direct policy search. To contrast this with the sorts of algorithms we've been talking about so far: in all the algorithms we've been talking about so far, we would try to find V-star — we would try to find the optimal value function — and then we'd use V-star, the optimal value function, to try to compute or approximate pi-star. So all the approaches we talked about previously are strategies for finding a good policy in which we first compute the value function and then go from that to the policy. In contrast, in policy search algorithms — in what are called direct policy search algorithms — the idea is that we're going to, quote, directly try to approximate a good policy, without going through the intermediate stage of trying to find the value function.

Let's see. Also, as I develop policy search, there's just one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and some labels y, and I sort of said, let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air. In the same way, as I define policy search algorithms, there'll be a step where I say, well, let's try to approximate what a good action is using a logistic function of the state. So again, I'll sort of pull a function out of the air. I'll say, let's just choose a function, and that'll be our choice of the policy class, and I'll say, let's take as input the state, and then we'll map it through a logistic function, and then hopefully we'll approximate what is a good action using a logistic function of the state.
So there's that sort of choice of policy class, which is again a little bit arbitrary, but it's arbitrary in the same way it was when we were talking about supervised learning.

So to develop our first policy search algorithm, I'm actually gonna need a new definition. For our first policy search algorithm, we'll actually need to work with stochastic policies. What I mean by a stochastic policy is a function pi that maps from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so we have the constraint that the sum over a of pi(s, a) equals one — in other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose you are executing some stochastic policy pi. What that means is — in this example, let's say I have three actions — suppose I'm in some state s. I would then compute pi(s, a1), pi(s, a2), and pi(s, a3) if I have a three-action MDP. These will be three numbers that sum up to one, and then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.

So as a concrete example — because the concept of why you want to use a stochastic policy is maybe a little bit hard to understand — let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm gonna use the inverted pendulum as my motivating example. It's that problem of balancing a pole.
We have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm gonna use phi to denote the angle of the pole, and I have two actions, where a1 is to accelerate right and a2 is to accelerate left. And choose a reward function that penalizes the pole falling over, or whatever.

Now let's come up with a stochastic policy for this problem. To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice. I'm gonna say that the probability of action a1, so pi(s, a1), I'm gonna write as 1/(1 + e^(-theta^T s)). I just chose the logistic function because it's a convenient function we've used a lot. So I'm gonna say that my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. If I'm executing that policy with parameters theta, that means the chance of my choosing to accelerate right is given by this number; and because my chances of executing actions a1 or a2 must sum to one, this also gives me pi(s, a2) = 1 - pi(s, a1). So this means that when I'm in some state s, I'm going to compute this number, 1/(1 + e^(-theta^T s)), and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.
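[To make this concrete, here is a minimal sketch of executing this two-action logistic policy. The lecture itself doesn't use code; Python, the function names, and the particular state-vector layout below are illustrative assumptions.]

```python
import numpy as np

def pi_a1(theta, s):
    """pi_theta(s, a1) = 1 / (1 + exp(-theta^T s)): probability of accelerating right."""
    return 1.0 / (1.0 + np.exp(-theta.dot(s)))

def sample_action(theta, s, rng=np.random):
    """Execute the stochastic policy: accelerate right with probability
    pi_theta(s, a1), accelerate left with the remaining probability."""
    return "accelerate_right" if rng.rand() < pi_a1(theta, s) else "accelerate_left"

# Illustrative state vector (1, x, x_dot, phi, phi_dot) with an intercept feature,
# and a theta that looks only at the pole angle phi.
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
s = np.array([1.0, 0.2, 0.0, 0.3, -0.1])   # pole leaning to the right (phi > 0)
print(sample_action(theta, s))              # accelerates right more often than left
```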
And again, just to give you a sense of why this might be a reasonable thing to do, let's say my state vector is the cart-pole state — the cart position, the cart velocity, the pole angle phi, and the angular velocity — and I added an extra one as an intercept term, just to give my logistic function an extra feature. If I choose the parameters theta in my policy so that the inner product theta^T s just picks out phi, then that means that at any state, the probability of my taking action a1 — the probability of my taking the accelerate-right action — is 1/(1 + e^(-theta^T s)), and since taking the inner product of theta and s just gives you phi, that equals 1/(1 + e^(-phi)).

And so if I choose my parameters theta this way, what that means is that, just depending on the angle phi of my inverted pendulum, the chance of my accelerating to the right is just this function of the angle of my inverted pendulum. And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing makes sense: if my pole's leaning over to the right, then I wanna accelerate to the right to catch it; and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it.

So this is one example for one specific choice of parameters theta. Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then, depending on the velocity of the cart and the angular velocity, you might want to change theta to take these other effects into account as well. Maybe if the pole's leaning far to the right but is actually on its way to swinging back, as specified by the angular velocity, then you might be less worried about having to accelerate hard to the right. And so these are the sorts of behavior you can get by varying the parameters theta.
And so our goal in policy search is to tune the parameters theta so that when we execute the policy pi subscript theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi_theta. We want to choose parameters theta to maximize that.

Are there questions about the problem setup, and policy search and policy classes, or anything?

Student: In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters?

Instructor (Andrew Ng): Oh, yeah. Right. So what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use a softmax parameterization. Similar to softmax regression, which we saw earlier in the class, you may have a set of parameters theta_1 through theta_D if you have D actions, and pi(s, a_i) equals e^(theta_i^T s) divided by the sum over j of e^(theta_j^T s). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.
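[A minimal sketch of that softmax parameterization, assuming D discrete actions with one parameter vector theta_i per action; the Python and the names here are illustrative, not from the lecture.]

```python
import numpy as np

def softmax_policy(thetas, s):
    """pi_theta(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s).
    `thetas` is a D x n array whose i-th row is theta_i."""
    logits = thetas.dot(s)
    logits -= logits.max()        # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_action(thetas, s, rng=np.random):
    """Draw an action index according to the softmax probabilities."""
    return rng.choice(len(thetas), p=softmax_policy(thetas, s))
```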
Anything else?

Student: [Inaudible] stochastic?

Instructor (Andrew Ng): Yes.

Student: So is it possible to [inaudible] a stochastic policy using numbers [inaudible]?

Instructor (Andrew Ng): I see. Given that the MDP has stochastic transition probabilities, is it possible to use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. But formally, it is possible. If you have a fixed policy, then you'd be able to do that. Anything else? Yeah — I guess even a [inaudible] class of policies can do that, but for the derivation later, I actually need to keep them separate.

Actually, could you just raise your hand if this makes sense? I know the concept of policy search is sometimes a little confusing. Okay, thanks.

So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it sort of captures its essence. Here's the idea.

In what I'm about to do, I'm going to assume that s_0 is some fixed initial state. (It turns out that if s_0 is drawn from some fixed initial-state distribution, then everything else [inaudible], but let's just say s_0 is some fixed initial state.) So my goal is to maximize this expected sum of rewards [inaudible], given that we're executing the policy; I'll drop that from the notation. So the random variables in this expectation are the sequence of states and actions s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.

So let me write out this expectation explicitly as a sum over all possible state-action sequences of the probability of that sequence times the payoff — that's what an expectation is: the probability of the random variables times that. Let me just expand out this probability. So the probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state — if this is a deterministic initial state, then all the probability mass is on one state; otherwise, there's some distribution over initial states.
Then times the probability that you chose action a_0 from that state s_0, and then times the probability that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability that you chose the next action there, and so on; the last terms are the transition probability into the final state s_T and then the probability of choosing action a_T there. In other words, the probability of the sequence is P(s_0) · pi_theta(s_0, a_0) · P_{s_0 a_0}(s_1) · pi_theta(s_1, a_1) · ... · pi_theta(s_T, a_T). So what I did was just take this probability of seeing this sequence of states and actions and expand it out explicitly like this.

It turns out later on I'm going to need to write this sum of rewards a lot, so I'm just gonna call it the payoff from now on. So whenever later in this lecture I write the word payoff, I just mean this sum. And our goal is to maximize the expected payoff — our goal is to maximize this sum.

Let me actually just skip ahead. I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm. This is how we're going to update the parameters: we're going to sample a state-action sequence — the way you do this is you just take your current stochastic policy and execute it in the MDP; so just go ahead and start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and do that for T time steps, and that's how you sample the state-action sequence. Then you compute the payoff, and then you perform this update.
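[The update itself was written on the board and isn't captured in the transcript. Consistent with the derivation that follows, what is essentially the REINFORCE update is theta := theta + alpha · (sum over t of ∇_theta log pi_theta(s_t, a_t)) · payoff. A minimal sketch, where the environment interface and helper functions are assumptions about one way you might implement it:]

```python
import numpy as np

def reinforce_step(theta, alpha, env_reset, env_step, sample_action, grad_log_pi, T):
    """One stochastic-gradient-ascent step of (essentially) REINFORCE.

    env_reset()              -> initial state s0
    env_step(s, a)           -> (next state, reward), sampled from the MDP
    sample_action(theta, s)  -> an action drawn from pi_theta(s, .)
    grad_log_pi(theta, s, a) -> gradient of log pi_theta(s, a) w.r.t. theta
                                (equal to grad pi_theta(s, a) / pi_theta(s, a))
    """
    s = env_reset()
    grad_sum = np.zeros_like(theta)
    payoff = 0.0
    for _ in range(T + 1):                  # t = 0, 1, ..., T
        a = sample_action(theta, s)         # a_t ~ pi_theta(s_t, .)
        grad_sum += grad_log_pi(theta, s, a)
        s, reward = env_step(s, a)          # s_{t+1} ~ P_{s_t a_t}
        payoff += reward                    # payoff = sum of rewards
    # On expectation this step moves theta uphill on the expected payoff.
    return theta + alpha * grad_sum * payoff
```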
So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every step it updates theta according to this thing on the right-hand side, and that thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw. Your state-action sequence is random, so on every step I'll take a step that's chosen randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how it changes the parameters theta. In particular, I want to know what is the expected value of the change to my parameters theta. Our goal is to maximize the expected payoff, so as long as the updates, on expectation, are on average taking us uphill on the expected payoff, then we're happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember when I talked about stochastic gradient descent for least-squares regression, I said that you have some parameters — maybe you're trying to minimize a quadratic function — and the parameters will wander around randomly until they get close to the optimum of the [inaudible] quadratic surface. It turns out that the REINFORCE algorithm will be very much like that. It will be a stochastic gradient ascent algorithm in which the step we take on every iteration is a little bit random — it's determined by the random state-action sequence — but on expectation, this turns out to be essentially a gradient ascent algorithm. And so it'll wander around randomly, but on average take you towards the optimum.

So let me go ahead and prove that now. What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then I'll hopefully show that by deriving a gradient ascent update rule, I end up with this thing on expectation.
So before I do the derivation, let me just remind you of the chain rule — the product rule — for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is given by taking the derivatives of these things one at a time, so it's f'(theta) g(theta) h(theta) plus f(theta) g'(theta) h(theta) plus f(theta) g(theta) h'(theta). First I differentiate f, getting f-prime, leaving the other two fixed; then I differentiate g, leaving the other two fixed; then I differentiate h, getting h-prime, leaving the other two fixed. So that's the product rule for derivatives.

Now refer back to the equation where earlier we wrote out the expected payoff — this sum over all state-action sequences of the probability times the payoff. What I'm going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function — take the derivative of this function with respect to theta and then try to go uphill on it. So, using the product rule, when I take the derivative of this function with respect to theta, what I get is a sum of terms: there are a lot of terms here that depend on theta, and so I'll end up having one term that corresponds to the derivative of the first pi_theta factor, keeping everything else fixed, one term from the derivative of the next one, keeping everything else fixed, and so on, and one term from the derivative of the last factor, keeping everything else fixed. So just apply the product rule to this.

Let's write that down. So I have that the derivative with respect to theta of the expected value of the payoff is — it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how that goes, but this is a four-step derivation. So there's the sum over the state-action sequences as we saw before, then all of those derivative terms in a bracket —
close the bracket, and then times the payoff. So that huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta one at a time. This was the term with the derivative of the first pi_theta(s_0, a_0); there's the first derivative term, there's the second one, then you have plus dot-dot-dot, and that's my last term. So that was step one of four.

And so, by algebra — let me just write this down and convince us all that it's true — this is the second of four steps, in which you just convince yourself that if I take the sum in the bracket and multiply it by that big product in front, then I get back the same sum of terms. For example, when I multiply that product against the first fraction, the pi_theta(s_0, a_0) in the product cancels against the pi_theta(s_0, a_0) in the denominator and gets replaced with the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But that term on top is just what I worked out previously — it was the joint probability of the state-action sequence — and now I have that, times the bracketed sum, times the payoff. And so, by the definition of expectation, this is just equal to the expectation of that bracketed sum times the payoff. And this thing inside the expectation is exactly the step that we were taking in the inner loop of our REINFORCE algorithm — roughly the REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff. That's how I started this whole derivation: I said, let's look at our expected payoff and take the derivative of that with respect to theta.
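[The four steps of board work aren't captured in the transcript; written out, the identity being derived here is]

$$ \nabla_\theta\, \mathrm{E}[\text{payoff}] \;=\; \mathrm{E}\!\left[\left(\sum_{t=0}^{T} \frac{\nabla_\theta\, \pi_\theta(s_t, a_t)}{\pi_\theta(s_t, a_t)}\right)\cdot \text{payoff}\right] \;=\; \mathrm{E}\!\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t)\right)\cdot \text{payoff}\right], $$

[where the initial-state and state-transition probabilities drop out of the gradient because they don't depend on theta, and recombining them with the undifferentiated factors gives back the joint probability of the trajectory — that's the algebra step just described.]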
What we've proved is that on expectation, the step direction I take in REINFORCE is exactly the gradient of the thing I'm trying to optimize. This shows that this algorithm is a stochastic gradient ascent algorithm. I wrote a lot — why don't you take a minute to look at the equations and check whether everything makes sense. I'll erase a couple of boards and then check if you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

Some comments. We talked about those value function approximation approaches, where you approximate V-star and then go from V-star to pi-star, and there are also these policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable. It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi. So the question really is: for your problem, does there exist a simple function — like a linear function or a logistic function — that maps from features of the state to the action and works pretty well? For the inverted pendulum problem, this is quite likely to be true: going through the different choices of parameters, you can express things like, if the pole's leaning towards the right, then accelerate towards the right to try to catch it. So for the inverted pendulum, this is probably true. The same goes for lots of what are called low-level control tasks — things like driving a car, the low-level reflexes of do you steer your car left to avoid another car, do you steer your car left to follow the road, or flying a helicopter — again, very short-time-scale types of decisions. I like to think of these as the kinds of decisions a trained operator makes, like a trained driver or a trained pilot. They would almost be reflexes, these sorts of very quick, instinctive things where you map very directly from the input data to an action.
These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well. In contrast, if you have problems that require long, multi-step reasoning — things like a game of chess, where you have to reason carefully about, if I do this, then they'll do that, then I'll do this, then they'll do that — those I think of as less instinctual, very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead. Let me say more about this later.

Just as a side comment, it turns out also that if you have a POMDP, a partially observable MDP — I don't want to say too much about this — it turns out that if you only have an approximation, let's call it s-hat, of the true state — and this could be s-hat equal to s_{t|t} from a Kalman filter — then you can still use these sorts of policy search algorithms, where you use pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that only looks at your estimate of the state to choose the action. By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I wanna talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well but is often extremely slow.
So it [inaudible] works, but one thing to watch out for is that because you're taking these gradient ascent steps that are very noisy — you're sampling a state-action sequence and then taking a gradient ascent step in essentially a random direction that only on expectation is correct — the gradient ascent direction for REINFORCE can sometimes be a bit noisy. And so it's not that uncommon to need, like, a million iterations of gradient ascent, or ten million, or 100 million iterations of gradient ascent for REINFORCE [inaudible], so that's just something to watch out for.

One consequence of that is that in the REINFORCE algorithm — I shouldn't really call it REINFORCE; in what's essentially the REINFORCE algorithm — there's this step where you need to sample a state-action sequence. In principle, you could do this on your own robot: if there were a robot you were trying to control, you could actually physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to [inaudible] your robot ten million times. You can easily run ten thousand or ten million simulations of your robot, but you might not want to have your robot physically repeat some action ten million times. So I personally have seen many more applications of REINFORCE used to learn with a simulator than to actually do this on a physical device.

The last thing I wanted to do is tell you about one other algorithm, one final policy search algorithm. [Inaudible] the laptop display, please. It's a policy search algorithm called Pegasus, which is actually what we used on our autonomous helicopter flight work for many years; there are some other things we do now. So here's the idea. There's a reminder slide on the RL formalism — there's nothing here that you don't know, but I just want to pictorially describe the RL formalism because I'll use that later.
I'm gonna draw the reinforcement learning picture as follows. You initialize the system — say a helicopter or whatever — in some state s_0, you choose an action a_0, and then the, say, helicopter dynamics take you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you ended up in, and that's your total payoff. So this is just a picture I wanna use to summarize the RL problem. Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by a green box. So our policy — and I'll switch back to deterministic policies for now — my deterministic policy will be some function mapping from the states to the actions.

As a concrete example, you can imagine that in the policy search setting, you may have a linear class of policies. So you may imagine that the action a will be, say, a linear function of the state, and your goal is to learn the parameters of the linear function. So imagine trying to do linear regression on policies, except you're trying to optimize the reinforcement learning objective. So just imagine that the action a is theta transpose s, and you do policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies. So there's the idea.

Quite often we come up with a model or a simulator for the MDP. As before, a model or a simulator is just a box that takes as input some state s_t and some action a_t, and then outputs some next state s_{t+1} that you might end up in in the MDP. This s_{t+1} will be a random state: it will be drawn from the random state transition probabilities of the MDP. This is important — very important — s_{t+1} will be a random function of s_t and a_t. In the simulator, this is [inaudible].
So for example, for autonomous helicopter flight, you might build a simulator using supervised learning - an algorithm like linear regression, or some [inaudible] of linear regression - so that you get a nonlinear model of the dynamics of what ST+1 is as a random function of ST and AT. Now, once you have a simulator, given any fixed policy you can quite straightforwardly evaluate that policy in the simulator. Concretely, our goal is to find a policy pi mapping from states to actions - the goal is to find the green box like that - that works well. So if you have any one fixed policy pi, you can evaluate it just using the simulator, via the picture shown at the bottom of the slide. Concretely, you take your initial state S0 and feed it into the policy pi; the policy outputs some action A0; you plug that into the simulator; the simulator outputs a random state S1; you feed S1 into the policy; and so on, and you get a sequence of states S0 through ST that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation, see how well it does, and measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once. In general, you would run the policy through the simulator some number of times and average over, say, M simulations to get a better estimate of the policy's expected payoff. So now, given any one fixed policy, this gives me a way to evaluate its expected payoff. One reasonably obvious thing you might then try to do is search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff.
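[Editor's note: a sketch of that evaluation loop, assuming a hypothetical simulator(s, a, rng) that returns a random next state, a reward function R(s, a), a horizon T, and M rollouts; none of these names come from the lecture itself.]

```python
import numpy as np

def estimate_payoff(theta, simulator, R, s_init, T, M, seed=0):
    """Monte Carlo estimate of a policy's expected payoff: run M rollouts of
    the linear policy a = theta^T s through the random simulator, sum the
    rewards along each rollout, and average."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(M):
        s = np.asarray(s_init, dtype=float)
        payoff = 0.0
        for _ in range(T):
            a = theta.T @ s              # deterministic linear policy
            payoff += R(s, a)            # accumulate reward along the trajectory
            s = simulator(s, a, rng)     # random next state drawn by the model
        total += payoff
    return total / M
```

Because the simulator is random, calling estimate_payoff twice on the same theta (with different random draws) will generally return slightly different numbers - which is exactly the difficulty discussed next.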
Does that make sense? So my policy has some parameters theta; say my policy is that my action A is equal to theta transpose S, if it's a linear policy. For any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator. One thing I might try to do is search for parameters theta that maximize my estimated payoff. It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate a policy, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a very slightly different answer. So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies. In other words, as I vary the parameters theta in my policy, I get different policies and so I move along the X axis, and I'm going to plot the total payoff on the vertical axis; the red line in this cartoon is the expected payoff of the different policies. My goal is to find the policy with the highest expected payoff. You could search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that just by chance looked a little bit better than it should have. If I evaluate a second policy, just by chance it might look a little bit worse. I evaluate a third policy, a fourth, and sometimes I might actually evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random: when I run the same policy twice in simulation, I might get back slightly different answers. So as I evaluate more and more policies, these are the pictures I get. My goal is to try to optimize the red line. I hope you appreciate that this is a hard problem, especially when all the optimization algorithm gets to see are these black dots; it doesn't have direct access to the red line.
And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a 10-D space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem. So it turns out there's one way to make this optimization problem much easier. Here's the idea. The method is called Pegasus, which is an acronym for something; I'll tell you later. So the simulator repeatedly makes calls to a random number generator to generate random numbers RT, which are used to simulate the stochastic dynamics. What I mean by that is that the simulator takes as input a state and an action, and it outputs the next state randomly, and if you peer into the simulator - if you open the box of the simulator and ask how it is generating these random next states ST+1 - pretty much the only way to write a simulator with random outputs is to make calls to a random number generator to get random numbers, these RTs, and then have some function that takes as input S0, A0, and the result of your random number generator, and computes the next state as a function of those inputs and of the random number it got from the random number generator. This is pretty much the only way anyone implements any random code, any code that generates random outputs: you make a call to a random number generator, and you compute some function of the random number and of your other inputs. The reason that when you evaluate different policies you get different answers is that every time you rerun the simulator, you get a different sequence of random numbers from the random number generator, and so you get a different answer every time, even if you evaluate the same policy twice. So here's the idea. Let's say we fix in advance the sequence of random numbers: choose R1, R2, up to RT minus one.
Fix the sequence of random numbers in advance, and always use that same sequence of random numbers to evaluate different policies. Go into your code and fix R1, R2, through RT minus one: choose them randomly once and then fix them forever. If you always use the same sequence of random numbers, then the system is no longer random, and if you evaluate the same policy twice, you get back exactly the same answer. These sequences of random numbers are called scenarios, and Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios. When you do that, this is the picture you get. As before, the red line is your expected payoff, and by fixing the random numbers, you've defined some estimate of the expected payoff. As you evaluate different policies, these are still only approximations to their true expected payoff, but at least now you have a deterministic function to optimize, and you can apply your favorite algorithm, be it gradient ascent or some sort of local search, to try to optimize the black curve. This gives you a much easier optimization problem, and you can optimize it to find, hopefully, a good policy. So this is the Pegasus policy search method. When I started to talk about reinforcement learning, I showed that video of a helicopter flying upside down. That was actually done using exactly this method, exactly this policy search algorithm. This seems to scale well even to fairly large problems, even to fairly high-dimensional state spaces. With Pegasus, the optimization problem is much easier than the stochastic version, but it's sometimes still not entirely trivial, so you typically apply this sort of method with maybe on the order of ten parameters, or tens of parameters - say 30 or 40 parameters - but not thousands of parameters, at least for these sorts of problems.
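[Editor's note: a minimal sketch of Pegasus-style evaluation, assuming the simulator has been refactored into a deterministic function simulator_det(s, a, r) that takes the random numbers as an explicit input; the function names, shapes, and scenario count are illustrative assumptions.]

```python
import numpy as np

def make_scenarios(M, T, noise_dim, seed=0):
    """Sample M 'scenarios' (sequences of random numbers) once, then keep them fixed."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((M, T, noise_dim))

def pegasus_payoff(theta, simulator_det, R, s_init, scenarios):
    """Deterministic payoff estimate: average the rollout payoff over a fixed
    set of scenarios. Evaluating the same theta twice returns the same number."""
    total = 0.0
    for scenario in scenarios:               # scenario = fixed r_0, ..., r_{T-1}
        s = np.asarray(s_init, dtype=float)
        payoff = 0.0
        for r_t in scenario:
            a = theta.T @ s                  # linear policy
            payoff += R(s, a)
            s = simulator_det(s, a, r_t)     # next state as a deterministic function of (s, a, r_t)
        total += payoff
    return total / len(scenarios)
```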
Student: So is that method different than just assuming that you know your simulator exactly - just throwing away all the random numbers entirely?

Instructor (Andrew Ng): So is this different from assuming that we have a deterministic simulator? The answer is no. Here's how you actually do this: for the sake of simplicity I talked about one sequence of random numbers. Imagine that the random numbers are simulating different wind gusts against your helicopter. What you want isn't really to evaluate against just one pattern of wind gusts; what you want to do is sample some set of different patterns of wind gusts and evaluate against all of them on average. So what you actually do is sample, say, 100 sequences of random numbers - some number I made up, like 100 - and every time you want to evaluate a policy, you evaluate it against all 100 sequences of random numbers and then average. This is in exactly the same way that, on this earlier picture, you wouldn't necessarily evaluate the policy just once; you evaluate it maybe 100 times in simulation and then average to get a better estimate of the expected reward. In the same way, you do that here, but with 100 fixed sequences of random numbers. Does that make sense? Any other questions?

Student: If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]?

Instructor (Andrew Ng): Yeah, I guess you're right. For a fixed policy, the quality of the approximation is equally good in both cases. The advantage of fixing the random numbers is that you end up with an optimization problem that's much easier. I have some search problem: on the horizontal axis there's a space of control policies, and my goal is to find a control policy that maximizes the payoff. The problem with the earlier setting was that when I evaluate policies I get these noisy estimates, and then it's just very hard to optimize the red curve if I have these points that are all over the place.
And if I evaluate the same policy twice, I don't even get back the same answer. By fixing the random numbers, the algorithm still doesn't get to see the red curve, but at least it's now optimizing a deterministic function. That makes the optimization problem much easier. Does that make sense?

Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves that are easy to optimize. And then you smush them together?

Instructor (Andrew Ng): Let's see. I have just one nice black curve that I'm trying to optimize.

Student: For each scenario.

Instructor (Andrew Ng): I see. So I'm going to average over M scenarios - say, average over 100 scenarios - so the black curve here is defined by averaging over a large set of scenarios. Does that make sense? If the averaging doesn't make sense, imagine that there's just one sequence of random numbers; that might be easier to think about. Fix one sequence of random numbers, and every time I evaluate another policy, I evaluate it against that same sequence of random numbers, and that gives me a nice deterministic function to optimize. Any other questions?

The advantage is really that - one way to think about it is - when I evaluate the same policy twice, at least I get back the same answer. This gives me a deterministic function mapping from the parameters of my policy to my estimate of the expected payoff. That's just a function that I can try to optimize using a search algorithm. So we used this algorithm for inverted hovering, and again, policy search algorithms tend to work well when you can find a reasonably simple policy mapping from the states to the actions. This is especially true for low-level control tasks, which I think of as almost reflexes.
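[Editor's note: the lecture only says to use "your favorite algorithm, be it gradient ascent or some sort of local search" on the now-deterministic objective. As one hedged illustration, here is a simple hill-climbing sketch over theta; the step size, iteration count, and the objective wiring are my own assumptions.]

```python
import numpy as np

def local_search(objective, theta0, n_iters=500, step=0.05, seed=0):
    """Simple hill climbing over policy parameters.
    'objective' should be the deterministic Pegasus estimate, e.g.
        objective = lambda th: pegasus_payoff(th, simulator_det, R, s_init, scenarios)
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    best = objective(theta)
    for _ in range(n_iters):
        candidate = theta + step * rng.standard_normal(theta.shape)  # random local perturbation
        value = objective(candidate)   # same fixed scenarios every call, so this is deterministic
        if value > best:               # keep the move only if the estimated payoff improves
            theta, best = candidate, value
    return theta, best
```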
Conversely, if you want to solve a problem like Tetris, where you might plan ahead a few steps about what's a nice configuration of the board, or something like a game of chess, or long path-planning problems - go here, then go there, then go there - then you might sometimes apply a value function method instead. But for tasks like helicopter flight, for low-level control tasks, for the reflexes of driving or controlling various robots, policy search algorithms - where we can sometimes more easily approximate the policy directly - work very well.

A few words on the state of RL today. RL algorithms are applied to a wide range of problems, and the key is really sequential decision making. The place where you think about applying a reinforcement learning algorithm is when you need to make a decision, then another decision, then another decision, and some of your actions may have long-term consequences. I think that is the heart of RL: sequential decision making, where you make multiple decisions and some of your actions may have long-term consequences. I've shown you a bunch of robotics examples. RL is also applied to things like medical decision making, where you may have a patient and you want to choose a sequence of treatments: you do this now for the patient, the patient may then be in some other state, you choose to do that later, and so on. It turns out there's a large community of people applying these sorts of tools to queues. Imagine you have a bank where people line up, and after they go to one cashier, some of them have to go to the manager to deal with something else. You have a system of multiple people standing in multiple queues, and the question is how to route people optimally to minimize the waiting time - and not just people, but objects in an assembly line and so on. It turns out there's a surprisingly large community working on optimizing queues. I mentioned game playing a little bit already.
There are things like financial decision making: if you have a large amount of stock, how do you time the selling off of your stock so as to not affect market prices adversely too much? There are many operations research problems, things like factory automation: can you design a factory to optimize throughput, or minimize cost, or whatever? These are all sorts of problems that people are applying reinforcement learning algorithms to.

Let me just close with a few robotics examples, because they're always fun, and we just have these videos. This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students here. They were working on getting a robot dog to climb over difficult, rugged terrain. Using a reinforcement learning algorithm, they applied an approach that's similar to a value function approximation approach - not quite, but similar. They allowed the robot dog to plan ahead multiple steps, carefully choose its footsteps, and traverse rugged terrain. This is really the state of the art in terms of what you can get a robotic dog to do.

Here's another fun one. It turns out that wheeled robots are very fuel-efficient; cars and trucks are almost the most fuel-efficient robots in the world. Cars and trucks are very fuel-efficient, but legged robots have the ability to traverse more rugged terrain. So this is a robot - actually a small-scale mockup of a larger vehicle built by Lockheed Martin - and the question is: can you come up with a vehicle that has wheels, and so has the fuel efficiency of wheeled robots, but also has legs so it can traverse obstacles?
Again, they used a reinforcement learning algorithm to design a controller for this robot to make it traverse obstacles, with somewhat complex gaits that would be very hard to design by hand. But by choosing a reward function - telling the robot there's a plus-one reward at the top of the goal, and a few other things - it learns these sorts of policies automatically.

A couple of last fun ones - I'll show you a couple of last helicopter videos. This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, who have been working on autonomous flight. I'll just let you watch. I'll just show you one more.

Student: [Inaudible] do this with a real helicopter [inaudible]?

Instructor (Andrew Ng): Not a full-size helicopter - only small radio-controlled helicopters.

Student: [Inaudible].

Instructor (Andrew Ng): Full-size helicopters are the wrong design for this. You shouldn't do this on a full-size helicopter; many full-size helicopters would fall apart if you tried to do this. Let's see, there's one more.

Student: Are there any human [inaudible]?

Instructor (Andrew Ng): Yes, there are very good human pilots that can. This is just one more maneuver. That was kind of fun. So this is the work of Pieter Abbeel and Adam Coates.

So that was it - that was all the technical material I wanted to present in this class. I guess you're all experts on machine learning now. Congratulations. I hope you've gotten the sense through this class that this is one of the technologies that's really having a huge impact on science, engineering, and industry. As I said in the first lecture, I think many people use machine learning algorithms dozens of times a day without even knowing it. Based on the projects you've done, I hope that most of you will be able to imagine yourselves going out after this class and applying these things to solve a variety of problems.
Hopefully, some of you will also imagine yourselves writing research papers after this class, be it on a novel way to do machine learning, or on some way of applying machine learning to a problem that you care about. In fact, looking at the project milestones, I'm actually sure that a large fraction of the projects in this class will be publishable, so I think that's great. I guess many of you will also go into industry, make new products, and make lots of money using learning algorithms. Remember me if that happens. One of the things I'm excited about is that, through research or industry, I'm actually completely sure that the people in this class will, in the next few months, apply machine learning algorithms to lots of problems: in industrial management and computer science, things like optimizing computer architectures, network security, robotics, computer vision, problems in computational biology, problems in aerospace, understanding natural language, and many more things like that. So right now I have no idea what all of you are going to do with the learning algorithms you've learned about, but every time I wrap up this class, I always feel very excited, and optimistic, and hopeful about the sorts of amazing things you'll be able to do.

One final thing - I'll just give out this handout. One final thing is that machine learning has grown out of a larger literature on AI, out of this desire to build systems that exhibit intelligent behavior. Machine learning is one of the tools of AI - maybe one that's had a disproportionately large impact - but there are many other ideas in AI that I hope you go and continue to learn about.
Fortunately, Stanford has one of the best and broadest sets of AI classes, and I hope that you take advantage of some of these classes and go and learn more about AI, and more about other fields that often apply learning algorithms to problems in vision, natural language processing, robotics, and so on. The handout I just gave out has a list of AI-related courses. Running down it very quickly: CS221 is an overview class that I teach. There are a lot of robotics classes as well - 223A, 225A, 225B - and there are so many applications of learning algorithms to robotics today. 222 and 227 are knowledge representation and reasoning classes. Of all the classes on this list, 228, which Daphne Koller teaches, is probably closest in spirit to 229; it's one of the classes I highly recommend to all of my PhD students as well. Many other classes also touch on machine learning. On the next page there are courses on computer vision, speech recognition, and natural language processing - various areas that aren't just machine learning, but often involve machine learning in many ways - and other aspects of AI, such as multi-agent systems taught by [inaudible]. EE364A is convex optimization, a class taught by Steve Boyd, and convex optimization came up many times in this class; if you want to become really good at it, EE364A is a great class. If you're interested in project courses, I also teach a project class next quarter where we spend the whole quarter working on research projects. So I hope you go and take some more of those classes.

In closing, let me just say that this class has been really fun to teach, and it's very satisfying to me personally when we set these insanely difficult homeworks and then see a solution, and I think, "Oh my god, they actually figured that one out." It's actually very satisfying when I see that. Or looking at the milestones, I often go, "Wow, that's really cool - I bet this would be publishable."
So I hope you take what you've learned, go forth, and do amazing things with learning algorithms. I know this is a heavy-workload class, so thank you all very much for the hard work you've put into this class and into learning this material, and thank you very much for having been students in this class.