This presentation is delivered by the Stanford Center for Professional Development.

All right, so welcome back. What I want to do today is start a new chapter, a new discussion on machine learning, and in particular I want to talk about a different type of learning problem called reinforcement learning. So that's Markov Decision Processes, value functions, value iteration, and policy iteration. Both of these last two items are algorithms for solving reinforcement learning problems. As you can see, we're also taping in a different room today, so the background is a bit different.

So just to put this in context, the first of the four major topics we had in this class was supervised learning, and in supervised learning, we had a training set in which we were given sort of the right answer for every training example, and it was then just the job of the learning algorithm to replicate more of the right answers. Then there was learning theory, and then we talked about unsupervised learning, and in unsupervised learning, we had just a bunch of unlabeled data, just the x's, and it was the job of the learning algorithm to discover so-called structure in the data, and we saw several algorithms like clustering, K-means, mixtures of Gaussians, PCA, ICA, and so on.

Today, I want to talk about a different class of learning algorithms that's sort of in between supervised and unsupervised. This will be a class of problems where there is some level of supervision, but also much less supervision than what we saw in supervised learning. And this is the problem and formalism called reinforcement learning.

So let me switch to the slides. As a motivating example, here's an example of the sorts of things we do with reinforcement learning. Here's a picture of some of this — I talked about this in Lecture 1 as well — but here's a picture of the autonomous helicopter we have at Stanford. So how would you write a program to make a helicopter like this fly by itself? I'll show you a fun video.
This is actually, I think, the same video that I showed in class in the first lecture, but here's a video taken on the football field at Stanford of using a machine learning algorithm to fly the helicopter. So let's just play the video. You can zoom in the camera and see the trees and the sky.

So in terms of autonomous helicopter flight, this was done by some of my students and me. In terms of autonomous helicopter flight, this is one of the most difficult aerobatic maneuvers flown, and it's actually very hard to write a program to make a helicopter do this. The way this was done was with what's called a reinforcement learning algorithm.

So just to make this more concrete, the learning problem in helicopter flight is: ten times per second, say, your sensors on the helicopter give you a very accurate estimate of the position and orientation of the helicopter, so you know where the helicopter is pretty accurately at all points in time. And your job is to take this input, the position and orientation, and to output a set of numbers that correspond to where to move the control sticks to control the helicopter — to make it fly right side up, fly upside down, or do whatever maneuver you want.

And this is different from supervised learning, because usually we actually don't know what the right control action is. More specifically, if the helicopter is in a certain position and orientation, it's actually very hard to say that when the helicopter is doing this, you should move the control sticks to exactly these positions. So it's very hard to apply supervised learning algorithms to this problem, because we can't come up with a training set where the input is the position and the output is all the right control actions — it's really hard to come up with a training set like that. Instead, in reinforcement learning, we'll give the learning algorithm a different type of feedback, called a reward signal, which will tell the helicopter when it's doing well and when it's doing poorly.
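Just to make the shape of that problem concrete, here is a minimal sketch of the sense-act loop described above, in Python. Everything in it is hypothetical: `read_position_and_orientation`, `choose_controls`, and `send_to_control_sticks` are placeholder names standing in for the helicopter's sensor interface, the learned controller, and the actuator interface; they are not part of any real API mentioned in the lecture.

```python
import time

def control_loop(read_position_and_orientation, choose_controls,
                 send_to_control_sticks, hz=10):
    """Run the sense-act loop sketched above, roughly `hz` times per second.

    All three arguments are hypothetical callables:
      - read_position_and_orientation(): returns the current state estimate
      - choose_controls(state): the learned mapping from state to control inputs
      - send_to_control_sticks(controls): pushes the chosen controls to the helicopter
    """
    period = 1.0 / hz
    while True:
        state = read_position_and_orientation()   # where the helicopter is right now
        controls = choose_controls(state)          # the part reinforcement learning must supply
        send_to_control_sticks(controls)
        time.sleep(period)
```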
So what we'll end up doing is coming up with something called a reward signal — and I'll formalize this later — which will be a measure of how well the helicopter is doing, and then it will be the job of the learning algorithm to take just this reward function as input and try to fly well.

Another good example of reinforcement learning is thinking about getting a program to play a game, to play a game of chess. At any stage in the game, we actually don't know what the optimal move is, so it's very hard to pose playing chess as a supervised learning problem, because we can't say the x's are the board positions and the y's are the optimal moves — we just don't know how to create any training examples of optimal moves in chess. But what we do know is, if you have a computer playing games of chess, we know when it's won a game and when it's lost a game. So what we do is we give it a reward signal: give it a positive reward when it wins a game of chess and give it a negative reward whenever it loses, and hopefully have it learn to win more and more games by itself over time.

So the way I'd like you to think about reinforcement learning is to think about training a dog. Every time your dog does something good, you sort of tell it, "Good dog," and every time it does something bad, you tell it, "Bad dog," and over time, your dog learns to do more and more good things. In the same way, when we try to fly a helicopter, every time the helicopter does something good, you say, "Good helicopter," and every time it crashes, you say, "Bad helicopter," and then over time, it learns to do the right things more and more often.

One of the reasons that reinforcement learning is much harder than supervised learning is that this is not a one-shot decision making problem. In supervised learning, if you have a classification problem — predicting whether someone has cancer or not — you make a prediction and then you're done, right? Your patient either has cancer or not, you're either right or wrong, they live or die, whatever. You make a decision and then you're done. In reinforcement learning, you have to keep taking actions over time, so it's called sequential decision making.
So concretely, suppose a program loses a game of chess on move number 60. Then it has actually made 60 moves before it got this negative reward of losing the game of chess, and the thing that makes it hard for the algorithm to learn from this is called the credit assignment problem.

Just to state that informally: if the program loses a game of chess on move 60, you're actually not quite sure, of all the moves it made, which ones were the right moves and which ones were the bad moves. Maybe it's because you blundered on move number 23, and everything else you did may have been perfect, but because you made a mistake on move 23 of your game of chess, you eventually end up losing on move 60. So, to define it very loosely, the credit assignment problem is: when you get a positive or negative reward, to figure out what you actually did right or did wrong to cause that reward, so you can do more of the right things and less of the wrong things. And this is one of the things that makes reinforcement learning hard.

In the same way, if the helicopter crashes, you may not know why. It may be something you did many minutes ago that caused the helicopter to crash. In fact, if you ever crash a car — and hopefully none of you ever get in a car accident — but when someone crashes a car, usually the thing they're doing right before they crash is stepping on the brakes to slow the car down before the impact. And usually stepping on the brakes does not cause the crash; rather, it makes the crash sort of hurt less. But a reinforcement learning algorithm sees the pattern that you step on the brakes and then you crash, and it's hard for it to figure out that it's not actually your stepping on the brakes that caused the crash, but something you did long before that.
So let me go ahead and formalize the reinforcement learning problem more, and as a preface, let me say that these algorithms are applied to a broad range of problems, but because robotics videos are easy to show in lecture — I have a lot of them — throughout this lecture I'll use a bunch of robotics examples. Later, we'll talk about applications of these ideas to broader ranges of problems as well. But the basic problem I'm facing is sequential decision making: we need to make many decisions, and our decisions perhaps have long term consequences.

So let's formalize the reinforcement learning problem. Reinforcement learning problems model the world using something called an MDP, or Markov Decision Process, formalism. And let's see, an MDP is a five-tuple — I don't have enough space — well, comprising five things: (S, A, {Psa}, γ, R). So let me say what these are. Actually, could you please raise the screen? I won't need the laptop anymore today. [Inaudible] more space. Yep, great. Thanks.

So an MDP comprises a five-tuple. The first of these elements, S, is a set of states, and so for the helicopter example, the set of states would be the possible positions and orientations of the helicopter. A is a set of actions; again, for the helicopter example, this would be the set of all possible positions that we could put our control sticks into. The Psa are the state transition distributions. So for each state and each action, Psa is a probability distribution: the sum over s' of Psa(s') equals 1, and Psa(s') is greater than or equal to zero. And the state transition distributions — or state transition probabilities — work as follows: P subscript (s, a) gives me the probability distribution over what state I will transition to, what state I wind up in, if I take an action a in a state s.
So this is the probability distribution over the states s' that I get to when I take the action a in the state s. Now, I'll come back to this in a second. Gamma is a number called the discount factor — don't worry about this yet, I'll say what this is in a second. It's usually a number strictly less than 1 and greater than or equal to zero. And R is our reward function, so the reward function maps from the set of states to the real numbers, and it can be positive or negative.

So just to make these elements concrete, let me give a specific example of an MDP. Rather than talking about something as complicated as helicopters, I'm going to use a much smaller MDP as the running example for the rest of today's lecture; we'll look at much more complicated MDPs in subsequent lectures. This is an example that I adapted from the textbook by Stuart Russell and Peter Norvig called Artificial Intelligence: A Modern Approach (Second Edition). And this is a small MDP that models a robot navigation task, in which you imagine you have a robot that lives in this grid world, where the shaded-in cell is an obstacle, so the robot can't go onto that cell.

And so, let's see. I would really like the robot to get to this upper-right corner cell, let's say, so I'm going to associate that cell with a +1 reward, and I'd really like it to avoid this other grid cell, so I'm going to associate that grid cell with a -1 reward. So let's actually iterate through the five elements of the MDP and see what they are for this problem.

The robot can be in any of these eleven positions, and so I have an MDP with 11 states: the set capital S corresponds to the 11 places it could be in. And let's say my robot, in this highly simplified example, can try to move in each of the compass directions, so in this MDP I'll have four actions, corresponding to moving in each of the North, South, East, West compass directions.
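As a running sketch in Python (my own encoding, not anything written on the board), the 11 states can be represented as (x, y) cells on the 4-by-3 grid with the obstacle removed, and the four actions as compass directions. The lecture only says the obstacle is the shaded-in cell; placing it at (2,2), as in the Russell and Norvig version of this example, is an assumption.

```python
# The 4x3 grid world: states are (x, y) cells, with x in 1..4 and y in 1..3.
# The shaded obstacle cell is assumed to be (2, 2), as in the Russell & Norvig example.
OBSTACLE = (2, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != OBSTACLE]
ACTIONS = ['N', 'S', 'E', 'W']   # the four compass-direction actions

assert len(STATES) == 11         # eleven states, as in the lecture
```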
Let's say that my robot's dynamics are noisy. If you've worked in robotics before, you know that if you command a robot to go north, then because of wheel slip or imprecision in the actuation or whatever, there's a small chance that your robot will end up slightly off to one side. So you command your robot to move forward one meter, and usually it will move forward somewhere between, like, 95 centimeters and 105 centimeters.

So in this highly simplified grid world, I'm going to model the stochastic dynamics of my robot as follows. I'm going to say that if you command the robot to go north, there's actually a 10 percent chance that it will accidentally veer off to the left and a 10 percent chance that it will veer off to the right, and only a 0.8 chance that it will manage to go in the direction you commanded. This is sort of a crude model of wheel slip on the robot. And if the robot bounces off a wall, then it just stays where it is and nothing happens.

So let's see. Concretely, we would write this down using the state transition probabilities. For example, let's take the state — let me call it the (3,1) state — and let's say you command the robot to go north. To specify these noisy dynamics of the robot, you would write down state transition probabilities as follows. You say that if you're in the state (3,1) and you take the action north, your chance of getting to (3,2) is 0.8: P_{(3,1),N}((3,2)) = 0.8. If you're in the state (3,1) and you take the action north, the chance of getting to (4,1) is 0.1: P_{(3,1),N}((4,1)) = 0.1. And so on. And this last line says that if you're in the state (3,1) and you take the action north, the chance of you getting to the state (3,3) is zero — your chance of transitioning in one time step to the state (3,3) is equal to zero: P_{(3,1),N}((3,3)) = 0. So these are the state transition probabilities for my MDP.
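Here is a minimal sketch of those noisy dynamics, building on the `STATES`, `ACTIONS`, and `OBSTACLE` names from the sketch above. The "veer left / veer right" convention (left and right relative to the commanded direction) is my reading of the lecture, and the helper names are mine.

```python
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
LEFT_OF  = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
RIGHT_OF = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

def step(state, direction):
    """Deterministic move; if you'd hit a wall or the obstacle, you stay where you are."""
    dx, dy = MOVES[direction]
    x, y = state[0] + dx, state[1] + dy
    if 1 <= x <= 4 and 1 <= y <= 3 and (x, y) != OBSTACLE:
        return (x, y)
    return state

def transition_probs(state, action):
    """P_{sa}: distribution over next states s' -- 0.8 as commanded, 0.1 veer left, 0.1 veer right."""
    dist = {}
    for direction, p in [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]:
        s_next = step(state, direction)
        dist[s_next] = dist.get(s_next, 0.0) + p
    return dist

# The example from the board: commanding North from (3,1)
# gives {(3,2): 0.8, (4,1): 0.1, (2,1): 0.1}, and P_{(3,1),N}((3,3)) = 0.
print(transition_probs((3, 1), 'N'))
```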
The last two elements of my five-tuple are gamma and the reward function. Let's not worry about gamma for now. My reward function would be as follows. I really want the robot to get to the (4,3) state — I'm indexing the states using the numbers I wrote at the sides of the grid — so my reward for the (4,3) state is +1, and my reward for the (4,2) state is -1. And, as is fairly common practice in navigation tasks, for all the other states — the non-terminal states — I'm going to associate a small negative reward, -0.02, and you can think of this as a small negative reward that charges my robot for its battery or fuel consumption for wandering around. A small negative reward like this, which charges the robot for running around randomly, tends to cause the system to compute solutions that don't waste time and that make their way to the goal as quickly as possible, because the robot is charged for fuel consumption.

Okay. So, well, let me just mention that there's actually one other complication that I'm going to sort of not worry about. In this specific example, I'm going to assume that when the robot gets to the +1 or the -1 reward, then the world ends. So you get the +1 and then that's it — the world ends; there are no more rewards, positive or negative, after that. There are various ways to model that. One way to think about it is to imagine there's actually a 12th state, something called a zero-cost absorbing state, so that whenever you get to the +1 or the -1, you then transition with probability one to this 12th state, and you stay in this 12th state forever with no more rewards. I just mention that: when you get to the +1 or the -1, think of the problem as finishing. The reason I do that is because it makes some of the numbers come out nicer and easier to understand. So sometimes you hear the term zero-cost absorbing state — it's another state such that when you enter it, there are no more rewards, and you always stay in that state forever. All right.
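Continuing the same sketch, the reward function and discount factor might be encoded as follows. The -0.02 per-step reward is the figure the lecture mentions a little later; treating (4,3) and (4,2) as terminal (equivalently, as feeding into the zero-cost absorbing state) is handled here with a simple `TERMINAL` set.

```python
GAMMA = 0.99                      # discount factor; "a number like 0.99", as in the lecture
TERMINAL = {(4, 3), (4, 2)}       # once here, the world ends (zero-cost absorbing state)

def reward(state):
    """R: S -> real numbers."""
    if state == (4, 3):
        return +1.0               # the goal cell
    if state == (4, 2):
        return -1.0               # the cell to avoid
    return -0.02                  # small charge for fuel/battery on every other step
```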
So let's just see how an MDP works, and it works as follows. At time 0, your robot starts off at some state s0, and depending on where you are, you get to choose an action a0 — to decide to go North, South, East, or West. Depending on your choice, you get to some state s1, which is going to be randomly drawn from the state transition distribution indexed by the state s0 and the action a0 you just chose. So the next state you get to depends — in a probabilistic way — on the previous state and the action you just took. After you get to the state s1, you get to choose a new action a1, and then as a result of that, you get to some new state s2, drawn sort of randomly from the state transition distributions, and so on. Okay?

So after your robot does this for a while, it will have visited some sequence of states s0, s1, s2, and so on, and to evaluate how well we did, we'll take the reward function and we'll apply it to the sequence of states, and add up the sum of rewards that your robot obtained on the sequence of states it visited. You start at s0, take an action, get to s1, take an action, get to s2, and so on. So you apply the reward function to every state in the sequence, and this is the sum of rewards you obtain. Let me modify this one more bit: we'll multiply the second term by gamma, the next term by gamma squared, the next term by gamma cubed, and so on:

R(s0) + γ·R(s1) + γ²·R(s2) + ⋯

And this is called — I'm going to call this — the total payoff for the sequence of states s0, s1, s2, and so on that your robot visited.
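In code, the total payoff of a visited state sequence is just this discounted sum; here is a minimal sketch using the `reward` and `GAMMA` names defined in the earlier sketch, with an arbitrary example trajectory of my own choosing.

```python
def total_payoff(state_sequence):
    """R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for the states the robot visited."""
    return sum((GAMMA ** t) * reward(s) for t, s in enumerate(state_sequence))

# e.g. a short trajectory that happens to end in the +1 state:
print(total_payoff([(3, 1), (3, 2), (3, 3), (4, 3)]))
```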
And let me also say what gamma is. Gamma, the discount factor, is a number that's usually less than one; you usually think of gamma as a number like 0.99. So the effect of gamma is that the reward you obtain at time 1 is given a slightly smaller weight than the reward you get at time zero, and the reward you get at time 2 is even a little bit smaller than the reward you get at the previous time step, and so on.

Let's see. And so if this is an economic application — if you're doing, say, stock market trading with this kind of algorithm, or some other economic application — then your rewards are dollars earned and lost, and this discount factor gamma has a very natural interpretation as the time value of money, because a dollar today is worth slightly more than a dollar tomorrow, since a dollar in the bank can earn a little bit of interest. And conversely, having to pay out a dollar tomorrow is better than having to pay out a dollar today. So in other words, the effect of this discount factor gamma is to weight wins or losses in the distant future less than wins or losses in the immediate future.

And so the goal of the reinforcement learning algorithm is to choose actions over time — to choose actions a0, a1, and so on — to try to maximize the expected value of this total payoff:

E[R(s0) + γ·R(s1) + γ²·R(s2) + ⋯]

More concretely, what we will try to do is have our reinforcement learning algorithms compute a policy, which I denote by lowercase pi (π). And all a policy is — the definition of a policy — is a function mapping from the states to the actions. So a policy tells us, for every state, what action it recommends we take in that state.
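In the running sketch, a policy is nothing more than a table from states to actions. The particular choice below is arbitrary, purely to illustrate the data structure — it is not the optimal policy drawn on the board.

```python
# A policy pi: S -> A, here just a dictionary over the STATES from the earlier sketch.
pi = {s: 'N' for s in STATES}     # "always try to go North" ...
pi[(1, 3)] = 'E'                  # ... except head East along the top row toward +1
pi[(2, 3)] = 'E'
pi[(3, 3)] = 'E'

print(pi[(3, 1)])                 # the action this policy recommends in state (3,1)
```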
So concretely, here is an example of a policy. And this actually turns out to be the optimal policy for this MDP — I'll tell you later how I computed it. So this is an example of a policy: a policy is just a mapping from the states to the actions, and so this policy tells me that when you're in this state, you should take the left action, and so on. And this particular policy I drew out happens to be an optimal policy, in the sense that when you execute this policy, it will maximize your expected value of the total payoff — it will maximize your expected total sum of discounted rewards.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, so, can a policy be over multiple states — can it be a function of not only the current state, but the state I was in previously as well? The answer is yes. Sometimes people call those strategies instead of policies, but usually we're going to use policies. It actually turns out that for an MDP, allowing policies that depend on previous states will not allow you to do any better, at least in the limited context we're talking about. So in other words, there exists a policy that only ever looks at the current state and that maximizes my expected total payoff. This statement won't be true for some of the richer models we'll talk about later, but for now, it suffices to just look at the current states and actions.

And sometimes I'll use the phrase "execute a policy" to mean that I'm going to take actions according to the policy. So to execute the policy π means that whenever I'm in some state s, I'm going to take the action that the policy π outputs when given the current state.

All right. So it turns out that one of the things MDPs are very good at is making subtle tradeoffs. All right, let's look at this state: the optimal policy in this state is to go left. This probably wasn't very obvious — why is it that the optimal action is to go left and take a longer path there?
The alternative would be to go north and try to find a much shorter path to the +1 state. But when you're in this state over here — this, I guess, (3,2) state — when you go north from that state, there's a 0.1 chance you accidentally veer off to the right, to the -1 state. And so there are subtle tradeoffs: is it better to take the longer, safer route — though the discount factor tends to discourage that, and the 0.02 charge per step tends to discourage that — or is it better to take the shorter, riskier route? It wasn't obvious to me until I computed it, but this is one of those things that the MDP machinery is very good at: making these subtle tradeoffs, and making them optimally.

So what I want to do next is make a few more definitions, and that will lead us to our first algorithm for computing optimal policies in MDPs — so, finding optimal ways to act in MDPs. Before I move on, let's check for any questions about the MDP formalism.

Okay, cool. So let's now talk about how we actually go about computing an optimal policy like that, and to get there, I need to define a few things. Just as a preview of the next moves I'm going to take: I'm going to define something called Vπ, then I'm going to define V*, and then I'm going to define π*, the optimal policy. And as I define these things, keep in mind what is a definition and what is a consequence of a definition. In particular, I won't be defining π* to be the optimal policy; I'll define π* by a different equation, and it will be a consequence of my definition that π* is the optimal policy.

The first thing I want to define is Vπ. So for any given policy π, I'm going to define the value function Vπ, and sometimes I call this the value function for π.
So I want to find the value function for Vp, the 521 00:29:21,330 --> 00:29:27,660 function mapping from the state's known numbers, such that 522 00:29:27,660 --> 00:29:34,660 Vp(s) is the expected payoff - 523 00:29:36,910 --> 00:29:40,950 is the expected total payoff if 524 00:29:40,950 --> 00:29:44,860 you started 525 00:29:44,860 --> 00:29:48,809 in the state s and 526 00:29:48,809 --> 00:29:52,710 execute p. 527 00:29:52,710 --> 00:29:55,430 So in other words, 528 00:29:55,430 --> 00:30:01,000 Vp(s) is equal to the expected value of this here, sum of this 529 00:30:01,000 --> 00:30:04,580 counted rewards, the total payoff, 530 00:30:04,580 --> 00:30:06,429 given that 531 00:30:06,429 --> 00:30:10,440 you execute the policy p and 532 00:30:10,440 --> 00:30:13,880 the first state in the sequence is zero, is that 533 00:30:13,880 --> 00:30:15,470 state s. 534 00:30:15,470 --> 00:30:19,600 I say this is slightly sloppy probabilistic notation, so p isn't really in 535 00:30:19,600 --> 00:30:22,660 around the variable, so maybe I shouldn't actually be conditioning 536 00:30:22,660 --> 00:30:25,180 on p. This is sort of 537 00:30:25,180 --> 00:30:31,650 moderately standard notation horror, so we use the steady sloppy policy notation. 538 00:30:31,650 --> 00:30:33,160 So 539 00:30:33,160 --> 00:30:40,160 as a concrete example, 540 00:30:43,110 --> 00:30:45,130 541 00:30:45,130 --> 00:30:52,130 542 00:30:54,549 --> 00:30:56,580 here's a policy and 543 00:30:56,580 --> 00:31:03,580 this 544 00:31:13,830 --> 00:31:17,340 is not a great policy. This is just some policy p. It's actually a pretty bad 545 00:31:17,340 --> 00:31:18,260 policy that 546 00:31:18,260 --> 00:31:23,160 for many states, seems to be heading to the 1 rather than the +1. 547 00:31:23,160 --> 00:31:26,200 And 548 00:31:26,200 --> 00:31:30,640 so the value function is the function mapping from the states of known numbers, 549 00:31:30,640 --> 00:31:37,440 so it associates each state with a number 550 00:31:37,440 --> 00:31:44,440 551 00:31:58,670 --> 00:32:00,860 and in this case, this is Vp. So that's the value function 552 00:32:00,860 --> 00:32:03,230 for this policy. 553 00:32:03,230 --> 00:32:07,650 And so you notice, for instance, that for all the states in the 554 00:32:07,650 --> 00:32:09,130 bottom two rows, 555 00:32:09,130 --> 00:32:13,160 I guess, this is a really bad policy that has a high chance to take you to the 556 00:32:13,160 --> 00:32:14,270 -1 state 557 00:32:14,270 --> 00:32:16,059 and so all the 558 00:32:16,059 --> 00:32:17,500 559 00:32:17,500 --> 00:32:21,440 values for those states in the bottom two rows are negative. 560 00:32:21,440 --> 00:32:25,120 So in this expectation, your total payoff would be negative. And if you 561 00:32:25,120 --> 00:32:27,359 execute this rather bad policy, you 562 00:32:27,359 --> 00:32:29,430 start in any of the states in the bottom row, and 563 00:32:29,430 --> 00:32:33,830 if you start in the top row, the total payoff would be positive. This is not 564 00:32:33,830 --> 00:32:35,639 a terribly bad policy if it 565 00:32:35,639 --> 00:32:39,460 stays in the topmost row. 566 00:32:39,460 --> 00:32:40,880 And so 567 00:32:40,880 --> 00:32:43,320 given any policy, you can write down 568 00:32:43,320 --> 00:32:45,280 a value function 569 00:32:45,280 --> 00:32:50,470 for that policy. 570 00:32:50,470 --> 00:32:52,679 If 571 00:32:52,679 --> 00:32:59,679 some of you are still writing, I'll leave that up for a second while I clean another couple of boards. 
572 00:33:10,500 --> 00:33:12,540 Okay. 573 00:33:12,540 --> 00:33:16,470 So 574 00:33:16,470 --> 00:33:21,990 the circumstances of the following, 575 00:33:21,990 --> 00:33:25,170 Vp(s) is equal to - well, 576 00:33:25,170 --> 00:33:28,020 the expected value of R if 577 00:33:28,020 --> 00:33:31,440 s is zero, which is the reward you get right away 578 00:33:31,440 --> 00:33:35,250 for just being in the initial state s, plus - let 579 00:33:35,250 --> 00:33:38,160 me write this like this - I'm going to write gamma 580 00:33:38,160 --> 00:33:39,820 and then R if s1 plus gamma, R 581 00:33:39,820 --> 00:33:43,270 of s2 plus dot, dot, 582 00:33:43,270 --> 00:33:44,740 dot, 583 00:33:44,740 --> 00:33:48,500 what 584 00:33:48,500 --> 00:33:55,500 condition of p. 585 00:34:07,600 --> 00:34:08,520 Okay? So just 586 00:34:08,520 --> 00:34:10,999 de-parenthesize these. 587 00:34:10,999 --> 00:34:14,399 This first term here, this is sometimes called the immediate reward. This is 588 00:34:14,399 --> 00:34:16,399 the reward you get right away 589 00:34:16,399 --> 00:34:18,710 just for starting in the state at zero. 590 00:34:18,710 --> 00:34:22,089 And then the second term here, these are sometimes called the future reward, which 591 00:34:22,089 --> 00:34:23,499 is the rewards you get 592 00:34:23,499 --> 00:34:26,459 sort of one time step into the future 593 00:34:26,459 --> 00:34:28,219 and what I want you 594 00:34:28,219 --> 00:34:31,739 to note is what this term is. That 595 00:34:31,739 --> 00:34:32,719 term there 596 00:34:32,719 --> 00:34:35,199 is really just 597 00:34:35,199 --> 00:34:39,019 the value function 598 00:34:39,019 --> 00:34:40,579 for the state s1 599 00:34:40,579 --> 00:34:43,449 because this term here in parentheses, this is really - 600 00:34:43,449 --> 00:34:46,499 suppose I was to start in the state s1 or 601 00:34:46,499 --> 00:34:49,950 this is the sum of the counter rewards I would get if I were to start in the state s1. 602 00:34:49,950 --> 00:34:53,589 So my immediate reward for starting in s1 would be R (s1), then 603 00:34:53,589 --> 00:34:55,599 plus gamma times 604 00:34:55,599 --> 00:34:59,029 additional future rewards in the future. 605 00:34:59,029 --> 00:35:01,639 And so 606 00:35:01,639 --> 00:35:04,199 it turns out you can 607 00:35:04,199 --> 00:35:09,540 write VRp recursively in terms of 608 00:35:09,540 --> 00:35:10,629 609 00:35:10,629 --> 00:35:12,220 itself. 610 00:35:12,220 --> 00:35:16,049 And presume that VRp is equal to the 611 00:35:16,049 --> 00:35:18,639 immediate reward plus gamma times - 612 00:35:18,639 --> 00:35:20,219 actually, 613 00:35:20,219 --> 00:35:21,809 let 614 00:35:21,809 --> 00:35:24,690 me write - it would 615 00:35:24,690 --> 00:35:30,099 just be mapped as notation - s0 gets mapped to s and s1 gets mapped to s prime and 616 00:35:30,099 --> 00:35:31,539 it 617 00:35:31,539 --> 00:35:33,730 just makes it all right. So value function 618 00:35:33,730 --> 00:35:36,829 for p to stay zero is the immediate reward 619 00:35:36,829 --> 00:35:39,780 plus this current factor gamma times 620 00:35:39,780 --> 00:35:41,549 - 621 00:35:41,549 --> 00:35:44,839 and now you have Vp of s1. Right here is Vp of s 622 00:35:44,839 --> 00:35:46,529 prime. 
623 00:35:46,529 --> 00:35:49,140 But s prime is a random variable 624 00:35:49,140 --> 00:35:50,959 because the next state you get to 625 00:35:50,959 --> 00:35:53,049 after one time set is random 626 00:35:53,049 --> 00:35:57,569 and so in particular, taking expectations, this is the sum of all 627 00:35:57,569 --> 00:36:01,349 states s prime 628 00:36:01,349 --> 00:36:02,440 of 629 00:36:02,440 --> 00:36:08,009 your probability of getting to that state 630 00:36:08,009 --> 00:36:10,009 times 631 00:36:10,009 --> 00:36:17,009 that. 632 00:36:17,439 --> 00:36:19,319 And 633 00:36:19,319 --> 00:36:22,169 just to be clear on this notation, right, this is 634 00:36:22,169 --> 00:36:27,249 P subscript (s, a) of s prime is the chance of you getting to the state s prime when you take the 635 00:36:27,249 --> 00:36:29,409 action a in state s. 636 00:36:29,409 --> 00:36:33,160 And in this case, we're executing the policy p 637 00:36:33,160 --> 00:36:36,029 and so this is P(s) p(s) 638 00:36:36,029 --> 00:36:37,740 because 639 00:36:37,740 --> 00:36:42,609 the action we're going to take in state s is the action p reverse. 640 00:36:42,609 --> 00:36:43,899 So this is 641 00:36:43,899 --> 00:36:48,689 - in other words, this Ps 642 00:36:48,689 --> 00:36:49,249 p(s), 643 00:36:49,249 --> 00:36:53,419 this distribution overstates s prime that you would transitioned to, the 644 00:36:53,419 --> 00:36:54,539 one time step, 645 00:36:54,539 --> 00:36:57,469 when you take the action p(s) 646 00:36:57,469 --> 00:37:03,209 in the state s. 647 00:37:03,209 --> 00:37:10,209 So just to give this a name, this equation is called Bellman's equations and is one of the 648 00:37:13,059 --> 00:37:20,059 central equations that we'll use over and over 649 00:37:22,359 --> 00:37:24,709 when we 650 00:37:24,709 --> 00:37:28,689 solve MDPs. Raise your hand if this equation makes 651 00:37:28,689 --> 00:37:29,919 sense. Some of 652 00:37:29,919 --> 00:37:36,919 you didn't raise your hands. Do you have questions? 653 00:37:40,969 --> 00:37:42,739 So let's try to say 654 00:37:42,739 --> 00:37:44,779 this again. 655 00:37:44,779 --> 00:37:50,630 Actually, which of the symbols don't make sense for those of you that didn't raise your hands? 656 00:37:50,630 --> 00:37:56,640 You're regretting not raising your hand now, aren't you? Let's try saying 657 00:37:56,640 --> 00:38:00,249 this one more time and maybe it will come clear later. 658 00:38:00,249 --> 00:38:01,139 So 659 00:38:01,139 --> 00:38:03,089 what 660 00:38:03,089 --> 00:38:05,889 is it? So this equation is sort of 661 00:38:05,889 --> 00:38:09,339 like my value at the current state is equal to 662 00:38:09,339 --> 00:38:12,400 R(s) plus 663 00:38:12,400 --> 00:38:14,149 gamma times - 664 00:38:14,149 --> 00:38:18,780 and depending on what state I get to next, 665 00:38:18,780 --> 00:38:21,019 my expected total payoff 666 00:38:21,019 --> 00:38:22,950 from the state 667 00:38:22,950 --> 00:38:25,589 s prime is Vp of s prime, 668 00:38:25,589 --> 00:38:29,799 whereas prime is the state I get to after one time step. So incurring one state as 669 00:38:29,799 --> 00:38:33,379 I'm going to take some action and I get to some state that's prime and this equation 670 00:38:33,379 --> 00:38:37,669 is sort of my expected total payoff for executing the policy p from the 671 00:38:37,669 --> 00:38:40,969 state s. 
But 672 00:38:40,969 --> 00:38:45,369 s prime is random because the next state I get to is random and well, 673 00:38:45,369 --> 00:38:46,819 we'll use the 674 00:38:46,819 --> 00:38:51,299 next board. The chance I get to some specific state as prime 675 00:38:51,299 --> 00:38:55,869 is given by we have a P subscript (s, a), s prime, 676 00:38:55,869 --> 00:38:57,630 where - 677 00:38:57,630 --> 00:39:00,349 because these are just [inaudible] and probabilities - 678 00:39:00,349 --> 00:39:04,189 where the action a I chose is 679 00:39:04,189 --> 00:39:07,489 given by p(s) because I'm executing the action a 680 00:39:07,489 --> 00:39:10,140 in the current state s. 681 00:39:10,140 --> 00:39:14,990 And so when you plug this back in, you get P subscript s R(s) 682 00:39:14,990 --> 00:39:18,569 as prime, just gives me the distribution over the states in making the transition 683 00:39:18,569 --> 00:39:20,209 to it in one step, 684 00:39:20,209 --> 00:39:20,809 and hence, 685 00:39:20,809 --> 00:39:27,809 that just needs Bellman's equations. So 686 00:39:28,400 --> 00:39:31,640 since Bellman's equations gives you a 687 00:39:31,640 --> 00:39:32,509 way 688 00:39:32,509 --> 00:39:34,709 to solve for the 689 00:39:34,709 --> 00:39:37,249 value function for policy in closed form. 690 00:39:37,249 --> 00:39:41,439 So again, the problem is suppose I'm given a fixed policy. How do I solve for Vp? How 691 00:39:41,439 --> 00:39:44,399 do I solve for the - 692 00:39:44,399 --> 00:39:50,170 so given fixed policy, how do I solve for this equation? It turns 693 00:39:50,170 --> 00:39:53,479 out, Bellman's equation gives you a way of doing this, 694 00:39:53,479 --> 00:39:55,749 so coming back to this board, 695 00:39:55,749 --> 00:39:57,649 it turns out to be 696 00:39:57,649 --> 00:40:01,769 - sort of coming back to the previous boards, 697 00:40:01,769 --> 00:40:05,069 around - could you 698 00:40:05,069 --> 00:40:08,139 move the camera to point to this board? Okay, cool. 699 00:40:08,139 --> 00:40:10,669 So going back to this board, we work 700 00:40:10,669 --> 00:40:17,669 the problem's equation, this equation, right, 701 00:40:18,499 --> 00:40:21,420 let's say I have a fixed policy p and I want to solve for the 702 00:40:21,420 --> 00:40:23,769 value function for the policy p. 703 00:40:23,769 --> 00:40:28,410 Then what this equation is just imposes a set of linear constraints on the value 704 00:40:28,410 --> 00:40:30,060 function. So in particular, 705 00:40:30,060 --> 00:40:34,530 this says that the value for a given state is equal to some constant, and 706 00:40:34,530 --> 00:40:36,189 then some linear function 707 00:40:36,189 --> 00:40:38,589 of other values. 708 00:40:38,589 --> 00:40:44,739 And so you can write down one such equation for every state in your MDP 709 00:40:44,739 --> 00:40:46,799 and this imposes a set of 710 00:40:46,799 --> 00:40:50,089 linear constraints on what the value function could be. 711 00:40:50,089 --> 00:40:53,619 And then it turns out that by solving the resulting linear system equations, you 712 00:40:53,619 --> 00:40:55,979 can then solve for the value function 713 00:40:55,979 --> 00:40:57,760 Vp(s). 714 00:40:57,760 --> 00:41:00,969 There's a high level description. Let me now make this concrete. 715 00:41:00,969 --> 00:41:07,969 So specifically, 716 00:41:11,449 --> 00:41:16,019 let me take the free one state, that state we're 717 00:41:16,019 --> 00:41:18,759 using as an example. 
So Bellman's equation tells me the value of π for the (3,1) state — oh, and let's say I have a specific policy such that π of (3,1) is the North action, which is not the optimal action. For this policy, Bellman's equation tells me that Vπ((3,1)) is equal to R((3,1)), plus gamma times: with chance 0.8 I get to the (3,2) state, with chance 0.1 I get to the (4,1) state, and with chance 0.1 I get to the (2,1) state. That is,

Vπ((3,1)) = R((3,1)) + γ·[ 0.8·Vπ((3,2)) + 0.1·Vπ((4,1)) + 0.1·Vπ((2,1)) ]

And so what I've done is I've written down Bellman's equation for the (3,1) state. I hope you know what that means — in my little MDP, I'm indexing the states by the numbers 1, 2, 3, 4 that I wrote along the sides of the grid, so the state over there where I drew the circle is the (3,1) state.

So for every one of my 11 states in the MDP, I can write down an equation like this — this one stands just for one state. And you notice that if I'm trying to solve for the values, so these Vπ values are the unknowns, then I will have 11 variables, because I'm trying to solve for the value function for each of my 11 states, and I will have 11 constraints, because I can write down 11 equations of this form — one such equation for each one of my states. So if you write down this sort of equation for every one of your states, you have a set of linear equations with 11 unknowns — 11 equations with 11 unknowns — and you can solve that linear system of equations to get an explicit solution for Vπ. So if you have n states, you end up with n equations and n unknowns, and you solve that to get the values for all of your states.

Okay, cool. So actually, could you just raise your hand if this made sense? Cool.
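Writing one Bellman equation per state and stacking them gives the linear system (I - γ·Pπ)·V = R, which the following sketch solves with NumPy for the fixed policy `pi` from the earlier sketches (it relies on the `STATES`, `transition_probs`, `reward`, `GAMMA`, and `TERMINAL` names defined there). Treating the two terminal states as having no successors is one way of encoding the zero-cost absorbing state from earlier.

```python
import numpy as np

def evaluate_policy(policy):
    """Solve the n-equations / n-unknowns linear system for V^pi exactly."""
    n = len(STATES)
    index = {s: i for i, s in enumerate(STATES)}
    P = np.zeros((n, n))                   # P[i, j] = P_{s_i, pi(s_i)}(s_j)
    R = np.array([reward(s) for s in STATES])
    for s in STATES:
        if s in TERMINAL:
            continue                       # absorbing: no future rewards after +1 / -1
        for s_next, p in transition_probs(s, policy[s]).items():
            P[index[s], index[s_next]] += p
    # V = R + gamma * P V   =>   (I - gamma * P) V = R
    V = np.linalg.solve(np.eye(n) - GAMMA * P, R)
    return {s: V[index[s]] for s in STATES}

V_pi = evaluate_policy(pi)
print(V_pi[(3, 1)])
```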
All right, so that was the value function for a specific policy and how to solve for it. Let me define one more thing. The optimal value function is defined as V*(s) equals the max over all policies π of V^π(s). In other words, for any given state s, the optimal value function says: suppose I take the max over all possible policies π; what is the best possible expected sum of discounted rewards I can get? Or, what is my optimal expected total payoff for starting in state s, taking the max over all possible control policies π?

So it turns out that there's a version of Bellman's equations for V* as well - this is called Bellman's equations for V*, rather than for V^π - and let me just write that down. It says that the optimal payoff you can get from the state s is equal to the following. Starting off in the state s, you're going to get your immediate reward R(s), and then, depending on what action a you take, your expected future payoff is given by the second term. If I take an action a in some state s, then with probability P_sa(s') I transition to the state s', and when I get to the state s', I expect my total payoff from there to be V*(s'), because from then on I'm acting optimally starting from s'. So the only thing in this equation left to fill in is which action a to use, and in order to actually obtain the maximum, the optimal expected total payoff, what you should choose here is the max over actions a - choose the action a that maximizes the expected value of your future payoff. So it just makes sense. This is the version of Bellman's equations for V* rather than V^π, and I'll just say it again: it says my optimal expected total payoff is my immediate reward plus, for the best action I can choose, the max over all actions a of my expected future payoff.
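Reconstructing the board work from the spoken description, the two statements are:

```latex
V^*(s) \;=\; \max_{\pi} V^{\pi}(s)
\qquad\text{and}\qquad
V^*(s) \;=\; R(s) \;+\; \max_{a \in A}\; \gamma \sum_{s'} P_{sa}(s')\, V^*(s')
```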
And this also leads to my definition of π*. Let's say I'm in some state s and I want to know what action to choose. Well, if I'm in some state s, I'm going to get the immediate reward R(s) anyway, so the best action for me to choose is whatever action enables me to maximize the second term. If my robot is in some state s and it wants to know what action to choose, I want to choose the action that maximizes my expected total payoff, and so π*(s) is defined as the arg max over actions a of that same sum, the expected optimal value of the next state. I could also put the γ in there, but γ is just a positive constant - γ is almost always positive - so I just drop it, because it's just a constant scale factor and doesn't affect the arg max. And so the consequence of this definition is that π* is actually the optimal policy, because π* maximizes my expected total payoff. Cool. Any questions at this point? Cool.
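As a small sketch (again with the assumed array conventions, not lecture code), once V* and the transition probabilities are stored as arrays, reading off π* is just one arg max per state:

```python
import numpy as np

def greedy_policy(P, V):
    """pi*(s) = argmax_a sum_{s'} P_sa(s') V(s').

    P : array (n_states, n_actions, n_states); V : array (n_states,).
    gamma is omitted, as in the lecture: a positive constant doesn't change the argmax.
    """
    # Q[s, a] = expected value of the next state if action a is taken in state s
    Q = P @ V                      # shape (n_states, n_actions)
    return np.argmax(Q, axis=1)
```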
So what I'd like to do now is talk about how an algorithm can actually compute π*, the optimal policy. I should write down a little bit more before I do that, but notice that if I can compute V*, the optimal value function, then I can plug it into this equation and I'll be done - if I can compute V*, I can compute the optimal policy. So my strategy for computing the optimal policy will be to compute V* and then plug it into this equation, and that will give me the optimal policy π*. So my next goal will really be to compute V*. But the definition of V* here doesn't lead to a nice algorithm for computing it, because - let's see - I know how to compute V^π for any given policy π by solving that linear system of equations, but there's an exponentially large number of policies. You have 11 states and 4 actions, so the number of policies is 4 to the power of 11. That is a huge space of possible policies, and so I can't actually enumerate all the policies and then take a max over them. I was going to write down some other things first, just to ground the notation, but let's skip that and just talk about the algorithm: we'll come up with an algorithm for computing V*, the optimal value function, and then we'll plug that in here, and that will give us the optimal policy π*.

So this is an algorithm called value iteration, and it makes use of Bellman's equations for the optimal value function to compute V*. Here's the algorithm: initialize V(s) to zero for every state, and then repeatedly perform the update below for every state - that's the entirety of the algorithm; you just repeatedly do this step. Just to be concrete, in my MDP of 11 states, the first step, initialize V(s) := 0, means that in a computer implementation I create an array of 11 elements and set all of them to zero. You can actually initialize it to anything; it doesn't really matter. And now what I'm going to do is take Bellman's equations and keep computing the right-hand side and overwriting the left-hand side - essentially iteratively trying to make Bellman's equations hold true for the numbers V(s) stored along the way. So V(s) here is that array of 11 elements, and I'm going to repeatedly compute the right-hand side and copy it onto V(s). And it turns out that when you do this, it will make V(s) converge to V*(s), which is maybe no surprise, because we know V* satisfies Bellman's equations. Some of these ideas get developed more in the problem sets, so I won't prove the convergence of this algorithm here.
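Written out (reconstructing the board equation from the description of overwriting the left-hand side with the right-hand side of Bellman's equations), the update applied on each pass is:

```latex
V(s) \;:=\; R(s) \;+\; \max_{a \in A}\; \gamma \sum_{s'} P_{sa}(s')\, V(s')
\qquad \text{for every state } s .
```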
Now, some implementation details: it turns out there are two ways you can perform this update. When I say "for every state s, perform this update," one way you can do it is, for every state s, compute the right-hand side, and then simultaneously overwrite the left-hand side for every state s. If you do that, it's called a synchronous update - you update all the states s simultaneously. And if you do a synchronous update, it's sometimes written as follows: you have some value function at the t-th iteration of the algorithm, you compute some function of your entire value function, and then you set your value function to the new version, simultaneously updating all 11 values in your value function. So it's sometimes written V := B(V), where B here is called the Bellman backup operator. Synchronous value iteration takes the value function and applies the Bellman backup operator to it, and the Bellman backup operator just means computing the right-hand side of this update for all the states and overwriting your entire value function.

The other way of performing these updates is asynchronous updates, which is where you update the states one at a time. You go through the states in some fixed order, so you would update V(s) for state number 1, then update V(s) for state number 2, then state number 3, and so on. And when I'm updating V(s) for state number 5, if I end up using the values of states 1, 2, 3, and 4 on the right-hand side, then I use my recently updated values on the right-hand side. So as you update sequentially, when you're updating the fifth state, you're already using the new values for states 1, 2, 3, and 4. That's called an asynchronous update. Either version will cause V(s) to converge to V*(s). Asynchronous updates tend to be just a tiny bit faster, and it turns out the analysis of value iteration with synchronous updates is easier, but that's a minor point.
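Here is a minimal sketch of the synchronous version, under the same assumed array conventions as the earlier snippets (again, not code from the lecture). Each pass is one application of the Bellman backup operator to the whole value function.

```python
import numpy as np

def value_iteration(P, R, gamma, V0=None, n_iters=10000, tol=1e-8):
    """Synchronous value iteration: repeatedly set V := B(V).

    P : array (n_states, n_actions, n_states); R : array (n_states,).
    V0 is an optional starting value function (zeros by default).
    Returns an estimate of V*.
    """
    V = np.zeros(len(R)) if V0 is None else np.array(V0, dtype=float)
    for _ in range(n_iters):
        # Bellman backup: V_new(s) = R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
        V_new = R + gamma * np.max(P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # stop once the backups stop changing V
            return V_new
        V = V_new                              # overwrite all states simultaneously
    return V
```

Passing the result to the greedy_policy sketch above then gives a policy; an asynchronous variant would instead loop over states inside each sweep and overwrite V[s] in place.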
So when you run this algorithm on the MDP - and I forgot to say, all of these values were computed with γ equal to 0.99, and actually Roger Gross, who's, I guess, a master's student, helped me with computing some of these numbers - when you run value iteration on this MDP, the numbers you get for V* are as follows: 0.86, 0.90, and so on. Again, the numbers don't matter that much, but just take a look at them and make sure they intuitively make sense. And then when you plug those into the formula I wrote down earlier for computing π* - well, I drew this previously, but here's the optimal policy π*. And so, just to summarize, the process is: run value iteration to compute V*, which gives you this table of numbers, and then use the formula for π* to compute the optimal policy, which is this policy in this case.

Now, to be completely concrete, let's look at that (3,1) state again. Is it better to go west, to the left, or is it better to go north? Let me just illustrate why I'd rather go west than north, using the formula for π*(s). Let me just write this down - I wrote it down really quickly, so it's messy writing. The way I got these numbers is: suppose I'm in this (3,1) state. If I choose to go west, then with probability 0.8 I get to the state with value 0.75 in the table; with probability 0.1 I veer off and get to the 0.69 state; and with probability 0.1 I head south, bounce off the wall, and stay where I am.
So my expected future payoff for going west is 0.8 times 0.75, plus 0.1 times 0.69, plus 0.1 times 0.71 - the last 0.71 being the value of the state I'm in, if I bounce off the wall to the south and stay where I am - and that gives you 0.740. You can then repeat the same process to estimate your expected total payoff if you go north: with probability 0.8 you actually end up going north, so you get the 0.69 state; with probability 0.1 you end up over here, and with probability 0.1 you end up over there. That maps directly to the corresponding expression, and when you compute the expectation, you get 0.676. So your expected total payoff is higher if you go west than if you go north, and that's why the optimal action in this state is to go west.

So that was value iteration.
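A quick numerical check of that comparison. The two 0.1-probability neighbors for the north action aren't named explicitly in the lecture; the values 0.75 and 0.49 below are assumptions chosen to match the quoted 0.676.

```python
# Expected value of the next state from (3,1) under each action,
# using the V* entries read off the board in lecture.
west  = 0.8 * 0.75 + 0.1 * 0.69 + 0.1 * 0.71   # = 0.740
north = 0.8 * 0.69 + 0.1 * 0.75 + 0.1 * 0.49   # = 0.676 (neighbor values assumed)
print(west, north)                              # west wins, so pi*((3,1)) = west
```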
It turns out there are two standard algorithms for computing optimal policies in MDPs. Value iteration is one - as soon as you finish writing - and the other standard algorithm for computing optimal policies in MDPs is called policy iteration. Let me just write this down. In policy iteration, we initialize the policy π randomly - it doesn't matter; it can be the policy that always goes north, or the policy that takes actions at random, or whatever - and then we repeatedly do the following two steps, and that's the whole algorithm. In the first step, we solve: we take the current policy π and we solve Bellman's equations to obtain V^π. So remember, earlier I said that if you have a fixed policy π, then Bellman's equations define a system of linear equations with 11 unknowns and 11 linear constraints. You solve that linear system of equations to get the value function for your current policy π - and by this notation, I just mean let V be the value function for policy π. Then the second step is to update the policy. In other words, you pretend that your current guess V for the value function is indeed the optimal value function, and you let π(s) be equal to that arg max formula, so as to update your policy π. And it turns out that if you do this, then V will converge to V* and π will converge to π*, and so this is another way to find the optimal policy for an MDP.

In terms of tradeoffs, it turns out that - let's see - in policy iteration, the computationally expensive step is this first one: you need to solve a linear system of equations, with n equations and n unknowns if you have n states. And so if you have a problem with a reasonably small number of states - if you have a problem with, say, 11 states - you can solve the linear system of equations fairly efficiently, and so policy iteration tends to work extremely well for problems with smallish numbers of states, where you can actually solve those linear systems of equations efficiently. So if you have a thousand states, or anything less than that, you can solve a system of a thousand equations very efficiently, and policy iteration will often work fine. If you have an MDP with an enormous number of states - we'll actually often see MDPs with tens of thousands or hundreds of thousands or millions or tens of millions of states - if you have a problem with 10 million states and you try to apply policy iteration, then this step requires solving a linear system of 10 million equations, and that would be computationally expensive. And so for those really, really large MDPs, I tend to use value iteration.
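Putting the two steps together, here is a compact sketch of policy iteration. It reuses the hypothetical evaluate_policy and greedy_policy helpers sketched earlier and the same assumed array layout; it is not the lecture's own code.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=100):
    """Alternate policy evaluation (a linear solve) with greedy policy improvement."""
    n_states, n_actions, _ = P.shape
    pi = np.random.randint(n_actions, size=n_states)   # arbitrary initial policy
    V = np.zeros(n_states)
    for _ in range(max_iters):
        V = evaluate_policy(P, R, pi, gamma)    # step 1: solve Bellman's equations for V^pi
        new_pi = greedy_policy(P, V)            # step 2: act greedily with respect to V
        if np.array_equal(new_pi, pi):          # policy stopped changing
            break
        pi = new_pi
    return pi, V
```

The expensive line is the call to evaluate_policy, which hides the n-by-n linear solve discussed above.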
Let's see, any questions about this?

Student: So is this a convex function, so that a local optimization scheme would work?

Instructor (Andrew Ng): Ah, yes, that's a good question: is this a convex problem? It actually turns out that there is a way to pose the problem of solving for V* as a convex optimization problem, as a linear program - for instance, you can write down V* as the solution of a linear program, so it is a problem you can solve that way. And policy iteration does converge; we're not just stuck with local optima. But the proof of convergence of policy iteration uses somewhat different principles than convex optimization, at least the versions of the proof as far as I can see. You could probably relate this back to convex optimization, but that's not the underlying principle of why this converges. The proof is not that difficult, but it is also longer than I want to go over in this class. Yeah, that was a good point. Cool. Actually, any questions on any of these? Okay.

So we now have two algorithms for solving an MDP. Given the five-tuple - given the set of states, the set of actions, the state transition probabilities, the discount factor, and the reward function - you can now apply policy iteration or value iteration to compute the optimal policy for the MDP. The last thing I want to talk about is: what if you don't know the state transition probabilities? Sometimes you won't know the reward function R either, but let's leave that aside. So for example, let's say you're trying to fly a helicopter, and you don't really know in advance what state your helicopter will transition to when you take an action in a certain state, because helicopter dynamics are kind of noisy - you often don't really know what state you'll end up in. So the standard thing to do, or one standard thing to do, is to estimate the state transition probabilities from data. Let me just write this out. The MDP is a five-tuple, right? S, A, the transition probabilities P_sa, γ, and R. S and A you almost always know: the state space is sort of up to you to define - it's whatever system you're trying to control - and the set of actions is, again, just your set of actions.
Usually, we almost always know these. Gamma, the discount factor, is something you choose, depending on how much you want to trade off current versus future rewards. The reward function you usually know - there are some exceptional cases, but usually you're the one who came up with the reward function, so you usually know what it is. Sometimes you don't, but let's just leave that aside for now. The most common thing you'll actually have to learn is the state transition probabilities, so we'll just talk about how to learn those.

So when you don't know the state transition probabilities, the most common thing to do is just estimate them from data. What I mean is: imagine some robot - maybe it's a robot roaming around the hallway, like in that grid example - you would then have the robot just take actions in the MDP, and you would estimate your state transition probability P_sa(s') to be pretty much exactly what you'd expect it to be: the number of times you took action a in state s and got to s', divided by the number of times you took action a in state s. Okay? So the estimate is just: of all the times you took the action a in the state s, what fraction of the time did you actually get to the state s'? It's pretty much exactly what you'd expect it to be. Or, in case you've never actually tried action a in state s, so that this ratio turns out to be 0 over 0, you can then fall back on some default estimate, like the uniform distribution over all states, which is a reasonable default.
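A sketch of that maximum-likelihood estimate from logged transitions. The (s, a, s') triple format and the function name are assumptions for illustration, not something specified in the lecture.

```python
import numpy as np

def estimate_transitions(transitions, n_states, n_actions):
    """Estimate P_sa(s') from observed (state, action, next_state) index triples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)          # times (s, a) was ever tried
    # Count ratio where we have data; uniform default for the 0/0 cases.
    P = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    return P
```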
And by the way, it turns out that in reinforcement learning - in most of the earlier parts of this class, where we did supervised learning, I talked about, say, the logistic regression algorithm, and there's a fairly standard way to implement logistic regression, or SVMs, or factor analysis, or whatever. It turns out that in reinforcement learning there's more of a mix-and-match sense to it, I guess, so there are often different pieces of different algorithms you can choose to use. So in some of the algorithms I write down, there's more than one way to do it, and I'm giving specific examples, but if you're faced with an AI problem - some of you will be controlling robots - you may want to plug in value iteration here instead of policy iteration, or do something slightly different from one of the specific things I wrote down. That's actually fairly common; in reinforcement learning there are many ways to mix and match different algorithms, and this will come up again in the coming weeks' lectures.

So, just putting the things I said together, here would be an example of how you might estimate the state transition probabilities in an MDP and find a policy for it. You might repeatedly do the following. Let's see. First, take actions using some policy π to get experience in the MDP, meaning just execute the policy π and observe the state transitions. Second, based on the data you get, update your estimates of the state transition probabilities P_sa using the observations you just got. Third, solve Bellman's equations using value iteration - which I'm abbreviating VI, and by Bellman's equations I mean Bellman's equations for V*, not for V^π - to get an estimate of V*. And then, fourth, update your policy by setting π(s) equal to the arg max, as before.
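Sketching that loop in code, reusing the hypothetical helpers above; collect_transitions stands in for whatever simulator or robot run supplies the observed (s, a, s') triples, and is an assumption, not part of the lecture.

```python
import numpy as np

def model_based_rl(collect_transitions, R, n_states, n_actions, gamma, n_rounds=20):
    """Alternate: execute the policy, re-estimate the model, re-solve for V*."""
    pi = np.random.randint(n_actions, size=n_states)    # arbitrary initial policy
    V = np.zeros(n_states)
    data = []
    for _ in range(n_rounds):
        data.extend(collect_transitions(pi))                       # 1. get experience under pi
        P_hat = estimate_transitions(data, n_states, n_actions)    # 2. update P_sa estimates
        V = value_iteration(P_hat, R, gamma, V0=V)                 # 3. solve Bellman's equations
        pi = greedy_policy(P_hat, V)                               # 4. update the policy
    return pi
```

Passing the previous round's V in as the starting point is the warm start discussed next in the lecture.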
And now you have a new policy, so you can then go back and execute this policy in the MDP for a bit more to get some more observations of state transitions, use those to update your estimates of the state transition probabilities again, use value iteration or policy iteration to solve for the value function again, get a new policy, and so on. Okay? And it turns out that when you do this - I actually wrote down value iteration for a reason - in the third step of the algorithm, if you're using value iteration rather than policy iteration, then to initialize value iteration, if you use your solution from the previous time through this loop, that's a very good initialization, and value iteration will tend to converge much more quickly. That's because value iteration tries to solve for V(s) for every state s, and so if you initialize your value iteration algorithm using the values you had from the previous pass through this loop, that will often make it converge faster. But again, you can also just substitute policy iteration in here instead, or whatever, and this is a fairly typical example of how you would alternate between estimating the model and solving for a good policy, for a problem in which you did not know the state transition probabilities in advance. Cool. Questions about this? Cool. So that sure was exciting - that's our first two MDP algorithms in just one lecture. All right, let's close for today. Thanks.