This presentation is delivered by the Stanford Center for Professional Development.

Okay. So good morning and welcome back. What I want to do today is actually begin a new chapter in 229 in which I'm gonna start to talk about unsupervised learning. So what I'm gonna do today is very briefly talk about clustering and the k-means algorithm, then talk about a mixture of Gaussians model and a special case of the EM, Expectation Maximization, algorithm for fitting it, describe something called Jensen's inequality, and then we'll use that to derive a general form of the EM, or Expectation Maximization, algorithm, which is a very useful algorithm. We sort of use it all over the place in different unsupervised machine learning applications.

So the cartoon that I used to draw for supervised learning was that you'd be given a data set like this, right, and you'd fit a decision boundary between the positive and negative classes, and we'd call it supervised learning because you're told what the right class label is for every training example, and that was the supervision.

In unsupervised learning, we'll study a different problem. You're given a data set that maybe just comprises a set of points. You're just given a data set with no labels and no indication of what the right answers are or where the supervision is, and it's the job of the algorithm to discover structure in the data. So in this lecture and the next couple of weeks we'll talk about a variety of unsupervised learning algorithms that can look at data sets like these and discover the different types of structure in them.

In this particular cartoon that I've drawn, one piece of structure that you and I can probably see is that this data lives in two different clusters, and so the first unsupervised learning algorithm that I'm gonna talk about will be a clustering algorithm. It'll be an algorithm that looks at a data set like this and automatically breaks it into smaller clusters.
So let's see. When my laptop comes back up, I'll show you an example. So clustering algorithms like these have a variety of applications. Just to rattle off a few of the better-known ones: in biology applications you often cluster different genes, so you take data on a set of genes and cluster the genes together in order to examine them and understand their biological function better.

Another common application of clustering is market research. So imagine you have a customer database of how your different customers behave. It's a very common practice to apply clustering algorithms to break your database of customers into different market segments, so that you can target your products towards different market segments and target your sales pitches specifically to different market segments.

Something we'll do later today, I don't want to do this now, but you can actually go to a website like news.google.com, and that's an example of a website that uses a clustering algorithm to group related news articles together every day to display to you, so that you can see the thousand news articles on whatever the top story of the day is, and the 500 news articles from all the different websites on a different story of the day. And another application is image segmentation, where you take a picture and group together different subsets of the picture into coherent pieces of pixels, to try to understand what's contained in the picture. So that's yet another application of clustering.

So the idea is: given a data set like this, given a set of points, can you automatically group the data set into coherent clusters?

Let's see. I'm still waiting for the laptop to come back so I can show you an example. You know what, why don't I just start to write out the specific clustering algorithm, and then I'll show you the animation later.
So this is called the k-means clustering algorithm, for finding clusters in a data set. The input to the algorithm will be an unlabeled data set, which I write as x(1), x(2), ..., x(m), and because we're now talking about unsupervised learning, you'll see a lot of this: data sets with just the x's and no class labels y.

So what the k-means algorithm does is the following. This will all make a bit more sense when I show you the animation on my laptop. You initialize a set of points, called the cluster centroids, mu_1 through mu_k, randomly, and so if your training examples are vectors in R^n, then your cluster centroids, these mu's, will also be vectors in R^n, and then you repeat until convergence the following two steps.

So the cluster centroids will be your guesses for where the centers of each of the clusters are, and in one of the steps you look at each point x(i), you look at which cluster centroid mu_j is closest to it, and this step is called assigning the point x(i) to cluster j. So you look at each point and pick the cluster centroid that's closest to it. And the other step is you update each cluster centroid to be the mean of all the points that have been assigned to it. Okay.

Let's see. Could you please bring down the display for the laptop? Excuse me. Okay. Okay. There we go. Okay.

So here's an example of the k-means algorithm, and hopefully the animation will make this clearer. This is a little chopped off. This is basically an example I got from Michael Jordan at Berkeley. So these points in green are my data points, and I'm going to randomly initialize a pair of cluster centroids. So the two blue crosses denote the positions of mu_1 and mu_2, say, if I'm going to guess that there's two clusters in this data.
So the steps of the k-means algorithm are as follows. I'm going to repeatedly go through all of the points in my data set, and I'm going to associate each of the green dots with the closer of the two cluster centroids. Visually, I'm gonna denote that by painting each of the dots either blue or red, depending on which is the closer cluster centroid. Okay. So all the points closer to the blue cross are painted blue, and so on.

The other step is updating the cluster centroids, and so I'm going to repeatedly look at all the points that I've painted blue and compute the average of all of the blue dots, and look at all the red dots and compute the average of all the red dots, and then I move the cluster centroids to those respective average locations. So this is now one iteration of k-means.

And now I'll repeat the same process. I look at all the points and assign all the points closer to the blue cross the color blue, and similarly red, and so now I have new assignments of points to the cluster centroids. And finally, I'll again compute the average of all the blue points and the average of all the red points, and update the cluster centroids again as follows, and now k-means has actually converged. If you keep running these two steps of k-means over and over, the cluster centroids and the assignments of the points to their closest cluster centroids will remain the same. Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, I'll answer that in a second. Yeah. Okay. So take a second to look at this again and make sure you understand how the algorithm I wrote out maps onto the animation that we just saw. Do you have a question?

Student: [Inaudible]

Instructor (Andrew Ng): I see. Okay. Let me answer that in a second as well. Okay. So these are the two steps. Step 2.1 was assigning the points to the closest centroid, and step 2.2 was shifting each cluster centroid to be the mean of all the points assigned to that cluster centroid. Okay. Okay.
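For reference, here is a minimal sketch of those two steps in Python with NumPy. This is not code from the lecture; the function name, the choice of initializing centroids at random training points, and the convergence check are all my own assumptions.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch. X is an (m, n) array of points; k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the cluster centroids mu_1, ..., mu_k randomly (here: k distinct training points).
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2.1: assign each point x(i) to the closest cluster centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (m, k) distances
        c = dists.argmin(axis=1)                                         # c[i] = index of closest centroid
        # Step 2.2: move each centroid to the mean of the points assigned to it.
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):   # assignments and centroids no longer change
            break
        mu = new_mu
    return mu, c
```

For the two-cluster example in the animation, this would be called as something like `mu, c = kmeans(X, 2)`.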
So to the questions that we just had: one is, does the algorithm converge? The answer is yes, k-means is guaranteed to converge in a certain sense. In particular, if you define the distortion function to be J(c, mu) = sum from i = 1 to m of ||x(i) - mu_{c(i)}||^2 — so the distortion function is a function of the cluster assignments c and the cluster centroids mu, and it sums the squared distances between the points and the cluster centroids they're assigned to — then you can show, and I won't really prove this here, but you can show that k-means is coordinate descent on the function J.

In particular, who remembers coordinate ascent? We saw it as an optimization algorithm, I don't know, maybe about two weeks ago. Coordinate descent is the algorithm that repeatedly minimizes a function with respect to one set of variables, then with respect to another, alternately. And what you can prove is that the two steps of k-means are exactly optimizing this function with respect to c and with respect to mu, alternately, and therefore this function, J(c, mu), must be decreasing monotonically on every iteration. So the sense in which k-means converges is that this function J(c, mu) can only go down, and therefore it will eventually converge in the sense that it will stop going down. Okay.

It's actually possible that there may be several clusterings that give the same value of J(c, mu), and so k-means may actually switch back and forth between different clusterings, in the extremely unlikely case that there are multiple clusterings giving exactly the same value for this objective function. So k-means may in principle oscillate, but that'll just about never happen in practice, and even if it does, this function J(c, mu) will still converge.
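Written out on its own, the distortion function being minimized is the following (this is just a restatement of the formula above):

```latex
J(c, \mu) \;=\; \sum_{i=1}^{m} \bigl\| x^{(i)} - \mu_{c^{(i)}} \bigr\|^{2}
```

Step 2.1 minimizes J over the assignments c with the centroids mu held fixed, and step 2.2 minimizes J over mu with c held fixed, which is why J is non-increasing from one iteration to the next.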
Another question was how do you choose the number of clusters? So it turns out that the vast majority of the time when people apply k-means, you still just pick a number of clusters by hand, or you try a few different numbers of clusters and pick the one that seems to work best. The number of clusters is just one parameter of this algorithm, so usually I don't think it's very hard to choose. There are some automatic ways of choosing the number of clusters, but I'm not gonna talk about them. When I do this, I usually just pick the number of clusters myself, and the reason is that I think for many clustering problems the true number of clusters is actually ambiguous. So for example, if you have a data set that looks like this, some of you may see four clusters, right, and some of you may see two clusters, and so the right answer for the actual number of clusters is sort of ambiguous. Yeah.

Student: [Inaudible] clusters [inaudible] far away from the data points [inaudible] points in the same cluster?

Instructor (Andrew Ng): I see. Right. So yes, k-means is susceptible to local optima. So this function, J(c, mu), is not a convex function, and k-means, being coordinate descent on a non-convex function, is not guaranteed to converge to the global minimum. So k-means is susceptible to local optima, and one thing you can do is try multiple random initializations, run the clustering a bunch of times, and pick the solution that ended up with the lowest value of the distortion function.

Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): Yeah, let's see. Right. So what if one cluster centroid has no points assigned to it? One thing you could do is just eliminate it, exactly as you say. Another thing you can do is just reinitialize it randomly, if you really want to keep k clusters.
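A minimal sketch of that multiple-restarts recipe, reusing the hypothetical kmeans function from the earlier snippet (again my own code, not the lecture's):

```python
import numpy as np

def distortion(X, mu, c):
    # J(c, mu): sum of squared distances from each point to its assigned centroid.
    return float(np.sum((X - mu[c]) ** 2))

def kmeans_with_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):
        mu, c = kmeans(X, k, seed=seed)      # kmeans as sketched after the animation above
        J = distortion(X, mu, c)
        if best is None or J < best[0]:      # keep the run with the lowest distortion
            best = (J, mu, c)
    return best
```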
More questions? Yeah.

Student: [Inaudible] as a norm, or can you [inaudible] or infinity norm or —

Instructor (Andrew Ng): I see. Right. Is it usually the two-norm? Let's see. For the vast majority of applications I've seen for k-means, you do take the two-norm when you have real-valued data. I'm sure there are others who have used the infinity norm or the one-norm as well. I personally haven't seen that very often, but there are other variations on this algorithm that use different norms; the one I described is probably the most commonly used one there is. Okay.

So that was k-means clustering. What I want to do next, and this will take longer to describe, is actually talk about a closely related problem. In particular, what I wanted to do was talk about density estimation.

As a motivating example, this is a problem that I know some people have worked on. Let's say you work for an aircraft company and you're building aircraft engines off an assembly line, and as the aircraft engines roll off the assembly line, you test these engines and measure various properties of them. To use a simple example, I'm gonna write these properties as heat and vibrations. In reality, you'd measure many different things, different vibration frequencies and so on, but we'll just use the amount of heat produced and the amount of vibration produced.

Let's say that maybe the data looks like that, and what you would like to do is estimate the density of these features, the joint distribution of the amount of heat produced and the amount of vibration produced, because you would like to detect anomalies. So as a new aircraft engine rolls off the assembly line, you can measure the same heat and vibration properties, and if you get a point out there, you can then ask: how likely is it that there was an undetected flaw in this aircraft engine, so that it needs to undergo further inspections?
And so if we look at the typical distribution of features we get, and we build a model for P of X, then if P of X is very small for some new aircraft engine, that would raise a red flag. We'll say this is an anomalous aircraft engine, and we should subject it to further inspections before we let someone fly with the engine.

So this problem I just described is an instance of what is called anomaly detection, and a common way of doing anomaly detection is to take your training set and, from this data set, build a model P of X of the density of the typical data you're seeing, and if you ever then see an example with very low probability under P of X, you may flag that as an anomalous example. Okay?

Anomaly detection is also used in security applications. If many very unusual transactions start to appear on my credit card, that's a sign to me that someone has stolen my credit card.
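A minimal sketch of that recipe, assuming for the moment that the density P(x) is a single Gaussian fit to the training set. The made-up training data, the threshold epsilon, and the single-Gaussian choice are all my own stand-ins; the point of what follows in the lecture is precisely that a richer density model is often needed.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up "heat, vibration" measurements for engines known to be fine.
X_train = np.random.randn(500, 2) * np.array([1.0, 0.3]) + np.array([5.0, 2.0])

# Fit a density model P(x); here, a single Gaussian as a placeholder.
mu_hat = X_train.mean(axis=0)
Sigma_hat = np.cov(X_train, rowvar=False)
p = multivariate_normal(mean=mu_hat, cov=Sigma_hat)

def is_anomalous(x_new, epsilon=1e-4):
    # Flag a new engine whose measurements have very low probability under the model.
    return p.pdf(x_new) < epsilon
```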
And what I want to do now is talk about a specific algorithm for density estimation, and in particular one that works with data sets like these, where the distribution doesn't really fall into any of the standard textbook distributions. This is not really a Gaussian, or an exponential, or anything like that. So can we come up with a model to estimate densities that may have these somewhat unusual shapes? Okay.

So to describe the algorithm a bit more, I'm also going to use a one-dimensional example rather than a 2-D example, and in the example that I'm going to describe, let's imagine a data set that looks like this, where the horizontal axis here is the x axis and these dots represent the positions of the data points I have. Okay.

So this data set looks like it's maybe coming from a density that looks like that, as if it were the sum of two Gaussian distributions, and so the specific model I'm gonna describe is what's called a mixture of Gaussians model. And just to be clear, the picture I have in mind is that maybe there were two separate Gaussians that generated this data set, and if only I knew which of the two Gaussians each point came from, then I could fit one Gaussian to the crosses, fit a Gaussian to the O's, and then sum those up to get the overall density. But the problem is I don't actually have access to these labels. I don't actually know which of the two Gaussians each of my data points came from, and so what I'd like to do is come up with an algorithm to fit this mixture of Gaussians model even when I don't know which of the two Gaussians each of my data points came from. Okay.

So here's the idea. In this model, I'm going to imagine there's a latent random variable — latent is just synonymous with hidden or unobserved, okay. So we're gonna imagine there's a latent random variable z, and x(i), z(i) have a joint distribution that is given as follows. We have that P(x(i), z(i)) = P(x(i) | z(i)) P(z(i)) — by the chain rule of probability, this is always true. And moreover, our model is that z(i) is distributed Multinomial with parameters phi, and in the special case where I have just two Gaussians, z(i) will be Bernoulli. So these parameters phi are the parameters of a multinomial distribution. And the distribution of x(i) conditioned on z(i) being equal to j — so P(x(i) | z(i) = j) — that's going to be a Gaussian distribution with mean mu_j and covariance Sigma_j. Okay.
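Collecting the pieces of the model in one place, in the notation used on the board:

```latex
p\bigl(x^{(i)}, z^{(i)}\bigr) \;=\; p\bigl(x^{(i)} \mid z^{(i)}\bigr)\, p\bigl(z^{(i)}\bigr),
\qquad
z^{(i)} \sim \mathrm{Multinomial}(\phi), \quad \phi_j \ge 0,\ \ \textstyle\sum_{j=1}^{k} \phi_j = 1,
\qquad
x^{(i)} \mid z^{(i)} = j \;\sim\; \mathcal{N}(\mu_j, \Sigma_j).
```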
So this should actually look extremely familiar to you. What I've written down are pretty much the same equations that I wrote down for the Gaussian Discriminant Analysis algorithm that we saw way back, right, except for the difference that, instead of supervised learning where we were given the class labels y, I've now replaced y in Gaussian Discriminant Analysis with these latent, or unobserved, random variables z, and we don't actually know what the values of z are. Okay.

So just to make the link to Gaussian Discriminant Analysis a little more explicit: if we knew what the z's were — which we actually don't, but suppose for the sake of argument that we actually knew which of, say, the two Gaussians each of our data points came from — then you could use maximum likelihood estimation. You can write down the likelihood of the parameters, which would be l(phi, mu, Sigma) = sum from i = 1 to m of log p(x(i), z(i); phi, mu, Sigma), and you can then do maximum likelihood estimation, and you get exactly the same formulas as in Gaussian Discriminant Analysis. Okay.

So if you knew the values of the z's, you could write down the log likelihood and do maximum likelihood this way, and you could then estimate all the parameters of your model. Does this make sense? Raise your hand if this makes sense. Cool. Some of you have questions? Some of you didn't raise your hands. Yeah.

Student: So this z(i) is just a label, like, an X or an O?

Instructor (Andrew Ng): Yes. Basically. Any other questions? Okay.

So if you knew the values of z, with z playing a similar role to the class labels in Gaussian Discriminant Analysis, then you could use maximum likelihood to estimate the parameters. But in reality, we don't actually know the values of the z's.
All we're given is this unlabeled data set, and so let me write down a specific bootstrap-like procedure in which the idea is that we're going to use our model to try and guess what the values of z are. We don't know the z's, but we'll just take a guess at their values, and we'll then use the values of z that we guessed to fit the parameters of the rest of the model, and then we'll iterate. Now that we have a better estimate for the parameters of the rest of the model, we'll take another guess for what the values of z are, and then we'll again use something like maximum likelihood estimation to set the parameters of the model.

So the algorithm I'm gonna write down is called the EM algorithm, and it proceeds as follows. Repeat until convergence the following. In the E-step, we're going to guess the values of the unknown z(i)'s, and in particular, I'm going to set w_j(i) := P(z(i) = j | x(i); phi, mu, Sigma). Okay. So I'm going to compute the probability that z(i) is equal to j; I'm going to use the current parameters of my model to compute the probability that point x(i) came from Gaussian number j.

And just to be concrete about what I mean by this, this means I'm going to compute P(x(i) | z(i) = j) P(z(i) = j), divided by P(x(i)) — this step is just Bayes' rule, I guess. And again, just to be completely concrete about what I mean, that first term, P(x(i) | z(i) = j), well, that's the Gaussian density: it's 1 over (2 pi)^(n/2) |Sigma_j|^(1/2), times e to the -1/2 (x(i) - mu_j)^T Sigma_j^(-1) (x(i) - mu_j), and then divided by the sum from l = 1 to k of essentially the same thing, but with j replaced by l. Okay. So it's the term for Gaussian j in the numerator, and the sum of the similar terms in the denominator. Excuse me.
This is the sum from l = 1 through k in the denominator. Okay. Let's see.

Then there's the maximization step, where you update your estimates of the parameters. So I'll just write down the formulas here; when you see these, you should compare them to the formulas we had for maximum likelihood estimation:

phi_j := (1/m) sum from i = 1 to m of w_j(i),
mu_j := sum over i of w_j(i) x(i), divided by sum over i of w_j(i),
Sigma_j := sum over i of w_j(i) (x(i) - mu_j)(x(i) - mu_j)^T, divided by sum over i of w_j(i).

And so the two formulas on top are very similar to what you saw for Gaussian Discriminant Analysis, except that now we have these w_j(i)'s. So w_j(i), you remember, was the probability we computed that point i came from Gaussian — I don't want to call it cluster j, but that's what it is — that point i came from Gaussian j, rather than an indicator for whether point i came from Gaussian j. Okay. And the one other slight difference between this and the formulas we had for Gaussian Discriminant Analysis is that in the mixture of Gaussians, we more commonly use a different covariance matrix for each of the different Gaussians, whereas in Gaussian Discriminant Analysis, sort of by convention, you usually model all of the classes with the same covariance matrix Sigma.
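Here is a minimal sketch of one iteration of these two steps in Python; the function name and the array shapes are my own choices, and the formulas are just the E-step and M-step written out above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    """One EM iteration for a mixture of k Gaussians.
    X: (m, n) data; phi: (k,) mixing weights; mu: (k, n) means; Sigma: (k, n, n) covariances."""
    k = len(phi)

    # E-step: w[i, j] = P(z(i) = j | x(i); phi, mu, Sigma), via Bayes' rule.
    w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                         for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: update the parameters using the soft weights w in place of hard indicators.
    phi_new = w.mean(axis=0)
    mu_new = (w.T @ X) / w.sum(axis=0)[:, None]
    Sigma_new = np.empty_like(Sigma)
    for j in range(k):
        diff = X - mu_new[j]
        Sigma_new[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum()
    return phi_new, mu_new, Sigma_new
```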
I just wrote down a lot of equations. Why don't you just take a second to look at this and make sure it all makes sense? Do you have questions about this? Raise your hand if this makes sense to you. Okay. Only some of you. Let's see. So let me try to explain that a little bit more.

Some of you recall that in Gaussian Discriminant Analysis, right, if we knew the values of the z(i)'s — so let's see, suppose I were to give you a labeled data set, suppose I were to tell you the values of the z(i)'s for each example, then I'd be giving you a data set that looks like this. Okay. So here's my 1-D data set; that's sort of a typical 1-D Gaussian Discriminant Analysis data set.

So for Gaussian Discriminant Analysis we worked out the maximum likelihood estimates for the parameters of GDA, and one of the estimates was phi_j, which is the probability that z(i) equals j. You would estimate that as the sum from i = 1 to m of the indicator that z(i) equals j, divided by m. Okay. That's what we had when we were deriving GDA: if you knew the class labels for every example, then this was your maximum likelihood estimate for the chance that a label came from the positive class versus the negative class. It's just a fraction of examples: your maximum likelihood estimate for the probability of getting examples from class j is just the fraction of examples in your training set that actually came from class j. So this is the maximum likelihood estimation for Gaussian Discriminant Analysis.

Now, in the mixture of Gaussians model and the EM problem, we don't actually have these class labels, right? We just have an unlabeled data set like this, just a set of dots. I'm trying to draw the same data set that I had above, but just without the class labels. So now, it's as if you only get to observe the x(i)'s, but the z(i)'s are unknown. Okay. So the class labels are unknown, and in the EM algorithm we're going to try to take a guess for the values of the z(i)'s. Specifically, in the E-step we computed w_j(i), our current best guess for the probability that z(i) equals j given that data point. Okay.
So this just means: given my current hypothesis for where the Gaussians are, and given everything else, I can compute the posterior probability — what is the posterior probability that the point x(i) actually came from class j? What is the probability that this point was a cross versus an O? What's the probability that this point came from each of the Gaussians?

And now in the M-step, my formula for estimating the parameter phi_j will be given by 1 over m, sum from i = 1 through m, of w_j(i). So w_j(i) is, right, my best guess for the probability that point i belongs to Gaussian, or belongs to class, j, and I'm using this formula instead of that one. Okay. And similarly, this is my formula for the estimate of mu_j, and if you replace the w_j(i)'s with those indicator functions, you get back the formula that you had in Gaussian Discriminant Analysis. So I'm trying to convey an intuitive sense of why these algorithms make sense. Can you raise your hand if this makes sense now? Cool. Okay.

So what I want to do now is actually present a broader view of the EM algorithm. What you just saw was a special case of the EM algorithm, specialized to the mixture of Gaussians model, and in the remaining half hour I have today I'm going to give a general description of the EM algorithm, and everything you just saw will be derived as, sort of, a special case of this more general view that I'll present now. And as a precursor to actually deriving this more general view of the EM algorithm, I'm gonna have to describe something called Jensen's inequality that we use in the derivation. So here's Jensen's inequality.
Let f be a convex function. So a function is convex if its second derivative, which I've written f'', is greater than or equal to zero. Functions don't have to be differentiable to be convex, but if a function has a second derivative, then f'' should be greater than or equal to zero. And let X be a random variable. Then f applied to the expectation of X is less than or equal to the expectation of f of X. Okay. And hopefully you remember I often drop the square brackets, so E[X] is the expectation of X; I'll often drop the square brackets.

So let me draw a picture that explains this. I think many of my friends and I often don't remember whether it's less than or greater than or whatever, and the way many of us remember the sign of the inequality is by drawing the following picture.

For this example, let's say X is equal to 1 with probability one-half and X is equal to 6 with probability one-half. So I'll illustrate the inequality with this example. So X is 1 with probability one-half and X is 6 with probability one-half, and so the expected value of X is 3.5; it would be in the middle here. So that's the expected value of X; the horizontal axis here is the x axis. And so f of the expected value of X you can read off as this point here. So this is f(E[X]).

Whereas, in contrast, let's see: if X is equal to 1, then here's f(1), and if X equals 6, then here's f(6), and the expected value of f(X), it turns out, is now an average on the vertical axis. With 50 percent chance you get f(1) and with 50 percent chance you get f(6), and so the expected value of f(X) is the average of f(1) and f(6), which is going to be this value in the middle here.
And so in this example you see that the expected value of f(X) is greater than or equal to f of the expected value of X. Okay.

And it turns out, further, that if f''(x) is strictly greater than zero — if this happens, we say f is strictly convex — then the inequality holds with equality, or in other words E[f(X)] = f(E[X]), if and only if X is a constant with probability 1. Well, another way of writing this is X = E[X]. Okay. So in other words, if f is a strictly convex function, then the only way for this inequality to hold with equality is if the random variable X always takes on the same value. Okay. Any questions about this? Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): Say that again?

Student: What is the strict [inaudible]?

Instructor (Andrew Ng): I still couldn't hear that. What is —

Student: What is the strictly convex [inaudible]?

Instructor (Andrew Ng): Oh, I see. If f double prime of x is strictly greater than 0, that's my definition for strictly convex. If the second derivative of x is strictly greater than 0, then that's what it means for f to be strictly convex.

Student: [Inaudible]

Instructor (Andrew Ng): I see. Sure. So, for example, this is an example of a convex function that's not strictly convex, because part of this function is a straight line, and so f double prime would be zero in this portion. Let's see. Yeah. It's just a less formal way of saying it: strictly convex just means that you can't have a convex function with a straight line portion in it. Speaking very informally, think of this as meaning that there aren't any straight line portions. Okay.
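A quick numeric check of the picture above, using f(x) = x squared as a stand-in strictly convex function (my choice; any convex f behaves the same way):

```python
# Jensen's inequality check for X = 1 or 6, each with probability 1/2.
f = lambda x: x ** 2          # a strictly convex function (f'' = 2 > 0)
values, probs = [1.0, 6.0], [0.5, 0.5]

E_X = sum(p * x for p, x in zip(probs, values))          # E[X] = 3.5
f_of_E = f(E_X)                                          # f(E[X]) = 12.25
E_of_f = sum(p * f(x) for p, x in zip(probs, values))    # E[f(X)] = 18.5

assert f_of_E <= E_of_f   # f(E[X]) <= E[f(X)] for convex f
```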
So here's the derivation for the general version of EM. The problem we face is as follows. We have some model for the joint distribution of x and z, but we observe only x, and our goal is to maximize the log likelihood of the parameters of the model. Right. So we have some model for the joint distribution of x and z, and our goal is to find the maximum likelihood estimate of the parameters theta, where the log likelihood is defined as l(theta) = sum from i = 1 to m of log p(x(i); theta), the log probability of our data as usual. And here p(x(i); theta) is now given by a sum over all the values of z(i) of p(x(i), z(i); theta). Okay. So just by taking our model of the joint distribution of x and z and marginalizing out z(i), we get p(x(i); theta). And so the EM algorithm will be a way of performing this maximum likelihood estimation problem, which is complicated by the fact that we have these z(i)'s in our model that are unobserved.

Before I actually do the math, here's a useful picture to keep in mind. The horizontal axis in this cartoon is the theta axis, and there's some function, the log likelihood l(theta), that we're trying to maximize, and usually maximizing it directly, by taking derivatives and setting them to zero, would be very hard to do. What the EM algorithm will do is the following.
What the EM algorithm will do is the following. Let's say it initializes some value of theta zero. What the EM algorithm will end up doing is it will construct a lower bound for this log likelihood function, and this lower bound will be tight - hold with equality - at the current guess of the parameters, and then it maximizes this lower bound with respect to theta, so we'll end up with, say, that value. So that will be theta 1. Okay. And then the EM algorithm looks at theta 1 and constructs a new lower bound that's tight at theta 1, and then we'll maximize that. So you jump here. So that's the next value, theta 2, and you do that again and you get theta 3, theta 4, and so on, until you converge to a local optimum of the log likelihood of theta. Okay. So this is a cartoon that displays what the EM algorithm will do. So let's actually make that formal now.

So you want to maximize with respect to theta - there's my theta - so this is the sum over i from 1 to m of the log of the sum over all values of z(i) of the joint probability of x(i) and z(i), parameterized by theta. Okay.

So what I'm going to do is multiply and divide by the same thing, and I'm gonna write this as Q. Okay. So I'm going to construct a probability distribution Q_i, which will be over the latent random variables z(i), and these Q_i will be distributions, so each of the Q_i will be greater than or equal to zero, and the sum over all the values of z(i) of Q_i will be 1. So these Q's will be a probability distribution that I get to construct. Okay. And then later I'll describe the specific choice of this distribution Q_i.
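(Putting the multiply-and-divide step on one line, as a sketch in my notation, where each Q_i is a distribution over the values of z(i):)

\[
\ell(\theta) \;=\; \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
\frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)},
\qquad
Q_i\big(z^{(i)}\big) \ge 0, \quad \sum_{z^{(i)}} Q_i\big(z^{(i)}\big) = 1.
\]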
So this Q_i is a probability distribution over the random variables z(i), so this is an expectation with respect to Q_i. Right. I see some frowns. Do you have questions about this? No. Okay.

So the log function looks like that, and it's a concave function, so that tells us that the log of E[X] is greater than or equal to the expected value of log X, by the concave-function form of Jensen's inequality. And so continuing from the previous expression, this is a sum of a log of an expectation, and that must therefore be greater than or equal to the expected value of the log of that. Okay. Using Jensen's inequality. And lastly, just to expand out this formula again, this is equal to that. Okay. Yeah.

Student: [Inaudible]

Instructor (Andrew Ng): [Inaudible]

Student: [Inaudible].

Instructor (Andrew Ng): Yeah. Okay. So this has the following interpretation. Let's say I have a random variable Z, right, and Z has some distribution; let's denote it P. And let's say I have some function g of Z. Okay. Then by definition, the expected value of g of Z is equal to the sum over all the values of Z of the probability of that value of Z times g of Z. Right. That's sort of the definition of the expectation of a function of a random variable. And so the way I got from this step to this step is by using that.
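(That definition, written out as a sketch in my notation, with P(z) the probability that Z takes the value z:)

\[
\mathrm{E}\big[g(Z)\big] \;=\; \sum_{z} P(z)\, g(z).
\]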
So in particular, now, I've been using the distribution Q_i to denote the distribution of z(i), so this is, like, the sum over z of Q_i of z times that ratio, and so this is just the expected value, with respect to a random variable z drawn from the distribution Q_i, of g of z. Are there questions?

Student: So in general when you're doing maximum likelihood estimation, you maximize the likelihood of the data, but in this case you only say probability of X because you have only observed X, whereas previously we said probability of X given the labels?

Instructor (Andrew Ng): Yes. Exactly. Right. Right. [Inaudible] we want to choose the parameters that maximize the probability of the data, and in this case, our data comprises only the X's because we don't observe the Z's, and therefore the likelihood of the parameters is given by the probability of the data, which is [inaudible].

So this is all we've done, right: we wanted to maximize the log likelihood of theta, and what we've done, through these manipulations, is we've now constructed a lower bound on the log likelihood of theta. Okay. And in particular, this formula that we came up with - we should think of this as a function of theta. Then, if you think about it, theta are the parameters of your model, right, so if you think about this as a function of your parameters theta, what we've just shown is that the log likelihood of your parameters theta is lower bounded by this thing. Okay.

Remember that cartoon of repeatedly constructing a lower bound and optimizing the lower bound. So what we've just done is construct a lower bound on the log likelihood of theta. Now, the last piece we want for this lower bound is actually that we want this inequality to hold with equality for the current value of theta.
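(Viewed as a function of theta, the bound just constructed - the inequality we now want to be tight at the current theta - is, as a sketch in my notation:)

\[
\ell(\theta)
\;=\; \sum_{i=1}^{m} \log\, \mathrm{E}_{z^{(i)} \sim Q_i}\!\left[\frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}\right]
\;\ge\;
\sum_{i=1}^{m} \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
\log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}.
\]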
So just referring back to the previous cartoon: if this was the log likelihood of theta, we'd then construct some lower bound of it, some function of theta, and if this is my current value of theta, then I want my lower bound to be tight. I want my lower bound to be equal to the log likelihood of theta, because that's what I need to guarantee that when I optimize my lower bound, then I'll actually do even better on the true objective function. Yeah.

Student: How do [inaudible]

Instructor (Andrew Ng): Excuse me. Yeah. Great question. How do I know that [this] function is concave? Yeah. I don't think I've shown it. It actually turns out to be true for all the models we work with. Do I know that the lower bound is a concave function of theta? I think you're right. In general, this may not be a concave function of theta. For many of the models we work with, this will turn out to be a concave function, but that's not always true. Okay.

So let me go ahead and choose a value for Q. And I'll refer back to Jensen's inequality. We said that this inequality will become an equality if the random variable inside is a constant. Right. If you're taking an expectation with respect to a constant-valued random variable.
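(Concretely, the condition being used is the following sketch: the ratio inside the expectation must not depend on z(i); c below is just my name for that constant value.)

\[
\frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)} \;=\; c
\ \ \text{for all } z^{(i)}
\quad\Longrightarrow\quad
Q_i\big(z^{(i)}\big) \;\propto\; p\big(x^{(i)}, z^{(i)};\theta\big).
\]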
So the Q_i of z(i)'s must sum to 1, and so to compute them you should just take p of x(i), z(i), parameterized by theta, and just normalize so that the sum is one. There is a step that I'm skipping here to show that this is really the right thing to do; hopefully, you'll just be convinced it's true. The actual steps that I skipped are written out in the lecture notes. So you then have that the denominator, by definition, is that, and so by the definition of conditional probability, Q_i of z(i) is just equal to p of z(i) given x(i), parameterized by theta. Okay.

And so to summarize the algorithm, the EM algorithm has two steps. In the E step, we set - we choose the distributions Q_i - so Q_i of z(i) will be set equal to p of z(i) given x(i), parameterized by theta. That's the formula we just worked out. And so by this step we've now created a lower bound on the log likelihood function that is tight at the current value of theta. And in the M step, we then optimize that lower bound with respect to our parameters theta, and specifically we set theta to the [arg max] of that lower bound over theta. Okay. And so that's the EM algorithm.
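(To restate the two steps in symbols, as a sketch of what's on the board:)

\[
\text{E step:} \quad Q_i\big(z^{(i)}\big) \;:=\; p\big(z^{(i)} \mid x^{(i)};\theta\big),
\qquad
\text{M step:} \quad \theta \;:=\; \arg\max_{\theta} \ \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
\log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}.
\]

And as a toy numerical illustration of that two-step template - this is my own sketch, not something from the board - here is EM for a 1-D mixture of two Gaussians, where the E step normalizes the joint p(x(i), z; theta) over z and the M step happens to have a closed form. The names phi, mu, sigma and the synthetic data are assumptions made only for this example.

import numpy as np

def em_gaussian_mixture(x, n_iters=100):
    m = len(x)
    # Initial guesses for theta = (phi, mu, sigma); any reasonable guess works.
    phi = np.array([0.5, 0.5])             # mixing proportions p(z)
    mu = np.array([x.min(), x.max()])      # component means
    sigma = np.array([x.std(), x.std()])   # component standard deviations

    for _ in range(n_iters):
        # E step: Q_i(z) = p(z | x_i; theta). Evaluate the joint p(x_i, z; theta)
        # for each z and normalize each row to sum to one, as described above.
        joint = np.stack([
            phi[z] * np.exp(-(x - mu[z]) ** 2 / (2.0 * sigma[z] ** 2))
                   / (np.sqrt(2.0 * np.pi) * sigma[z])
            for z in (0, 1)
        ], axis=1)                                   # shape (m, 2)
        Q = joint / joint.sum(axis=1, keepdims=True)

        # M step: maximize the lower bound sum_i sum_z Q_i(z) log p(x_i, z; theta)
        # over theta; for this Gaussian mixture the maximizer is in closed form.
        Nz = Q.sum(axis=0)
        phi = Nz / m
        mu = (Q * x[:, None]).sum(axis=0) / Nz
        sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nz)

    return phi, mu, sigma

# Usage on synthetic data drawn from two well-separated Gaussians.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
print(em_gaussian_mixture(data))

On data like this, the estimated mixing proportions and means should come out close to the values used to generate the sample, and each iteration can only increase (or leave unchanged) the log likelihood, which is the point of the lower-bound construction.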
I won't have time to do it today, but I'll probably show this in the next lecture: the EM algorithm that I wrote down for the mixture of Gaussians is actually a special case of this more general template, where its E step and M step correspond pretty much exactly to this E step and this M step that I wrote down. The E step constructs this lower bound and makes sure that it is tight at the current value of theta - that's in my choice of Q - and then the M step optimizes the lower bound with respect to the parameters theta. Okay. So lots more to say about this in the next lecture. Let's check if there's any questions before we close. No. Okay. Cool. So let's wrap up for today, and we'll continue talking about this in the next session.