This subtitle is provided from lecsub.jimdo.com under a Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue our discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, LSI. Then I want to spend just a little time talking about how to implement PCA, especially for very large problems. In particular, I'll spend a little bit of time talking about singular value decomposition, or the SVD implementation of principal component analysis. Then in the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways also manages to accomplish very different things than PCA. So with this lecture, this will actually wrap up our discussion of unsupervised learning. In the next lecture, we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal component analysis: I said that in PCA, we imagine that we have some very high-dimensional data that perhaps lies approximately on some low-dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second principal component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA was to normalize the data to zero mean and unit variance. So subtract out the mean from your training examples, so the data now has zero mean, and then normalize each of your features so that the variance of each feature is now one.

The next step was to compute the covariance matrix of your zero-mean data. So you compute it as follows:
Sigma is the sum of all the outer products, Sigma = sum over i of x^(i) (x^(i))^T, and then you find the top k eigenvectors of Sigma.

So last time we saw applications of this. For example, one of the applications was eigenfaces, where each of your training examples x^(i) is an image. So if you have 100 by 100 images, if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values. There are 10,000 pixel values in each of your 100 by 100 images. So the eigenfaces application was where the training examples comprised pictures of faces of people. Then we ran PCA, and then to measure the distance between, say, a face here and a face there, we would project both of the face images onto the subspace and then measure the distance along the subspace. In eigenfaces, you use something like 50 principal components.

So the difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000 by 10,000 matrix, which is huge; it has 100 million entries. So when we apply PCA to very, very high-dimensional data, with the goal of reducing the dimension, step two of this algorithm has you constructing this extremely large matrix, which you can't do. I'll come back to this in a second.
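For concreteness, here is a minimal numpy sketch of the basic three-step recipe just summarized, on made-up data. This is exactly the version whose covariance step becomes infeasible when the dimension is 10,000 or more; the 1/m scaling of the covariance matrix is a convention that doesn't change its eigenvectors.

```python
import numpy as np

# A minimal sketch of the three PCA steps, on made-up data.
np.random.seed(0)
X = np.random.randn(100, 5) * np.array([1.0, 3.0, 0.5, 2.0, 1.0]) + 10.0  # m=100 examples, n=5 features

# Step 1: normalize to zero mean and unit variance, feature by feature.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix, the (scaled) sum of outer products x^(i) x^(i)^T.
m = X.shape[0]
Sigma = (X.T @ X) / m                      # n x n

# Step 3: top-k eigenvectors of Sigma.
k = 2
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
order = np.argsort(eigvals)[::-1]          # eigenvalues in decreasing order
U_k = eigvecs[:, order[:k]]                # n x k matrix of principal directions

X_reduced = X @ U_k                        # data projected onto the top-k subspace
```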
It turns out one of the other frequently used applications of PCA is actually to text data. So here's what I mean. Remember our vectorial representation of emails? This is from way back when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or given a text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary. Somewhere you have the word learn, somewhere you have the word study, and so on. Depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used in lecture five or lecture six for representing text documents, when we were building Naive Bayes classifiers for spam classification.

So it turns out one of the common applications of PCA is actually to this text data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name, latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing steps. For various reasons, in LSI you usually don't normalize the mean of the data to zero, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so it does something very similar to PCA.

Normalizing the variance to one for text data would actually be a bad idea, because that would have the effect of dramatically scaling up the weight of rarely occurring words. So for example, the word aardvark hardly ever appears in any document. To normalize the variance of that feature to one, you end up scaling up the weight of the word aardvark dramatically, which isn't what you want.

So let's see. In natural language processing, something that we want to do quite often is, given two documents, x^(i) and x^(j), measure how similar they are.
So for example, I may give you a document and ask you to find me more documents like this one. You're reading some article about some news event of today and want to find what other news articles there are. So I give you a document and ask you to look at all the other documents you have in this large set of documents, and find the documents similar to this one.

So this is a typical text application. To measure the similarity between two documents x^(i) and x^(j), each of these documents is represented as one of these high-dimensional vectors. One common way to do this is to view each of your documents as a very high-dimensional vector, where the dimension of the vector is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary. One measure of the similarity between these two documents that's often used is the angle between them.

In particular, if the angle between these two vectors is small, then we'll consider the two documents to be similar. If the angle between these two vectors is large, then we consider the documents to be dissimilar. So more formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. For the values we care about, anyway, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product of the two vectors divided by the product of their norms; that's just the standard linear algebra or geometry definition of the cosine between two vectors.
Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace, by running PCA on your text data this way, the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing.

So look further at the definition of the cosine similarity measure. The numerator, the similarity between the two documents, was this inner product, which is therefore the sum over k of x_k^(i) x_k^(j). So this inner product would be equal to zero if the two documents have no words in common. This is really the sum over k of an indicator of whether documents i and j both contain the word k, because x_k^(i) indicates whether document i contains the word k, and x_k^(j) indicates whether document j contains the word k. So the product is one only if the word k appears in both documents. Therefore, the similarity between these two documents would be zero if the two documents have no words in common.

For example, suppose your document x^(i) has the word study and your document x^(j) has the word learn. Then these two documents may be considered entirely dissimilar. Say you read a news article about effective study strategies, and you ask, what other documents are similar to this? If there are a bunch of other documents about good methods to learn, but they have no words in common with this one, then the similarity measure is zero.
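Here is a small sketch of the bag-of-words cosine similarity just described; the five-word dictionary and the documents are made up purely for illustration.

```python
import numpy as np

vocab = ["aardvark", "learn", "study", "strategy", "politics"]

def bag_of_words(words):
    # 0/1 vector: does each dictionary word appear in the document?
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def cosine_similarity(x_i, x_j):
    # sim(x_i, x_j) = cos(theta) = <x_i, x_j> / (||x_i|| ||x_j||)
    return (x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

doc_study = bag_of_words({"study", "strategy"})   # e.g. "effective study strategies"
doc_learn = bag_of_words({"learn"})               # e.g. "good methods to learn"

print(cosine_similarity(doc_study, doc_study))    # 1.0: identical documents
print(cosine_similarity(doc_study, doc_learn))    # 0.0: no words in common
```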
So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word learn, and on the vertical axis I plot the word study. The values take on either zero or one. So if a document contains the word learn but not the word study, then I'll plot that document there. If a document contains neither the word study nor the word learn, then it'll be plotted at zero, zero.

So here's the cartoon behind what PCA is doing, which is that we identify a lower-dimensional subspace; that would be the eigenvector we get out of PCA. Now, suppose we have a document about learning and a document about studying. The document about learning points to the right. The document about studying points up. So the inner product, or the cosine of the angle, between these two documents will be zero. So these two documents are entirely unrelated, which is not what we want. Documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then these two documents become much closer together, and the algorithm will recognize that when you take the inner product between these two documents, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

So that's just intuition about what PCA may be doing to text data. The same thing goes for examples other than the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians.
That will also bring the documents closer together. Just any related topics end up with their points closer together in the lower-dimensional space.

Questions about this?

Student: [Inaudible].

Instructor (Andrew Ng): Which ones? This one?

Student: No, the line.

Instructor (Andrew Ng): Oh, this one. Oh, yes. Thank you. [Inaudible].

So let's talk about how to actually implement this now.

Okay. How many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you. That's a lot more than I thought. Curious. Did you guys learn it as undergrads or as graduate students? All right. Let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about this, too.

So I'll say a little bit about how to implement PCA. The problem I was alluding to just now was that when you have these very high-dimensional vectors, Sigma is a large matrix. In particular, for our text example, if the vectors x^(i) are 50,000-dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway. It turns out there's another way to implement PCA, which is this: if A is any N by N matrix, then one of the most remarkable results of linear algebra is that the matrix A can be decomposed into a singular value decomposition. What that means is that the matrix A, which is N by N, can always be decomposed into a product of three matrices: U is N by N, D is a square matrix, which is N by N, and V is also N by N.
D is going to be diagonal, with zeros on the off-diagonals, and the values sigma_i on the diagonal are called the singular values of the matrix A.

Almost all of you said you learned this as a graduate student rather than as an undergrad, and it turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions. You usually learn about the QR decomposition, maybe the LU factorization of matrices. Most undergrad courses don't get to talk about singular value decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used the SVD dozens of times in the last year, but for LU and QR decompositions, I think I used the QR decomposition once and an LU decomposition maybe once in the last year.

So let me say a bit more about this. I'm going to draw the picture, I guess. For example, if A is an N by N matrix, it can be decomposed into another matrix U, which is also N by N, the same size; D, which is N by N; and another square matrix V transpose, which is also N by N. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.

To compute it, you just use the SVD command in Matlab or Octave. The state of the art in numerical linear algebra today is that singular value decompositions of matrices can be computed extremely reliably and efficiently. We've used a package like Matlab or Octave to compute, say, the eigenvectors of a matrix.
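As a quick numerical check of the decomposition just described, here is a sketch in numpy; np.linalg.svd stands in for the Matlab/Octave svd command mentioned here, and the matrix is random made-up data.

```python
import numpy as np

np.random.seed(0)
N = 4
A = np.random.randn(N, N)

U, d, Vt = np.linalg.svd(A)                           # A = U @ diag(d) @ V^T
print(np.allclose(A, U @ np.diag(d) @ Vt))            # True
print(d)                                              # singular values, in decreasing order

# Columns of U are eigenvectors of A A^T, and columns of V are eigenvectors
# of A^T A, in both cases with eigenvalues equal to the squared singular values.
V = Vt.T
print(np.allclose(A @ A.T @ U, U @ np.diag(d**2)))    # True
print(np.allclose(A.T @ A @ V, V @ np.diag(d**2)))    # True
```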
SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed. You can compute the square root of something and just not worry about it; you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands by thousands, I think of the SVD routine as a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is implemented very well today.

Student: [Inaudible].

Instructor (Andrew Ng): What's the complexity of SVD? That's a good question. I actually don't know. I want to guess it's roughly on the order of N cubed. I'm not sure; [inaudible] algorithms, so I don't know what's known about the convergence of these algorithms.

The example I drew out was for a square matrix. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it would decompose it into a product of three matrices like that.

The nice thing about this is that we can use it to compute eigenvectors and PCA very efficiently. In particular, the covariance matrix Sigma was this: it was the sum of all the outer products. If you go back and recall the definition of the design matrix, I think I described this in lecture two, when we derived the closed-form solution to least squares. The design matrix was this matrix where I took my examples and stacked them in rows. We call this the design matrix X.
So if you construct the design matrix X, then the covariance matrix Sigma can be written as just X transpose X. That's X transpose times X. I hope you see why X transpose X gives you the sum of the outer products of the vectors. If you aren't seeing this right now, just go home and convince yourself that it's true.

To get the top k eigenvectors of Sigma, you would take Sigma and decompose it using the... excuse me. You would take the matrix X, and you would compute its SVD, so you get U D V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.

So in our example, if you have 50,000 words in your dictionary, then the design matrix would be m by 50,000; say 100 by 50,000, if you have 100 examples. So X would be quite tractable to represent and to compute the SVD of, whereas the matrix Sigma, at 50,000 by 50,000, would be much harder to represent. So this gives you an efficient way to implement PCA.

The reason I want to talk about this is that in previous years I didn't, and in the class projects I found a number of students trying to implement this on huge problems and running into difficulty. So this is a much better way to implement PCA if you have extremely high-dimensional data. If you have low-dimensional data, if you have 50- or 100-dimensional data, then computing Sigma is no problem. You can do it the old way, but otherwise, use the SVD to implement this.

Questions about this?
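To make the recipe concrete, here is a minimal numpy sketch of PCA via the SVD of the design matrix, on a tiny made-up term-document matrix (the real text example would have n around 50,000). Note that, consistent with the question raised in the exchange just below, when the examples are stacked as rows of X, the eigenvectors of X transpose X are the right singular vectors, the columns of V.

```python
import numpy as np

# PCA / LSI via the SVD of the design matrix.  Tiny made-up corpus:
# m = 6 documents over an n = 8 word vocabulary (0/1 occurrences).
np.random.seed(0)
X = (np.random.rand(6, 8) > 0.6).astype(float)     # m x n design matrix, one example per row

k = 2                                              # number of principal components

# Economy-size SVD: X = U @ diag(d) @ Vt.  Since the examples are rows of X,
# X^T X = V diag(d^2) V^T, so the top-k eigenvectors of the covariance matrix
# are the first k columns of V (i.e. the first k rows of Vt).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
directions = Vt[:k].T                              # n x k principal directions

X_lsi = X @ directions                             # documents in the k-dimensional LSI space

# Sanity check against the direct route, which would be infeasible for n ~ 50,000:
Sigma = X.T @ X                                    # n x n sum of outer products
eigvals, eigvecs = np.linalg.eigh(Sigma)
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
# `top` and `directions` agree column by column, up to sign.
```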
The last thing I want to say is that in practice, when you want to implement this, I want to add a note of caution. It turns out that for many applications of... let's see. When you apply SVD to these wide... yeah?

Student: Just a quick question. Are the top k columns of U or of V? Because X transpose X is V D D V transpose, right?

Instructor (Andrew Ng): Let's see. Oh, yes. I think you're right. I think you're right. Let's see. Is it the top k columns of U or the top k columns of V? Yeah, I think you're right. Is that right? Something bothers me about that, but I think you're right.

Student: [Inaudible], so then X transpose X should be V D D V transpose. X is U D V transpose, so X transpose X would be...

Instructor (Andrew Ng): [Inaudible]. If anyone thinks about this and has another opinion, let me know, but I think you're right. I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know?

Male Speaker: That sounds right.

Instructor (Andrew Ng): Okay. Cool. Okay. So just one last note, a note of caution. It turns out that in this example, I was implementing SVD with a wide matrix, so the matrix X was wider than it is tall. It turns out that when you find the SVD decomposition of this... let's see. Yeah, I think you're definitely right. So it turns out that when we find the SVD of this, the right-most portion of this block of the matrix would be all zeros. Also, when you compute the matrix D, a large part of this matrix would be zeros. You have the matrices U, D, V transpose. So depending on what convention you use, I think Matlab actually uses a convention of just cutting off the zero elements. So Matlab uses the convention of chopping off the right-most half of the U matrix and chopping off the bottom portion of the D matrix.
I'm not sure if this even depends on the version of Matlab, but when you call SVD in Matlab or some other numerical algebra packages, there are slightly different conventions for how to define the SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses back to the original computations. It turns out that whether you're running Matlab or you're writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices. So just make sure you figure this out for the package that you use.

Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to the questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two by two matrix. This is a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I would have a bunch of points z on a line, and then I had these ellipses that I drew. I hope you remember that picture. That was the factor analysis model, which models the density of x. Then there was PCA, just now. So the difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes to find the subspace.
So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA. Factor analysis also sort of assumes the data lies on a subspace; let me write "subspace" here. So both of these algorithms assume the data maybe lies close to or on some low-dimensional subspace, but fundamentally, factor analysis is a density estimation algorithm. So if you have some very high-dimensional distribution and you want to model P of x, then factor analysis is the algorithm I'm more inclined to use. So even though you could in theory, I would tend to avoid trying to use factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection, if you want to model P of x so that points with very low probability can be flagged as anomalies, then I would tend to use factor analysis to do that density estimation. So factor analysis and PCA are both algorithms that assume your data lies on or near a subspace.

The other class of algorithms we talked about was algorithms that assume the data lies in clumps, or that the data has a few coherent groups. So let me just fill in the rest of this picture. If you think your data lies in clumps or lies in groups, and you want to do density estimation, then I would tend to use a mixture of Gaussians algorithm. But again, if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use a k-means algorithm. I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain; there's a rough sketch of this two by two picture just below.
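Purely as an illustration of the two by two picture (the lecture itself doesn't use any particular library, and the library names here are an assumption), this is one way the four quadrants map onto scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

X = np.random.randn(200, 10)            # placeholder data

# Data near a subspace, and you just want the subspace -> PCA.
pca = PCA(n_components=3).fit(X)

# Data near a subspace, and you want a density P(x), e.g. for anomaly
# detection -> factor analysis.
fa = FactorAnalysis(n_components=3).fit(X)
log_px = fa.score_samples(X)            # unusually low values suggest anomalies

# Data in clumps, and you want a density P(x) -> mixture of Gaussians.
gmm = GaussianMixture(n_components=4).fit(X)

# Data in clumps, and you just want the groups -> k-means.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
```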
Hopefully this helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion on PCA. What I want to do next is talk about independent component analysis, or ICA. Yeah?

Student: I have a question about the upper right [inaudible]. So once you have all of the eigenvectors, to measure how similar feature i is to feature j, you pick some eigenvector, and you take some dot products between feature i and feature j and the eigenvector. But there's a lot of eigenvectors to choose from.

Instructor (Andrew Ng): Right. So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? I'm going to start this up. So the answer is really that, in this cartoon, I would avoid thinking about eigenvectors one at a time. A better way to view this cartoon is that if I decide to choose 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector. This arrow in this cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors. So what I actually do is project my data onto the span, the linear span, of the eigenvectors. Then I measure distances or take inner products between the projections of the two points onto that subspace. Okay.

So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA, we're going to try to find the independent components of variation in the data. So switch to the laptop display there, please. We'll just take a second to motivate that, once the laptop display is up. Okay.
This is actually a slide that I showed in lecture one, of the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping sets of two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this. [Recording]

Instructor (Andrew Ng): So the question is, these are really two speakers, speaking independently of each other. Each speaker is outputting a series of sound signals, independent of the other conversation going on in the room. So, this being an unsupervised learning problem, the question is: can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording]

Instructor (Andrew Ng): This is the other one. [Recording]

Instructor (Andrew Ng): Just for fun. [Inaudible]. These are audio clips I got from [inaudible]. Just for fun, let me play the other ones as well. This is overlapping microphone one. [Recording]

Instructor (Andrew Ng): Here's microphone two. [Recording]

Instructor (Andrew Ng): So given this as input, here's output one. [Recording]

Instructor (Andrew Ng): It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording]

Instructor (Andrew Ng): Okay. Switch back to the whiteboard, please.

So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you of what they are. Let's say you have a one-dimensional random variable S. So suppose you have a random variable S, and suppose it has a probability density function p_S.
Then the CDF is defined as a function F, which is the probability that the random variable S is less than or equal to the value given by that lower-case value s: F(s) = P(S ≤ s).

For example, take the Gaussian density; the density of the Gaussian, usually denoted lower-case phi, is roughly a bell-shaped density. Then the CDF of the Gaussian will look something like this; there's a function, capital Phi. So if I pick a value s like that, then the height of this function is the probability that my Gaussian random variable is less than that value there. In other words, the height of the function at that point is the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian. So another way to write F of s is as the integral from minus infinity to s of the density: F(s) = ∫ from -∞ to s of p_S(t) dt.

Something that will come up later is this: suppose I have a random variable S, and I want to model the distribution of the random variable S. One thing I could do is specify what I think the density is. Or I can specify what the CDF is. These are related by this equation: F is the integral of the density p_S. You can also recover the density by taking the CDF and taking its derivative; F prime, the derivative of the CDF, gives you back the density. So this will come up in the middle of the derivation of ICA: there will be a step where we need to assume a distribution for the random variable S. I can either specify the density for S directly, or I can specify the CDF. I choose to specify the CDF.
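As a small numerical sketch of the density/CDF relationship just described, here it is with the sigmoid, which is the example CDF mentioned next; everything else is made up for illustration.

```python
import numpy as np

# Specify a distribution through its CDF, then recover the density as the
# derivative of the CDF.  The sigmoid is one valid CDF: it increases from 0 to 1.
def sigmoid_cdf(s):
    return 1.0 / (1.0 + np.exp(-s))

def density_from_cdf(cdf, s, eps=1e-5):
    # numerical derivative F'(s) ~ (F(s + eps) - F(s - eps)) / (2 eps)
    return (cdf(s + eps) - cdf(s - eps)) / (2.0 * eps)

s = np.linspace(-8.0, 8.0, 2001)
p = density_from_cdf(sigmoid_cdf, s)

print(np.sum(p) * (s[1] - s[0]))     # ~ 1.0: the recovered density integrates to one
print(np.allclose(p, sigmoid_cdf(s) * (1.0 - sigmoid_cdf(s)), atol=1e-6))
# True: for the sigmoid, F'(s) = F(s) (1 - F(s)) in closed form.
```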
The CDF has to be some function increasing from zero to one. So you can choose any function that looks like that, and in particular, pulling functions out of a hat that look like that, you can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution, or the density, for the random variable S. So this will come up later.

Just as a quick check, raise your hand if that is familiar to you, if you've seen that before. Great.

So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources. So let's say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s_j^(i) to denote the signal from speaker j at time i. Here's what I mean. So what is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in air. So the way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and then your ear is hearing my voice by detecting those changes in air pressure. So what a microphone records, what my mouth is generating, is a pattern. I'm going to draw a cartoon, I guess, of changes in air pressure. So this is what sound is. If you look at a microphone recording, you see these roughly periodic signals that comprise changes in air pressure over time, as the air pressure goes above and below some baseline air pressure. So this is what the speech signal looks like, say. So this is speaker one. Then what I'm saying is that this is some time t.
What I'm saying is that the value at that point I'm going to denote as s_1^(t), superscript t, subscript one. Similarly, speaker two is outputting some sound wave; when we play it, it will actually sound like a single tone, I guess. In the same way, at the same time t, the value of the air pressure generated by speaker two I'll denote as s_2^(t). So we observe x^(i) = A s^(i), where these x^(i)'s are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying; each microphone records some overlapping combination of what the speakers are saying. For example, x_j^(i) is what microphone j records at time i. By the definition of matrix multiplication, this is the sum over k of A_jk s_k^(i). So what my j-th microphone is recording is some linear combination of all of the speakers: at time i, microphone j records some linear combination of what all the speakers are saying at time i, where k indexes over the n speakers. So our goal is to find the matrix W = A inverse; I'm just defining W that way, so that we can recover the original sources as a linear combination of our microphone recordings x^(i). Just as a point of notation, I'm going to write the matrix W using lower-case w subscript one, subscript two, and so on to denote the rows of this matrix W.
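To make the notation concrete, here is a tiny simulation of that mixing model. The sources and the mixing matrix A below are made up purely for illustration; the point is only that if you knew A, then W = A inverse would unmix the recordings exactly, and ICA's job is to find such a W without knowing A:

```python
# Cocktail-party mixing model: x^(i) = A s^(i), unmixed by W = A^{-1} when A is known.
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 1000                      # n speakers/microphones, m time steps

S = rng.uniform(-1, 1, size=(m, n)) # rows are s^(i): what the n speakers say at time i
A = np.array([[1.0, 0.6],           # mixing matrix, unknown to the algorithm in practice
              [0.4, 1.0]])

X = S @ A.T                         # each row is x^(i) = A s^(i): the microphone recordings

W = np.linalg.inv(A)                # if we knew A, W = A^{-1} unmixes exactly
S_rec = X @ W.T                     # s^(i) = W x^(i)
print(np.allclose(S_rec, S))        # True
```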
Let's see. So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example (this will sound like white noise; can I switch to the laptop display, please?), let's say that each of my speakers outputs uniform white noise. If that's the case, these are my axes, s1 and s2, and this is what my two speakers would be uttering: a plot of what they're uttering fills a square box if the two speakers are independently outputting uniform random variables between minus one and one. So this is a plot of s1 and s2, my original sources, and this would be a typical sample of what my microphones record; here the axes are x1 and x2. These are images I got from [inaudible] on ICA. Given a picture like this, you can sort of look at the box and tell what the axes of the parallelogram are, and you can figure out what linear transformation would transform the parallelogram back into a square box. Now, it turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: if you reflect the picture across a 45-degree axis, you still get a box. So there's no way for the algorithm to tell which was speaker number one and which was speaker number two; the numbering, or ordering, of the speakers is ambiguous. The other source of ambiguity, and these are the only ambiguities in this example, is the sign of the sources.
Given my microphone recordings, you can't tell whether you got back a positive s^(i) or a negative s^(i). In this picture, what that corresponds to is reflecting the picture along the vertical axis or along the horizontal axis: you still get a box, so you still get back a perfectly valid set of sources. So it turns out that in this example you can't guarantee that you've recovered positive s^(i) rather than negative s^(i). These are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. For the permutation of the speakers, there's not much you can do about that. And it turns out that if you take an audio source, flip its sign to get negative s, and play it back, it will sound indistinguishable. So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about it.
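Before moving on, here is a tiny sketch of why those two ambiguities are unavoidable from the data alone: if W unmixes the recordings, then so does P D W for any permutation matrix P and diagonal sign matrix D. The sources and mixing matrix are again made up for illustration:

```python
# If W is a valid unmixing matrix, P D W recovers the same sources,
# just reordered and/or sign-flipped, so the data cannot prefer one over the other.
import numpy as np

rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=(1000, 2))     # two independent uniform sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                 # made-up mixing matrix
X = S @ A.T                                # microphone recordings, rows are x^(i)

W = np.linalg.inv(A)                       # one valid unmixing matrix
P = np.array([[0, 1],
              [1, 0]])                     # swap the two speakers ...
D = np.diag([1, -1])                       # ... and flip the sign of speaker two
W_alt = P @ D @ W                          # an equally valid "unmixing" matrix

S_alt = X @ W_alt.T
print(np.allclose(S_alt[:, 0], -S[:, 1]),  # recovered source 1 is -s2
      np.allclose(S_alt[:, 1], S[:, 0]))   # recovered source 2 is  s1
```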
Let's switch back to the chalkboard, please. It turns out, and I don't want to spend too much time on this but I do want to say it briefly, that the reason those are the only sources of ambiguity (again, the ambiguities were the permutation of the speakers and the signs) is that the s_j^(i)'s were non-Gaussian. Suppose my original sources s1 and s2 were Gaussian; say s^(i) is Gaussian with mean zero and identity covariance, which just means that each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data. You will recall that the contours of a Gaussian distribution with identity covariance look like circles: the Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables, then when I observe a linear combination of them, there's actually no way to recover the original sources, because there's no way for me to tell whether the axes are at this angle or at that angle, and so on. The Gaussian is rotationally symmetric, so I would not be able to recover the orientation, the rotation, of the data. I don't want to dwell on this, but it turns out that if your sources are Gaussian, it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have, even infinitely large amounts of data, you would not be able to recover the matrix A or W. Let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Let s be a random variable in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of s is p subscript s. My microphone recording is x = A s = W^{-1} s; equivalently, s = W x. So let's think about what the density of x is. I have p_s, I know the density of s, and x is a linear combination of the s's, so let's figure out the density of x. One thing we could try is to just figure out what s is and apply the density of s to it: s is W x, so you might guess that the density of x is p_s(W x). But this is wrong. It turns out you can do this for probability mass functions, but not for continuous densities.
So in particular, it's not correct to say that the probability of x is obtained by just figuring out what s is and then applying the density of s to that. This is wrong; you can't do that with densities, because p_s is a probability density function. The right formula is the density of s applied to W x, times the determinant of the matrix W: p_x(x) = p_s(W x) |W|. Let me illustrate that with an example. Let's say the density for s is uniform over the unit interval, so the density for s is just the density of the uniform distribution on [0, 1]. Now let x be equal to two times s, so A = 2 and W = 1/2. If s is uniform over [0, 1], then x, which is two times s, will be uniform over the range from zero to two. So the density for x will be (that's one, that's two, that's one half, and that's one) the indicator that x is between zero and two, times W, which is one half. Does that make sense? When you compute the density for x, because x is now spread out across a wider range, the density of x is smaller, and that's why the density of x picks up this one-half term. This is an illustration for the one-dimensional case, where s and x are one-dimensional. I'm not going to prove it, but the generalization to vector-valued random variables is that the density of x is given by p_s(W x) times the determinant of the matrix W; what I showed over here is the one-dimensional version of that.
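Just to sanity-check that one-half factor numerically, here is a minimal sketch of the 1-D example (the sample size and bin count are arbitrary choices of mine):

```python
# s ~ Uniform[0, 1], x = 2s, W = 1/2: the formula p_x(x) = p_s(W x) * |W|
# predicts a flat density of 1/2 on the interval [0, 2].
import numpy as np

rng = np.random.default_rng(2)
s = rng.uniform(0.0, 1.0, size=200_000)
x = 2.0 * s                                          # A = 2, so W = 1/2

hist, _ = np.histogram(x, bins=20, range=(0.0, 2.0), density=True)
print(hist.round(2))                                 # every bin is about 0.5
```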
So we're nearly there; here's how I can implement ICA. My distribution on s: I'm going to assume that my density on s is given by a product over the n speakers of the density of speaker j emitting a certain sound, so p(s) is a product of densities. It's a product of distributions because I'm going to assume that the speakers are having independent conversations, so the s_j's are independent for different values of j. By the formula we just worked out, the density for x is then that product evaluated at W x, times the determinant of W: p(x) is the product over j of p_s(w_j^T x), times |W|. I'll just remind you that W was A inverse; it's the matrix I defined previously, so that s_j = w_j^T x. That's what's in there. To complete my formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. Following the discussion from right before we started deriving ICA, one thing I could do is choose the density for s, or equivalently, I could choose the CDF, the cumulative distribution function, for s. In this case I'm going to choose a CDF, probably for historical reasons and for convenience. Choosing the CDF for s means I just need to choose some function that increases from zero to one. I know I can't choose a Gaussian, because we know you can't do ICA on Gaussian data. So I need some function, increasing from zero to one, that is not the cumulative distribution function of a Gaussian. What other functions do I know that increase from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice, made for convenience.
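Concretely, under that choice each source's density is the derivative of the sigmoid, which is the logistic density g'(s) = g(s)(1 - g(s)). A small sketch (the function names are mine, not the lecture's):

```python
# With the sigmoid g as the CDF of each source, the implied density is g'(s) = g(s) (1 - g(s)).
import numpy as np

def g(s):                        # sigmoid, used as the CDF for each source
    return 1.0 / (1.0 + np.exp(-s))

def p_s(s):                      # implied density: the derivative of the CDF
    return g(s) * (1.0 - g(s))

s = np.linspace(-6, 6, 7)
eps = 1e-6
numeric_deriv = (g(s + eps) - g(s - eps)) / (2 * eps)
print(np.allclose(numeric_deriv, p_s(s), atol=1e-6))   # True: the density really is g'
```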
There's actually no great reason for choosing the sigmoid function; it's just a convenient function that we all know and are familiar with, and that happens to increase from zero to one. When you take the derivative of the sigmoid, that gives you back your density, and that density is just not Gaussian; this is the main virtue of choosing the sigmoid. So there's really no deep rationale for the choice of the sigmoid; lots of other things will work fine too. It's just a common, reasonable default. It turns out that one reason the sigmoid works well for a lot of data sources is that, compared with the Gaussian, the density you get by differentiating the sigmoid has heavier tails. By this I mean the density dies down to zero much more slowly than the Gaussian: the magnitude of the Gaussian's tails dies down as e to the minus s squared, while for the sigmoid the tails look like e to the minus s. So the tails die down as e to the minus s rather than e to the minus s squared. It turns out that most distributions with this property, heavy tails, where the distribution decays to zero relatively slowly compared with the Gaussian, will work fine for your data. One other choice you can sometimes use is the Laplacian distribution, whose density is proportional to e to the minus absolute value of s; this will work fine, too, for many data sources. Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set, and as you can see this is an unlabeled training set, I can write down the log likelihood of my parameters: a sum over my training examples of the log density of x^(i), which is the sum over the speakers j of log g'(w_j^T x^(i)), plus the log of the determinant of W. So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent.
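Written out as code, that objective looks roughly like the sketch below; the data layout (rows of X are the recordings x^(i)) and the helper names are my own choices, assuming the sigmoid-CDF model just described:

```python
# l(W) = sum_i ( sum_j log g'(w_j^T x^(i)) + log |det W| ), with g the sigmoid.
import numpy as np

def g(s):
    return 1.0 / (1.0 + np.exp(-s))

def log_likelihood(W, X):
    Z = X @ W.T                                   # row i holds W x^(i)
    log_density = np.log(g(Z) * (1.0 - g(Z)))     # log g'(w_j^T x^(i)) for every i, j
    _, logabsdet = np.linalg.slogdet(W)           # log |det W|, computed stably
    return np.sum(log_density) + X.shape[0] * logabsdet
```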
It turns out that if you work through the math, with p_s equal to the derivative of the sigmoid, and compute the gradient of the log likelihood with respect to W (you've all done this kind of calculation many times, so I won't show the details), you find that the stochastic gradient ascent update is W := W + alpha ((1 - 2 g(W x^(i))) x^(i) transpose + (W transpose) inverse). You can work that out yourself; it's just the math of taking the derivative of the log likelihood with respect to W. So to summarize: given the training set, that's my stochastic gradient ascent update rule, and you run stochastic gradient ascent to learn the parameters W. After you're done, you output s^(i) = W x^(i), and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity.
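Here is a minimal end-to-end sketch of that procedure under the same assumptions; the learning rate, number of passes, and initialization are arbitrary defaults of mine rather than anything prescribed in the lecture:

```python
# Stochastic gradient ascent for ICA with the sigmoid-CDF model:
# W := W + alpha * ((1 - 2 g(W x^(i))) x^(i)^T + (W^T)^{-1}),
# then unmix with s^(i) = W x^(i).
import numpy as np

def g(s):
    return 1.0 / (1.0 + np.exp(-s))

def ica_sgd(X, alpha=0.01, epochs=50, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)                                   # simple starting point
    for _ in range(epochs):
        for i in rng.permutation(m):                # visit the examples in random order
            x = X[i].reshape(n, 1)                  # x^(i) as a column vector
            grad = (1.0 - 2.0 * g(W @ x)) @ x.T + np.linalg.inv(W.T)
            W += alpha * grad                       # one stochastic gradient ascent step
    return W

# Usage sketch: W = ica_sgd(X); S_hat = X @ W.T then recovers the sources,
# up to the permutation and sign ambiguities discussed earlier.
```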
Okay? So just switch back to the laptop, please. We'll wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of physiological recording data, such as EEG data. This is a picture of an EEG cap, in which there are a number of electrodes you place, in this case, on Katie's scalp, where each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures just changes in voltage over time: the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to pick up these very synchronized recordings. It turns out that when we look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats, from human eye blinks, and so on. So the cartoon is this: if you imagine placing the electrodes, or microphones, on my scalp, then each microphone is recording some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body does as well, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find the independent components, the independent processes going on in my brain. This is an example of running ICA. You find that a small number of components, like those shown up there, correspond to the heartbeat; those are very periodic signals that come on occasionally and correspond to components of the heartbeat. You also find things like an eye-blink component, corresponding to the signal generated when you blink your eyes. By doing this, you can then subtract out the heartbeat and eye-blink artifacts from the data, and you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step, and a common application of ICA.
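As an aside, not something from the lecture: in practice this kind of artifact removal is often done with an off-the-shelf ICA implementation. Here is a rough sketch using scikit-learn's FastICA, which is a different ICA algorithm than the one derived above; the artifact component indices are purely hypothetical and would normally be chosen by inspecting the recovered components by eye:

```python
# Artifact removal sketch: estimate independent components, zero out the ones that
# look like heartbeat / eye blinks, and mix the rest back into channel space.
import numpy as np
from sklearn.decomposition import FastICA

def remove_artifacts(X, artifact_components):
    """X: (timesteps, channels) array of EEG-like recordings; returns cleaned recordings."""
    ica = FastICA(n_components=X.shape[1], random_state=0)
    S = ica.fit_transform(X)            # one estimated independent component per column
    S[:, artifact_components] = 0.0     # drop the artifact components
    return ica.inverse_transform(S)     # reconstruct the cleaned channel recordings

# Hypothetical usage: X_clean = remove_artifacts(X, artifact_components=[0, 3])
```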
One more example is an application, again, from [inaudible]: the result of running ICA on small patches of natural images. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the bases you get; this is a plot of the sources you get. The algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or a linear combination, of independent sources of the things that make up images. So this models natural images as being generated by independent sources that produce the different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out this is similar, in many ways, to computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges are the independent components of images. Are there quick questions, because I'm running late? Quick questions before I close? Interviewee: [Inaudible] square matrix? Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assume A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. It turns out that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low. But if you have two male speakers or two female speakers, it's beyond the current state of the art to separate them with one microphone. It's a great research problem. Okay, sorry about running late again. Let's close now, and we'll pick up with reinforcement learning in the next lecture.