This subtitle is provided from lecsub.jimdo.com under a Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue our discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, LSI. Then I want to spend just a little time talking about how to implement PCA, especially for very large problems. In particular, I'll spend a little bit of time talking about singular value decomposition, or the SVD implementation of principal component analysis. Then in the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways also manages to accomplish very different things than PCA. So with this lecture, this will actually wrap up our discussion of unsupervised learning. In the next lecture, we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal component analysis: I said that in PCA, we imagine that we have some very high-dimensional data that perhaps lies approximately on some low-dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second principal component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA was to normalize the data to zero mean and unit variance. So subtract out the mean from your training examples, so the data now has zero mean, and then normalize each of your features so that the variance of each feature is now one.

The next step was to compute the covariance matrix of your zero-mean data. So you compute it as follows:
Sigma is the sum of all the outer products, Sigma = sum over i of x^(i) (x^(i))^T, and then you find the top k eigenvectors of Sigma.

So last time we saw applications of this. For example, one of the applications was eigenfaces, where each of your training examples x^(i) is an image. So if you have 100 by 100 images, if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values. There are 10,000 pixel values in each of your 100 by 100 images. So the eigenfaces application was where the training examples comprised pictures of faces of people. Then we ran PCA, and then to measure the distance between, say, a face here and a face there, we would project both of the face images onto the subspace and then measure the distance along the subspace. In eigenfaces, you use something like 50 principal components.

So the difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000 by 10,000 matrix, which is huge; it has 100 million entries. So when we apply PCA to very, very high-dimensional data, with the goal of reducing the dimension, step two of this algorithm has you constructing this extremely large matrix, which you can't do. I'll come back to this in a second.
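For concreteness, here is a minimal numpy sketch of the basic three-step recipe just summarized, on made-up data. This is exactly the version whose covariance step becomes infeasible when the dimension is 10,000 or more; the 1/m scaling of the covariance matrix is a convention that doesn't change its eigenvectors.

```python
import numpy as np

# A minimal sketch of the three PCA steps, on made-up data.
np.random.seed(0)
X = np.random.randn(100, 5) * np.array([1.0, 3.0, 0.5, 2.0, 1.0]) + 10.0  # m=100 examples, n=5 features

# Step 1: normalize to zero mean and unit variance, feature by feature.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix, the (scaled) sum of outer products x^(i) x^(i)^T.
m = X.shape[0]
Sigma = (X.T @ X) / m                      # n x n

# Step 3: top-k eigenvectors of Sigma.
k = 2
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
order = np.argsort(eigvals)[::-1]          # eigenvalues in decreasing order
U_k = eigvecs[:, order[:k]]                # n x k matrix of principal directions

X_reduced = X @ U_k                        # data projected onto the top-k subspace
```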
It turns out one of the other frequently used applications of PCA is actually to text data. So here's what I mean. Remember our vectorial representation of emails? This is from way back when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or given a text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary. Somewhere you have the word learn, somewhere you have the word study, and so on. Depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used in lecture five or lecture six for representing text documents, when we were building Naive Bayes classifiers for spam classification.

So it turns out one of the common applications of PCA is actually to this text data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name, latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing steps. For various reasons, in LSI you usually don't normalize the mean of the data to zero, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so it does something very similar to PCA.

Normalizing the variance to one for text data would actually be a bad idea, because that would have the effect of dramatically scaling up the weight of rarely occurring words. So for example, the word aardvark hardly ever appears in any document. To normalize the variance of that feature to one, you end up scaling up the weight of the word aardvark dramatically, which isn't what you want.

So let's see. In natural language processing, something that we want to do quite often is, given two documents, x^(i) and x^(j), measure how similar they are.
So for example, I may give you a document and ask you to find me more documents like this one. You're reading some article about some news event of today and want to find what other news articles there are. So I give you a document and ask you to look at all the other documents you have in this large set of documents, and find the documents similar to this one.

So this is a typical text application. To measure the similarity between two documents x^(i) and x^(j), each of these documents is represented as one of these high-dimensional vectors. One common way to do this is to view each of your documents as a very high-dimensional vector, where the dimension of the vector is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary. One measure of the similarity between these two documents that's often used is the angle between them.

In particular, if the angle between these two vectors is small, then we'll consider the two documents to be similar. If the angle between these two vectors is large, then we consider the documents to be dissimilar. So more formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. For the values we care about, anyway, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product of the two vectors divided by the product of their norms; that's just the standard linear algebra or geometry definition of the cosine between two vectors.
Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace, by running PCA on your text data this way, the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing.

So look further at the definition of the cosine similarity measure. The numerator, the similarity between the two documents, was this inner product, which is therefore the sum over k of x_k^(i) x_k^(j). So this inner product would be equal to zero if the two documents have no words in common. This is really the sum over k of an indicator of whether documents i and j both contain the word k, because x_k^(i) indicates whether document i contains the word k, and x_k^(j) indicates whether document j contains the word k. So the product is one only if the word k appears in both documents. Therefore, the similarity between these two documents would be zero if the two documents have no words in common.

For example, suppose your document x^(i) has the word study and your document x^(j) has the word learn. Then these two documents may be considered entirely dissimilar. Say you read a news article about effective study strategies, and you ask, what other documents are similar to this? If there are a bunch of other documents about good methods to learn, but they have no words in common with this one, then the similarity measure is zero.
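Here is a small sketch of the bag-of-words cosine similarity just described; the five-word dictionary and the documents are made up purely for illustration.

```python
import numpy as np

vocab = ["aardvark", "learn", "study", "strategy", "politics"]

def bag_of_words(words):
    # 0/1 vector: does each dictionary word appear in the document?
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def cosine_similarity(x_i, x_j):
    # sim(x_i, x_j) = cos(theta) = <x_i, x_j> / (||x_i|| ||x_j||)
    return (x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

doc_study = bag_of_words({"study", "strategy"})   # e.g. "effective study strategies"
doc_learn = bag_of_words({"learn"})               # e.g. "good methods to learn"

print(cosine_similarity(doc_study, doc_study))    # 1.0: identical documents
print(cosine_similarity(doc_study, doc_learn))    # 0.0: no words in common
```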
So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word learn, and on the vertical axis I plot the word study. The values take on either zero or one. So if a document contains the word learn but not the word study, then I'll plot that document there. If a document contains neither the word study nor the word learn, then it'll be plotted at zero, zero.

So here's the cartoon behind what PCA is doing, which is that we identify a lower-dimensional subspace; that would be the eigenvector we get out of PCA. Now, suppose we have a document about learning and a document about studying. The document about learning points to the right. The document about studying points up. So the inner product, or the cosine of the angle, between these two documents will be zero. So these two documents are entirely unrelated, which is not what we want. Documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then these two documents become much closer together, and the algorithm will recognize that when you take the inner product between these two documents, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

So that's just intuition about what PCA may be doing to text data. The same thing goes for examples other than the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians.
That will also bring the documents closer together. Just any related topics end up with their points closer together in the lower-dimensional space.

Questions about this?

Student: [Inaudible].

Instructor (Andrew Ng): Which ones? This one?

Student: No, the line.

Instructor (Andrew Ng): Oh, this one. Oh, yes. Thank you. [Inaudible].

So let's talk about how to actually implement this now.

Okay. How many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you. That's a lot more than I thought. Curious. Did you guys learn it as undergrads or as graduate students? All right. Let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about this, too.

So I'll say a little bit about how to implement PCA. The problem I was alluding to just now was that when you have these very high-dimensional vectors, Sigma is a large matrix. In particular, for our text example, if the vectors x^(i) are 50,000-dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway. It turns out there's another way to implement PCA, which is this: if A is any N by N matrix, then one of the most remarkable results of linear algebra is that the matrix A can be decomposed into a singular value decomposition. What that means is that the matrix A, which is N by N, can always be decomposed into a product of three matrices: U is N by N, D is a square matrix, which is N by N, and V is also N by N.
D is going to be diagonal, with zeros on the off-diagonals, and the values sigma_i on the diagonal are called the singular values of the matrix A.

Almost all of you said you learned this as a graduate student rather than as an undergrad, and it turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions. You usually learn about the QR decomposition, maybe the LU factorization of matrices. Most undergrad courses don't get to talk about singular value decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used the SVD dozens of times in the last year, but for LU and QR decompositions, I think I used the QR decomposition once and an LU decomposition maybe once in the last year.

So let me say a bit more about this. I'm going to draw the picture, I guess. For example, if A is an N by N matrix, it can be decomposed into another matrix U, which is also N by N, the same size; D, which is N by N; and another square matrix V transpose, which is also N by N. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.

To compute it, you just use the SVD command in Matlab or Octave. The state of the art in numerical linear algebra today is that singular value decompositions of matrices can be computed extremely reliably and efficiently. We've used a package like Matlab or Octave to compute, say, the eigenvectors of a matrix.
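As a quick numerical check of the decomposition just described, here is a sketch in numpy; np.linalg.svd stands in for the Matlab/Octave svd command mentioned here, and the matrix is random made-up data.

```python
import numpy as np

np.random.seed(0)
N = 4
A = np.random.randn(N, N)

U, d, Vt = np.linalg.svd(A)                           # A = U @ diag(d) @ V^T
print(np.allclose(A, U @ np.diag(d) @ Vt))            # True
print(d)                                              # singular values, in decreasing order

# Columns of U are eigenvectors of A A^T, and columns of V are eigenvectors
# of A^T A, in both cases with eigenvalues equal to the squared singular values.
V = Vt.T
print(np.allclose(A @ A.T @ U, U @ np.diag(d**2)))    # True
print(np.allclose(A.T @ A @ V, V @ np.diag(d**2)))    # True
```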
SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed. You can compute the square root of something and just not worry about it; you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands by thousands, I think of the SVD routine as a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is implemented very well today.

Student: [Inaudible].

Instructor (Andrew Ng): What's the complexity of SVD? That's a good question. I actually don't know. I want to guess it's roughly on the order of N cubed. I'm not sure; [inaudible] algorithms, so I don't know what's known about the convergence of these algorithms.

The example I drew out was for a square matrix. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it would decompose it into a product of three matrices like that.

The nice thing about this is that we can use it to compute eigenvectors and PCA very efficiently. In particular, the covariance matrix Sigma was this: it was the sum of all the outer products. If you go back and recall the definition of the design matrix, I think I described this in lecture two, when we derived the closed-form solution to least squares. The design matrix was this matrix where I took my examples and stacked them in rows. We call this the design matrix X.
So if you construct the design matrix X, then the covariance matrix Sigma can be written as just X transpose X. That's X transpose times X. I hope you see why X transpose X gives you the sum of the outer products of the vectors. If you aren't seeing this right now, just go home and convince yourself that it's true.

To get the top k eigenvectors of Sigma, you would take Sigma and decompose it using the... excuse me. You would take the matrix X, and you would compute its SVD, so you get U D V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.

So in our example, if you have 50,000 words in your dictionary, then the design matrix would be m by 50,000; say 100 by 50,000, if you have 100 examples. So X would be quite tractable to represent and to compute the SVD of, whereas the matrix Sigma, at 50,000 by 50,000, would be much harder to represent. So this gives you an efficient way to implement PCA.

The reason I want to talk about this is that in previous years I didn't, and in the class projects I found a number of students trying to implement this on huge problems and running into difficulty. So this is a much better way to implement PCA if you have extremely high-dimensional data. If you have low-dimensional data, if you have 50- or 100-dimensional data, then computing Sigma is no problem. You can do it the old way, but otherwise, use the SVD to implement this.

Questions about this?
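To make the recipe concrete, here is a minimal numpy sketch of PCA via the SVD of the design matrix, on a tiny made-up term-document matrix (the real text example would have n around 50,000). Note that, consistent with the question raised in the exchange just below, when the examples are stacked as rows of X, the eigenvectors of X transpose X are the right singular vectors, the columns of V.

```python
import numpy as np

# PCA / LSI via the SVD of the design matrix.  Tiny made-up corpus:
# m = 6 documents over an n = 8 word vocabulary (0/1 occurrences).
np.random.seed(0)
X = (np.random.rand(6, 8) > 0.6).astype(float)     # m x n design matrix, one example per row

k = 2                                              # number of principal components

# Economy-size SVD: X = U @ diag(d) @ Vt.  Since the examples are rows of X,
# X^T X = V diag(d^2) V^T, so the top-k eigenvectors of the covariance matrix
# are the first k columns of V (i.e. the first k rows of Vt).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
directions = Vt[:k].T                              # n x k principal directions

X_lsi = X @ directions                             # documents in the k-dimensional LSI space

# Sanity check against the direct route, which would be infeasible for n ~ 50,000:
Sigma = X.T @ X                                    # n x n sum of outer products
eigvals, eigvecs = np.linalg.eigh(Sigma)
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
# `top` and `directions` agree column by column, up to sign.
```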
The last thing I want to say is that in practice, when you want to implement this, I want to add a note of caution. It turns out that for many applications of... let's see. When you apply SVD to these wide... yeah?

Student: Just a quick question. Are the top k columns of U or of V? Because X transpose X is V D D V transpose, right?

Instructor (Andrew Ng): Let's see. Oh, yes. I think you're right. I think you're right. Let's see. Is it the top k columns of U or the top k columns of V? Yeah, I think you're right. Is that right? Something bothers me about that, but I think you're right.

Student: [Inaudible], so then X transpose X should be V D D V transpose. X is U D V transpose, so X transpose X would be...

Instructor (Andrew Ng): [Inaudible]. If anyone thinks about this and has another opinion, let me know, but I think you're right. I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know?

Male Speaker: That sounds right.

Instructor (Andrew Ng): Okay. Cool. Okay. So just one last note, a note of caution. It turns out that in this example, I was implementing SVD with a wide matrix, so the matrix X was wider than it is tall. It turns out that when you find the SVD decomposition of this... let's see. Yeah, I think you're definitely right. So it turns out that when we find the SVD of this, the right-most portion of this block of the matrix would be all zeros. Also, when you compute the matrix D, a large part of this matrix would be zeros. You have the matrices U, D, V transpose. So depending on what convention you use, I think Matlab actually uses a convention of just cutting off the zero elements. So Matlab uses the convention of chopping off the right-most half of the U matrix and chopping off the bottom portion of the D matrix.
I'm not sure if this even depends on the version of Matlab, but when you call SVD in Matlab or some other numerical algebra packages, there are slightly different conventions for how to define the SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses back to the original computations. It turns out that whether you're running Matlab or you're writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices. So just make sure you figure this out for the package that you use.

Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to the questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two by two matrix. This is a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I would have a bunch of points z on a line, and then I had these ellipses that I drew. I hope you remember that picture. That was the factor analysis model, which models the density of x. Then there was PCA, just now. So the difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes to find the subspace.
So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA. Factor analysis also sort of assumes the data lies on a subspace; let me write "subspace" here. So both of these algorithms assume the data maybe lies close to or on some low-dimensional subspace, but fundamentally, factor analysis is a density estimation algorithm. So if you have some very high-dimensional distribution and you want to model P of x, then factor analysis is the algorithm I'm more inclined to use. So even though you could in theory, I would tend to avoid trying to use factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection, if you want to model P of x so that points with very low probability can be flagged as anomalies, then I would tend to use factor analysis to do that density estimation. So factor analysis and PCA are both algorithms that assume your data lies on or near a subspace.

The other class of algorithms we talked about was algorithms that assume the data lies in clumps, or that the data has a few coherent groups. So let me just fill in the rest of this picture. If you think your data lies in clumps or lies in groups, and you want to do density estimation, then I would tend to use a mixture of Gaussians algorithm. But again, if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use a k-means algorithm. I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain; there's a rough sketch of this two by two picture just below.
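Purely as an illustration of the two by two picture (the lecture itself doesn't use any particular library, and the library names here are an assumption), this is one way the four quadrants map onto scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

X = np.random.randn(200, 10)            # placeholder data

# Data near a subspace, and you just want the subspace -> PCA.
pca = PCA(n_components=3).fit(X)

# Data near a subspace, and you want a density P(x), e.g. for anomaly
# detection -> factor analysis.
fa = FactorAnalysis(n_components=3).fit(X)
log_px = fa.score_samples(X)            # unusually low values suggest anomalies

# Data in clumps, and you want a density P(x) -> mixture of Gaussians.
gmm = GaussianMixture(n_components=4).fit(X)

# Data in clumps, and you just want the groups -> k-means.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
```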
Hopefully this helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion on PCA. What I want to do next is talk about independent component analysis, or ICA. Yeah?

Student: I have a question about the upper right [inaudible]. So once you have all of the eigenvectors, to measure how similar feature i is to feature j, you pick some eigenvector, and you take some dot products between feature i and feature j and the eigenvector. But there's a lot of eigenvectors to choose from.

Instructor (Andrew Ng): Right. So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? I'm going to start this up. So the answer is really that, in this cartoon, I would avoid thinking about eigenvectors one at a time. A better way to view this cartoon is that if I decide to choose 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector. This arrow in this cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors. So what I actually do is project my data onto the span, the linear span, of the eigenvectors. Then I measure distances or take inner products between the projections of the two points onto that subspace. Okay.

So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA, we're going to try to find the independent components of variation in the data. So switch to the laptop display there, please. We'll just take a second to motivate that, once the laptop display is up. Okay.
This is actually a slide that I showed in lecture one, of the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping sets of two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this. [Recording]

Instructor (Andrew Ng): So the question is, these are really two speakers, speaking independently of each other. Each speaker is outputting a series of sound signals, independent of the other conversation going on in the room. So, this being an unsupervised learning problem, the question is: can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording]

Instructor (Andrew Ng): This is the other one. [Recording]

Instructor (Andrew Ng): Just for fun. [Inaudible]. These are audio clips I got from [inaudible]. Just for fun, let me play the other ones as well. This is overlapping microphone one. [Recording]

Instructor (Andrew Ng): Here's microphone two. [Recording]

Instructor (Andrew Ng): So given this as input, here's output one. [Recording]

Instructor (Andrew Ng): It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording]

Instructor (Andrew Ng): Okay. Switch back to the whiteboard, please.

So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you of what they are. Let's say you have a one-dimensional random variable S. So suppose you have a random variable S, and suppose it has a probability density function p_S.
Then the CDF is defined as a function F, which is the probability that the random variable S is less than or equal to the value given by that lower-case value s: F(s) = P(S ≤ s).

For example, take the Gaussian density; the density of the Gaussian, usually denoted lower-case phi, is roughly a bell-shaped density. Then the CDF of the Gaussian will look something like this; there's a function, capital Phi. So if I pick a value s like that, then the height of this function is the probability that my Gaussian random variable is less than that value there. In other words, the height of the function at that point is the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian. So another way to write F of s is as the integral from minus infinity to s of the density: F(s) = ∫ from -∞ to s of p_S(t) dt.

Something that will come up later is this: suppose I have a random variable S, and I want to model the distribution of the random variable S. One thing I could do is specify what I think the density is. Or I can specify what the CDF is. These are related by this equation: F is the integral of the density p_S. You can also recover the density by taking the CDF and taking its derivative; F prime, the derivative of the CDF, gives you back the density. So this will come up in the middle of the derivation of ICA: there will be a step where we need to assume a distribution for the random variable S. I can either specify the density for S directly, or I can specify the CDF. I choose to specify the CDF.
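As a small numerical sketch of the density/CDF relationship just described, here it is with the sigmoid, which is the example CDF mentioned next; everything else is made up for illustration.

```python
import numpy as np

# Specify a distribution through its CDF, then recover the density as the
# derivative of the CDF.  The sigmoid is one valid CDF: it increases from 0 to 1.
def sigmoid_cdf(s):
    return 1.0 / (1.0 + np.exp(-s))

def density_from_cdf(cdf, s, eps=1e-5):
    # numerical derivative F'(s) ~ (F(s + eps) - F(s - eps)) / (2 eps)
    return (cdf(s + eps) - cdf(s - eps)) / (2.0 * eps)

s = np.linspace(-8.0, 8.0, 2001)
p = density_from_cdf(sigmoid_cdf, s)

print(np.sum(p) * (s[1] - s[0]))     # ~ 1.0: the recovered density integrates to one
print(np.allclose(p, sigmoid_cdf(s) * (1.0 - sigmoid_cdf(s)), atol=1e-6))
# True: for the sigmoid, F'(s) = F(s) (1 - F(s)) in closed form.
```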
The CDF has to be some function increasing from zero to one. So you can choose any function that looks like that, and in particular, pulling functions out of a hat that look like that, you can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution, or the density, for the random variable S. So this will come up later.

Just as a quick check, raise your hand if that is familiar to you, if you've seen that before. Great.

So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources. So let's say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s_j^(i) to denote the signal from speaker j at time i. Here's what I mean. So what is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in air. So the way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and then your ear is hearing my voice by detecting those changes in air pressure. So what a microphone records, what my mouth is generating, is a pattern. I'm going to draw a cartoon, I guess, of changes in air pressure. So this is what sound is. If you look at a microphone recording, you see these roughly periodic signals that comprise changes in air pressure over time, as the air pressure goes above and below some baseline air pressure. So this is what the speech signal looks like, say. So this is speaker one. Then what I'm saying is that this is some time t.
What I'm saying is that the value at that point I'm going to denote as s_1^(t), superscript t, subscript one. Similarly, speaker two is outputting some sound wave; when we play it, it will actually sound like a single tone, I guess. In the same way, at the same time t, the value of the air pressure generated by speaker two I'll denote as s_2^(t). So we observe x^(i) = A s^(i), where these x^(i)'s are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying; each microphone records some overlapping combination of what the speakers are saying. For example, x_j^(i) is what microphone j records at time i. By the definition of matrix multiplication, this is the sum over k of A_jk s_k^(i). So what my j-th microphone is recording is some linear combination of all of the speakers: at time i, microphone j records some linear combination of what all the speakers are saying at time i, where k indexes over the n speakers. So our goal is to find the matrix W = A inverse; I'm just defining W that way, so that we can recover the original sources as a linear combination of our microphone recordings x^(i). Just as a point of notation, I'm going to write the matrix W using lower-case w subscript one, subscript two, and so on to denote the rows of this matrix W.
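To make the notation concrete, here is a tiny simulation of that mixing model. The sources and the mixing matrix A below are made up purely for illustration; the point is only that if you knew A, then W = A inverse would unmix the recordings exactly, and ICA's job is to find such a W without knowing A:

```python
# Cocktail-party mixing model: x^(i) = A s^(i), unmixed by W = A^{-1} when A is known.
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 1000                      # n speakers/microphones, m time steps

S = rng.uniform(-1, 1, size=(m, n)) # rows are s^(i): what the n speakers say at time i
A = np.array([[1.0, 0.6],           # mixing matrix, unknown to the algorithm in practice
              [0.4, 1.0]])

X = S @ A.T                         # each row is x^(i) = A s^(i): the microphone recordings

W = np.linalg.inv(A)                # if we knew A, W = A^{-1} unmixes exactly
S_rec = X @ W.T                     # s^(i) = W x^(i)
print(np.allclose(S_rec, S))        # True
```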
Let's see. So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example (this will sound like white noise; can I switch to the laptop display, please?), let's say that each of my speakers outputs uniform white noise. If that's the case, these are my axes, s1 and s2, and this is what my two speakers would be uttering: a plot of what they're uttering fills a square box if the two speakers are independently outputting uniform random variables between minus one and one. So this is a plot of s1 and s2, my original sources, and this would be a typical sample of what my microphones record; here the axes are x1 and x2. These are images I got from [inaudible] on ICA. Given a picture like this, you can sort of look at the box and tell what the axes of the parallelogram are, and you can figure out what linear transformation would transform the parallelogram back into a square box. Now, it turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: if you reflect the picture across a 45-degree axis, you still get a box. So there's no way for the algorithm to tell which was speaker number one and which was speaker number two; the numbering, or ordering, of the speakers is ambiguous. The other source of ambiguity, and these are the only ambiguities in this example, is the sign of the sources.
Given my microphone recordings, you can't tell whether you got back a positive s^(i) or a negative s^(i). In this picture, what that corresponds to is reflecting the picture along the vertical axis or along the horizontal axis: you still get a box, so you still get back a perfectly valid set of sources. So it turns out that in this example you can't guarantee that you've recovered positive s^(i) rather than negative s^(i). These are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. For the permutation of the speakers, there's not much you can do about that. And it turns out that if you take an audio source, flip its sign to get negative s, and play it back, it will sound indistinguishable. So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about it.
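Before moving on, here is a tiny sketch of why those two ambiguities are unavoidable from the data alone: if W unmixes the recordings, then so does P D W for any permutation matrix P and diagonal sign matrix D. The sources and mixing matrix are again made up for illustration:

```python
# If W is a valid unmixing matrix, P D W recovers the same sources,
# just reordered and/or sign-flipped, so the data cannot prefer one over the other.
import numpy as np

rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=(1000, 2))     # two independent uniform sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                 # made-up mixing matrix
X = S @ A.T                                # microphone recordings, rows are x^(i)

W = np.linalg.inv(A)                       # one valid unmixing matrix
P = np.array([[0, 1],
              [1, 0]])                     # swap the two speakers ...
D = np.diag([1, -1])                       # ... and flip the sign of speaker two
W_alt = P @ D @ W                          # an equally valid "unmixing" matrix

S_alt = X @ W_alt.T
print(np.allclose(S_alt[:, 0], -S[:, 1]),  # recovered source 1 is -s2
      np.allclose(S_alt[:, 1], S[:, 0]))   # recovered source 2 is  s1
```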
Let's switch back to the chalkboard, please. It turns out, and I don't want to spend too much time on this but I do want to say it briefly, that the reason those are the only sources of ambiguity (again, the ambiguities were the permutation of the speakers and the signs) is that the s_j^(i)'s were non-Gaussian. Suppose my original sources s1 and s2 were Gaussian; say s^(i) is Gaussian with mean zero and identity covariance, which just means that each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data. You will recall that the contours of a Gaussian distribution with identity covariance look like circles: the Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables, then when I observe a linear combination of them, there's actually no way to recover the original sources, because there's no way for me to tell whether the axes are at this angle or at that angle, and so on. The Gaussian is rotationally symmetric, so I would not be able to recover the orientation, the rotation, of the data. I don't want to dwell on this, but it turns out that if your sources are Gaussian, it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have, even infinitely large amounts of data, you would not be able to recover the matrix A or W. Let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Let s be a random variable in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of s is p subscript s. My microphone recording is x = A s = W^{-1} s; equivalently, s = W x. So let's think about what the density of x is. I have p_s, I know the density of s, and x is a linear combination of the s's, so let's figure out the density of x. One thing we could try is to just figure out what s is and apply the density of s to it: s is W x, so you might guess that the density of x is p_s(W x). But this is wrong. It turns out you can do this for probability mass functions, but not for continuous densities.
So in particular, it's not correct to say that the probability of x is obtained by just figuring out what s is and then applying the density of s to that. This is wrong; you can't do that with densities, because p_s is a probability density function. The right formula is the density of s applied to W x, times the determinant of the matrix W: p_x(x) = p_s(W x) |W|. Let me illustrate that with an example. Let's say the density for s is uniform over the unit interval, so the density for s is just the density of the uniform distribution on [0, 1]. Now let x be equal to two times s, so A = 2 and W = 1/2. If s is uniform over [0, 1], then x, which is two times s, will be uniform over the range from zero to two. So the density for x will be (that's one, that's two, that's one half, and that's one) the indicator that x is between zero and two, times W, which is one half. Does that make sense? When you compute the density for x, because x is now spread out across a wider range, the density of x is smaller, and that's why the density of x picks up this one-half term. This is an illustration for the one-dimensional case, where s and x are one-dimensional. I'm not going to prove it, but the generalization to vector-valued random variables is that the density of x is given by p_s(W x) times the determinant of the matrix W; what I showed over here is the one-dimensional version of that.
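Just to sanity-check that one-half factor numerically, here is a minimal sketch of the 1-D example (the sample size and bin count are arbitrary choices of mine):

```python
# s ~ Uniform[0, 1], x = 2s, W = 1/2: the formula p_x(x) = p_s(W x) * |W|
# predicts a flat density of 1/2 on the interval [0, 2].
import numpy as np

rng = np.random.default_rng(2)
s = rng.uniform(0.0, 1.0, size=200_000)
x = 2.0 * s                                          # A = 2, so W = 1/2

hist, _ = np.histogram(x, bins=20, range=(0.0, 2.0), density=True)
print(hist.round(2))                                 # every bin is about 0.5
```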
So we're nearly there; here's how I can implement ICA. My distribution on s: I'm going to assume that my density on s is given by a product over the n speakers of the density of speaker j emitting a certain sound, so p(s) is a product of densities. It's a product of distributions because I'm going to assume that the speakers are having independent conversations, so the s_j's are independent for different values of j. By the formula we just worked out, the density for x is then that product evaluated at W x, times the determinant of W: p(x) is the product over j of p_s(w_j^T x), times |W|. I'll just remind you that W was A inverse; it's the matrix I defined previously, so that s_j = w_j^T x. That's what's in there. To complete my formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. Following the discussion from right before we started deriving ICA, one thing I could do is choose the density for s, or equivalently, I could choose the CDF, the cumulative distribution function, for s. In this case I'm going to choose a CDF, probably for historical reasons and for convenience. Choosing the CDF for s means I just need to choose some function that increases from zero to one. I know I can't choose a Gaussian, because we know you can't do ICA on Gaussian data. So I need some function, increasing from zero to one, that is not the cumulative distribution function of a Gaussian. What other functions do I know that increase from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice, made for convenience.
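Concretely, under that choice each source's density is the derivative of the sigmoid, which is the logistic density g'(s) = g(s)(1 - g(s)). A small sketch (the function names are mine, not the lecture's):

```python
# With the sigmoid g as the CDF of each source, the implied density is g'(s) = g(s) (1 - g(s)).
import numpy as np

def g(s):                        # sigmoid, used as the CDF for each source
    return 1.0 / (1.0 + np.exp(-s))

def p_s(s):                      # implied density: the derivative of the CDF
    return g(s) * (1.0 - g(s))

s = np.linspace(-6, 6, 7)
eps = 1e-6
numeric_deriv = (g(s + eps) - g(s - eps)) / (2 * eps)
print(np.allclose(numeric_deriv, p_s(s), atol=1e-6))   # True: the density really is g'
```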
There's actually no great reason for choosing the sigmoid function; it's just a convenient function that we all know and are familiar with, and that happens to increase from zero to one. When you take the derivative of the sigmoid, that gives you back your density, and that density is just not Gaussian; this is the main virtue of choosing the sigmoid. So there's really no deep rationale for the choice of the sigmoid; lots of other things will work fine too. It's just a common, reasonable default. It turns out that one reason the sigmoid works well for a lot of data sources is that, compared with the Gaussian, the density you get by differentiating the sigmoid has heavier tails. By this I mean the density dies down to zero much more slowly than the Gaussian: the magnitude of the Gaussian's tails dies down as e to the minus s squared, while for the sigmoid the tails look like e to the minus s. So the tails die down as e to the minus s rather than e to the minus s squared. It turns out that most distributions with this property, heavy tails, where the distribution decays to zero relatively slowly compared with the Gaussian, will work fine for your data. One other choice you can sometimes use is the Laplacian distribution, whose density is proportional to e to the minus absolute value of s; this will work fine, too, for many data sources. Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set, and as you can see this is an unlabeled training set, I can write down the log likelihood of my parameters: a sum over my training examples of the log density of x^(i), which is the sum over the speakers j of log g'(w_j^T x^(i)), plus the log of the determinant of W. So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent.
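Written out as code, that objective looks roughly like the sketch below; the data layout (rows of X are the recordings x^(i)) and the helper names are my own choices, assuming the sigmoid-CDF model just described:

```python
# l(W) = sum_i ( sum_j log g'(w_j^T x^(i)) + log |det W| ), with g the sigmoid.
import numpy as np

def g(s):
    return 1.0 / (1.0 + np.exp(-s))

def log_likelihood(W, X):
    Z = X @ W.T                                   # row i holds W x^(i)
    log_density = np.log(g(Z) * (1.0 - g(Z)))     # log g'(w_j^T x^(i)) for every i, j
    _, logabsdet = np.linalg.slogdet(W)           # log |det W|, computed stably
    return np.sum(log_density) + X.shape[0] * logabsdet
```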
It turns out that if you work through the math, with p_s equal to the derivative of the sigmoid, and compute the gradient of the log likelihood with respect to W (you've all done this kind of calculation many times, so I won't show the details), you find that the stochastic gradient ascent update is W := W + alpha ((1 - 2 g(W x^(i))) x^(i) transpose + (W transpose) inverse). You can work that out yourself; it's just the math of taking the derivative of the log likelihood with respect to W. So to summarize: given the training set, that's my stochastic gradient ascent update rule, and you run stochastic gradient ascent to learn the parameters W. After you're done, you output s^(i) = W x^(i), and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity.
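Here is a minimal end-to-end sketch of that procedure under the same assumptions; the learning rate, number of passes, and initialization are arbitrary defaults of mine rather than anything prescribed in the lecture:

```python
# Stochastic gradient ascent for ICA with the sigmoid-CDF model:
# W := W + alpha * ((1 - 2 g(W x^(i))) x^(i)^T + (W^T)^{-1}),
# then unmix with s^(i) = W x^(i).
import numpy as np

def g(s):
    return 1.0 / (1.0 + np.exp(-s))

def ica_sgd(X, alpha=0.01, epochs=50, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)                                   # simple starting point
    for _ in range(epochs):
        for i in rng.permutation(m):                # visit the examples in random order
            x = X[i].reshape(n, 1)                  # x^(i) as a column vector
            grad = (1.0 - 2.0 * g(W @ x)) @ x.T + np.linalg.inv(W.T)
            W += alpha * grad                       # one stochastic gradient ascent step
    return W

# Usage sketch: W = ica_sgd(X); S_hat = X @ W.T then recovers the sources,
# up to the permutation and sign ambiguities discussed earlier.
```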
Okay? So just switch back to the laptop, please. We'll wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of physiological recording data, such as EEG data. This is a picture of an EEG cap, in which there are a number of electrodes you place, in this case, on Katie's scalp, where each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures just changes in voltage over time: the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to pick up these very synchronized recordings. It turns out that when we look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats, from human eye blinks, and so on. So the cartoon is this: if you imagine placing the electrodes, or microphones, on my scalp, then each microphone is recording some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body does as well, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find the independent components, the independent processes going on in my brain. This is an example of running ICA. You find that a small number of components, like those shown up there, correspond to the heartbeat; those are very periodic signals that come on occasionally and correspond to components of the heartbeat. You also find things like an eye-blink component, corresponding to the signal generated when you blink your eyes. By doing this, you can then subtract out the heartbeat and eye-blink artifacts from the data, and you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step, and a common application of ICA.
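As an aside, not something from the lecture: in practice this kind of artifact removal is often done with an off-the-shelf ICA implementation. Here is a rough sketch using scikit-learn's FastICA, which is a different ICA algorithm than the one derived above; the artifact component indices are purely hypothetical and would normally be chosen by inspecting the recovered components by eye:

```python
# Artifact removal sketch: estimate independent components, zero out the ones that
# look like heartbeat / eye blinks, and mix the rest back into channel space.
import numpy as np
from sklearn.decomposition import FastICA

def remove_artifacts(X, artifact_components):
    """X: (timesteps, channels) array of EEG-like recordings; returns cleaned recordings."""
    ica = FastICA(n_components=X.shape[1], random_state=0)
    S = ica.fit_transform(X)            # one estimated independent component per column
    S[:, artifact_components] = 0.0     # drop the artifact components
    return ica.inverse_transform(S)     # reconstruct the cleaned channel recordings

# Hypothetical usage: X_clean = remove_artifacts(X, artifact_components=[0, 3])
```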
One more example is an application, again, from [inaudible]: the result of running ICA on small patches of natural images. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the bases you get; this is a plot of the sources you get. The algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or a linear combination, of independent sources of the things that make up images. So this models natural images as being generated by independent sources that produce the different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out this is similar, in many ways, to computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges are the independent components of images. Are there quick questions, because I'm running late? Quick questions before I close? Interviewee: [Inaudible] square matrix? Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assume A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. It turns out that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low. But if you have two male speakers or two female speakers, it's beyond the current state of the art to separate them with one microphone. It's a great research problem. Okay, sorry about running late again. Let's close now, and we'll pick up with reinforcement learning in the next lecture.