https://www.youtube.com/watch?v=WW3ZJHPwvyg&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=17 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 01:14 PHILIPPE RIGOLLET: --bunch of x's and a bunch of y's. 01:17 The y's were univariate, just one real 01:20 valued random variable. 01:21 And the x's were vectors that described a bunch of attributes 01:24 for each of our individuals or each of our observations. 01:27 Let's assume now that we're given essentially only the x's. 01:30 This is sometimes referred to as unsupervised learning. 01:33 There is just the x's. 01:35 Usually, supervision is done by the y's. 01:38 And so what you're trying to do is to make sense of this data. 01:41 You're going to try to understand this data, 01:43 represent this data, visualize this data, 01:47 try to understand something, right? 01:48 So, if I give you a d-dimensional random vector, 01:52 and you're going to have n independent copies 01:54 of this individual-- of this random vector, OK? 01:57 So you will see that I'm going to have-- 01:59 I'm going to very quickly run into some limitations 02:02 about what I can actually draw on the board 02:04 because I'm using [? boldface ?] here. 02:05 I'm also going to use the blackboard [? boldface. ?] 02:08 So it's going to be a bit difficult. 02:09 So tell me if you're actually a little confused by what 02:15 is a vector, what is a number, and what is a matrix. 02:17 But we'll get there. 02:19 So I have X in Rd, and that's a random vector. 02:22 02:26 And I have X1 to Xn that are IID. 02:30 They're independent copies of X. OK, 02:37 so you can think of those as being-- 02:40 the realizations of these guys are 02:41 going to be a cloud of n points in R to the d. 02:51 And we're going to think of d as being fairly large. 02:54 And for this to start to make sense, 02:55 we're going to think of d as being at least 4, OK? 02:59 And meaning that you're going to have a hard time 03:01 visualizing those things. 03:03 If it was 3 or 2, you would be able to draw these points. 03:06 And that's pretty much as much sense as 03:08 you're going to be making about those guys, 03:09 just looking at the [INAUDIBLE] 03:12 All right, so I'm going to write each of those X's, right? 03:16 So this vector, X, has d coordinates. 03:20 And I'm going to write them as X1 to Xd. 03:25 03:30 And I'm going to stack them into a matrix, OK? 03:34 So once I have those guys, I'm going to have a matrix. 03:38 But here, I'm going to use the double bar. 03:40 And it's X1 transpose, Xn transpose. 03:47 So what it means is that the coordinates of this guy, 03:51 of course, are X1,1. 03:53 Here, I have-- 03:54 I'm of size d, so I have X1d. 03:57 And here, I have Xn1. 04:01 Xnd. 04:02 And so the i-th, j-th-- 04:06 the entry in the i-th row and j-th column of the matrix is Xij, right-- 04:10 the j-th coordinate of the i-th observation. 04:12 04:23 OK, so each-- so the rows here are the observations. 04:28 And the columns are the covariates, or attributes. 04:32 OK? 04:32 So this is an n by d matrix. 04:34 04:39 All right, this is really just some bookkeeping. 04:41 How do we store this data somehow?
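If it helps to see this bookkeeping concretely, here is a minimal sketch in Python/NumPy (my choice; the lecture itself only mentions MATLAB later, and the sample numbers and names below are made up): n observations of a d-dimensional vector stacked as the rows of an n-by-d array.

```python
import numpy as np

# Hypothetical example: n = 5 observations of a d = 3 dimensional random vector.
# Rows are the observations X_1, ..., X_n; columns are the covariates (attributes).
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))   # the n-by-d data matrix ("double-bar X" on the board)

print(X.shape)   # (5, 3)
print(X[0])      # first observation X_1, a vector in R^d
print(X[:, 2])   # third covariate across all n observations
```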
04:43 And the fact that we use a matrix just like for regression 04:46 is going to be convenient because we're going to able 04:48 to talk about projections-- 04:50 going to be able to talk about things like this. 04:53 All right, so everything I'm going to say now 04:56 is about variances or covariances 04:59 of those things, which means that I need two moments, OK? 05:01 If the variance does not exist, there's 05:03 nothing I can say about this problem. 05:05 So I'm going to assume that the variance exists. 05:07 And one way to just put it to say 05:09 that the two norm of those guys is 05:12 finite, which is another way to say that each of them 05:15 is finite. 05:15 I mean, you can think of it the way you want. 05:18 All right, so now, the mean of X, right? 05:21 So I have a random vector. 05:22 So I can talk about the expectation of X. 05:26 That's a vector that's in Rd. 05:29 And that's just taking the expectation entrywise. 05:33 Sorry. 05:34 05:42 X1, Xd. 05:45 OK, so I should say it out loud. 05:49 For this, the purpose of this class, 05:51 I will denote by subscripts the indices that 05:55 corresponds to observations. 05:57 And superscripts, the indices that correspond to 06:02 coordinates of a variable. 06:04 And I think that's the same convention that we 06:07 took for the regression case. 06:10 Of course, you could use whatever you want. 06:12 If you want to put commas, et cetera, 06:13 it becomes just a bit more complicated. 06:16 All right, and so now, once I have this, 06:18 so this tells me where my cloud of point is centered, right? 06:21 So if I have a bunch of points-- 06:24 OK, so now I have a distribution on Rd, 06:27 so maybe I should talk about this-- 06:29 I'll talk about this when we talk 06:31 about the empirical version. 06:32 But if you think that you have, say, 06:34 a two-dimensional Gaussian random variable, 06:36 then you have a center in two dimension, which 06:38 is where it peaks, basically. 06:41 And that's what we're talking about here. 06:43 But the other thing we want to know 06:44 is how much does it spread in every direction, right? 06:47 So in every direction of the two dimensional thing, 06:49 I can then try to understand how much spread I'm getting. 06:52 And the way you measure this is by using covariance, right? 06:54 So the covariance matrix, sigma-- 07:02 that's a matrix which is d by d. 07:05 And it records-- in the j, k-th entry, 07:08 it records the covariance between the j-th coordinate 07:10 of X and the k-th coordinate of X, OK? 07:13 So with entries-- 07:14 07:21 OK, so I have sigma, which is sigma 1,1, sigma dd, sigma 1d, 07:30 sigma d1. 07:31 07:34 OK, and here I have sigma jk And sigma jk 07:39 is just the covariance between Xj, the j-th coordinate 07:48 and the k-th coordinate. 07:52 OK? 07:52 So in particular, it's symmetric because the covariance 07:55 between Xj and Xk is the same as the covariance between Xk 07:57 and Xj. 07:58 I should not put those parentheses here. 08:01 I do not use them in this, OK? 08:05 Just the covariance matrix. 08:06 So that's just something that records everything. 08:09 And so what's nice about the covariance matrix 08:10 is that if I actually give you X as a vector, 08:13 you actually can build the matrix just 08:15 by looking at vectors times vectors transpose, 08:18 rather than actually thinking about building 08:20 it coordinate by coordinate. 
08:21 So for example, if you're used to using MATLAB, 08:23 that's the way you want to build a covariance matrix 08:26 because MATLAB is good at manipulating vectors 08:29 and matrices rather than just entering it entry by entry. 08:33 OK, so, right? 08:34 So, what is the covariance between Xj and Xk? 08:42 Well by definition, it's the expectation of Xj and Xk 08:51 minus the expectation of Xj times the expectation of Xk, 09:01 right? 09:01 That's the definition of the covariance. 09:03 I hope everybody's seeing that. 09:05 And so, in particular, I can actually 09:08 see that this thing can be written as-- 09:10 sigma can now be written as the expectation 09:14 of XX transpose minus the expectation of X 09:21 times the expectation of X transpose. 09:25 Why? 09:26 Well, let's look at the jk-th coefficient of this guy, right? 09:29 So here, if I look at the jk-th coefficient, I see what? 09:35 Well, I see that it's the expectation 09:38 of XX transpose jk, which is equal to the expectation of XX 09:50 transpose jk. 09:53 And what are the entries of XX transpose? 09:56 Well, they're of the form, Xj times Xk exactly. 10:00 So this is actually equal to the expectation of Xj times Xk. 10:02 10:09 And this is actually not the way I want to write it. 10:11 I want to write it-- 10:12 10:15 OK? 10:16 Is that clear? 10:17 That when I have a rank 1 matrix of this form, XX transpose, 10:20 the entries are of this form, right? 10:21 Because if I take-- 10:23 for example, think about x, y, z, and then 10:28 I multiply by x, y, z. 10:32 What I'm getting here is x-- 10:36 maybe I should actually use indices here. 10:40 x1, x2, x3. 10:42 x1, x2, x3. 10:44 The entries are x1x1, x1x2, x1x3; x2x1, x2x2, x2x3; x3x1, 10:57 x3x2, x3x3, OK? 11:04 So indeed, this is exactly of the form if you look at jk, 11:08 you get exactly Xj times Xk, OK? 11:12 So that's the beauty of those matrices. 11:15 So now, once I have this, I can do exactly the same thing, 11:19 except that here, if I take the jk-th entry, 11:23 I will get exactly the same thing, 11:25 except that it's not going to be the expectation of the product, 11:27 but the product of the expectation, right? 11:29 So I get that the jk-th entry of E of X, E of X transpose, 11:36 is just the j-th entry of E of X times the k-th entry of E of X. 11:48 So if I put those two together, it's actually telling me 11:52 that if I look at the j, k-th entry of sigma, 11:56 which I called little sigma jk, then 11:59 this is actually equal to what? 12:01 It's equal to the first term minus the second term. 12:04 The first term is the expectation of Xj, Xk 12:11 minus the expectation of Xj, expectation of Xk, which-- 12:18 oh, by the way, I forgot to say this is actually 12:20 equal to the expectation of Xj times the expectation of Xk 12:26 because that's just the definition of the expectation 12:28 of random vectors. 12:28 So my j and my k are now inside. 12:31 And that's by definition the covariance between Xj and Xk, 12:37 OK? 12:39 So just if you've seen those manipulations between vectors, 12:43 hopefully you're bored out of your mind. 12:45 And if you have not, then that's something 12:47 you just need to get comfortable with, right? 12:51 So one thing that's going to be useful 12:52 is to know very quickly what's called 12:55 the outer product of a vector with itself, which 12:57 is the vector of times the vector transpose, what 12:59 the entries of these things are. 13:01 And that's what we've been using on this second set of boards. 
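To see this outer-product bookkeeping numerically, here is a small sketch (Python/NumPy rather than the MATLAB mentioned above, and the particular Sigma and sample sizes are made up). It checks that x x-transpose has entries x_j x_k, and approximates Sigma = E[XX^T] - E[X]E[X]^T by replacing expectations with averages over simulated draws, anticipating the empirical version discussed a few minutes later.

```python
import numpy as np

rng = np.random.default_rng(1)

# A single realization x in R^3: the outer product x x^T is the rank-1
# matrix whose (j, k) entry is x_j * x_k.
x = rng.normal(size=3)
assert np.allclose(np.outer(x, x), x[:, None] * x[None, :])

# Approximate Sigma = E[X X^T] - E[X] E[X]^T by Monte Carlo averages.
d = 3
A = rng.normal(size=(d, d))
Sigma_true = A @ A.T                      # a valid covariance matrix (made up)
Xs = rng.multivariate_normal(mean=np.zeros(d), cov=Sigma_true, size=100_000)

EXX = (Xs[:, :, None] * Xs[:, None, :]).mean(axis=0)   # average of X X^T over draws
EX = Xs.mean(axis=0)
Sigma_hat = EXX - np.outer(EX, EX)

print(np.round(Sigma_true, 2))
print(np.round(Sigma_hat, 2))             # close to Sigma_true
```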
13:06 OK, so everybody agrees now that we've 13:08 sort of showed that the covariance matrix can 13:11 be written in this vector form. 13:14 So expectation of XX transpose minus expectation 13:17 of X, expectation of X transpose. 13:19 13:22 OK, just like the covariance can be written in two ways, 13:28 right we know that the covariance can also 13:30 be written as the expectation of Xj minus expectation of Xj 13:39 times Xk minus expectation of Xk, right? 13:45 That's the-- sometimes, this is the original definition 13:50 of covariance. 13:50 This is the second definition of covariance. 13:52 Just like you have the variance which 13:54 is the expectation of the square of X minus c of X, 13:57 or the expectation X squared minus the expectation of X 14:00 squared. 14:01 It's the same thing for covariance. 14:03 And you can actually see this in terms of vectors, right? 14:11 So this actually implies that you can also rewrite sigma 14:14 as the expectation of X minus expectation of X 14:21 times the same thing transpose. 14:23 14:32 Right? 14:32 And the reason is because if you just distribute those guys, 14:35 this is just the expectation of XX transpose 14:43 minus X, expectation of X transpose minus expectation 14:54 of XX transpose. 14:59 And then I have plus expectation of X, 15:03 expectation of X transpose. 15:05 15:09 Now, things could go wrong because the main difference 15:13 between matrices slash vectors and numbers is 15:18 that multiplication does not commute, right? 15:21 So in particular, those two things are not the same thing. 15:25 And so that's the main difference that we have before, 15:27 but it actually does not matter for our problem. 15:30 It's because what's happening is that if when 15:32 I take the expectation of this guy, then 15:34 it's actually the same as the expectation of this guy, OK? 15:38 And so just because the expectation is linear-- 15:43 15:48 so what we have is that sigma now 15:50 becomes equal to the expectation of XX transpose 15:55 minus the expectation of X, expectation 15:59 of X transpose minus expectation of X, 16:03 expectation of X transpose. 16:07 And then I have-- 16:10 well, really, what I have is this guy. 16:14 And then I have plus the expectation 16:15 of X, expectation of X transpose. 16:19 16:23 And now, those three things are actually equal to each other 16:28 just because the expectation of X transpose 16:30 is the same as the expectation of X transpose. 16:34 And so what I'm left with is just 16:35 the expectation of XX transpose minus the expectation of X, 16:44 expectation of X transpose, OK? 16:49 So same thing that's happening when 16:51 you want to prove that you can write 16:53 the covariance either this way or that way. 16:57 The same thing happens for matrices, or for vectors, 17:00 right, or a covariance matrix. 17:02 They go together. 17:04 Is there any questions so far? 17:05 And if you have some, please tell me, because I want to-- 17:09 I don't know to which extent you guys are comfortable with this 17:12 at all or not. 17:13 17:16 OK, so let's move on. 17:19 All right, so of course, this is what 17:23 I'm describing in terms of the distribution right here. 17:26 I took expectations. 17:28 Covariances are also expectations. 17:30 So those depend on some distribution of X, right? 17:32 If I wanted to compute that, I would basically 17:34 need to know what the distribution of X is. 
17:36 Now, we're doing statistics, so I 17:37 need to [INAUDIBLE] my question is going to be to say, well, 17:41 how well can I estimate the covariance matrix itself, 17:44 or some properties of this covariance matrix 17:47 based on data? 17:48 All right, so if I want to understand 17:50 what my covariance matrix looks like based on data, 17:52 I'm going to have to basically form 17:54 its empirical counterparts, which 17:57 I can do by doing the age-old statistical trick, which 18:02 is replace your expectation by an average, all right? 18:04 So let's just-- everything that's on the board, 18:06 you see expectation, just replace it by an average. 18:09 OK, so, now I'm going to be given X1, Xn. 18:14 So, I'm going to define the empirical mean. 18:16 18:19 OK so, really, the idea is take your expectation 18:22 and replace it by 1 over n sum, right? 18:24 And so the empirical mean is just 1 over n. 18:28 Some of the Xi's-- 18:31 I'm guessing everybody knows how to average vectors. 18:34 It's just the average of the coordinates. 18:36 So I will write this as X bar. 18:39 And the empirical covariance matrix, often called 18:51 sample covariance matrix, hence the notation, S. 18:57 Well, this is my covariance matrix, right? 18:59 Let's just replace the expectations by averages. 19:02 1 over n, sum from i equal 1 to n, of Xi, Xi transpose, minus-- 19:12 this is the expectation of X. I will replace it 19:14 by the average, which I just called X bar, X bar transpose, 19:21 OK? 19:22 And that's when I want to use the-- 19:25 that's when I want to use the notation-- 19:28 the second definition, but I could actually 19:30 do exactly the same thing using this definition here. 19:35 Sorry, using this definition right here. 19:38 So this is actually 1 over n, sum from i 19:42 equal 1 to n, of Xi minus X bar, Xi minus X bar transpose. 19:55 And those are actually-- 19:56 I mean, in a way, it looks like I 19:58 could define two different estimators, 19:59 but you can actually check. 20:01 And I do encourage you to do this. 20:03 If you're not comfortable making those manipulations, 20:05 you can actually check that those two things are actually 20:08 exactly the same, OK? 20:15 20:20 So now, I'm going to want to talk about matrices, OK? 20:25 And remember, we defined this big matrix, X, 20:27 with the double bar. 20:28 And the question is, can I express 20:31 both X bar and the sample covariance matrix 20:35 in terms of this big matrix, X? 20:37 Because right now, it's still expressed 20:39 in terms of the vectors. 20:40 I'm summing those vectors, vectors transpose. 20:43 The question is, can I just do that in a very compact way, 20:46 in a way that I can actually remove this sum term, 20:50 all right? 20:50 That's going to be the goal. 20:52 I mean, that's not a notational goal. 20:54 That's really something that we want-- 20:58 that's going to be convenient for us 20:59 just like it was convenient to talk about matrices when 21:02 we did linear regression. 21:04 21:23 OK, X bar. 21:26 We just said it's 1 over n, sum from I equal 1 to n 21:30 of Xi, right? 21:32 Now remember, what does this matrix look like? 21:35 We said that X bar-- 21:39 X is this guy. 21:40 So if I look at X transpose, the columns of this guy 21:45 becomes X1, my first observation, X2, 21:51 my second observation, all the way to Xn, my last observation, 21:54 right? 21:56 Agreed? 21:56 That's what X transpose is. 21:58 So if I want to sum those guys, I 22:00 can multiply by the all-ones vector. 
22:02 22:06 All right, so that's what the definition of the all-ones 1 22:08 vector is. 22:09 22:11 Well, it's just a bunch of 1's in Rn, in this case. 22:19 And so when I do X transpose 1, what I get is just the sum from 22:23 i equal 1 to n of the Xi's. 22:27 So if I divide by n, I get my average, OK? 22:36 So here, I definitely removed the sum term. 22:43 Let's see if with the covariance matrix, we can do the same. 22:47 Well, and that's actually a little more difficult to see, 22:53 I guess. 22:55 But let's use this definition for S, OK? 23:05 And one thing that's actually going to be-- 23:07 so, let's see for one second, what-- 23:10 so it's going to be something that involves X, 23:12 multiplying X with itself, OK? 23:14 And the question is, is it going to be 23:15 multiplying X with X transpose, or X tranpose with X? 23:19 To answer this question, you can go 23:20 the easy route, which says, well, my covariance matrix is 23:23 of size, what? 23:24 What is the size of S? 23:27 AUDIENCE: d by d. 23:28 PHILIPPE RIGOLLET: d by d, OK? 23:30 X is of size n by d. 23:34 So if I do X times X transpose, I'm 23:35 going to have something which is of size n by n. 23:37 If I do X transpose X, I'm going to have 23:39 something which is d by d. 23:40 That's the easy route. 23:41 And there's basically one of the two guys. 23:44 You can actually open the box a little bit 23:46 and see what's going on in there. 23:48 If you do X transpose X, which we know gives you a d by d, 23:52 you'll see that X is going to have vectors that 23:54 are of the form, Xi, and X transpose 23:57 is going to have vectors that are of the form, Xi transpose, 24:02 right? 24:03 And so, this is actually probably the right way to go. 24:06 So let's look at what's X transpose X is giving us. 24:11 So I claim that it's actually going to give us what we want, 24:16 but rather than actually going there, let's-- 24:19 to actually-- I mean, we could check it entry by entry, 24:22 but there's actually a nice thing we can do. 24:25 Before we go there, let's write X transpose 24:28 as the following sum of variables, X1 and then 24:33 just a bunch of 0's everywhere else. 24:36 So it's still d by n. 24:39 So n minus 1 of the columns are equal to 0 here. 24:42 Then I'm going to put a 0 and then put X2. 24:45 And then just a bunch of 0's, right? 24:48 So that's just 0, 0 plus 0, 0, all the way to Xn, OK? 24:59 Everybody agrees with it? 25:01 See what I'm doing here? 25:03 I'm just splitting it into a sum of matrices that 25:06 only have one nonzero columns. 25:08 But clearly, that's true. 25:11 Now let's look at the product of this guy with itself. 25:15 So, let's call these matrices M1, M2, Mn. 25:23 25:26 So when I do X transpose X, what I 25:30 do is the sum of the Mi's for i equal 1 to n, 25:37 times the sum of the Mi transpose, right? 25:48 Now, the sum of the Mi's transpose 25:50 is just the sum of each of the Mi's transpose, OK? 25:55 25:58 So now I just have this product of two sums, 26:00 so I'm just going to re-index the second one by j. 26:03 So this is sum for i equal 1 to n, j equal 1 to n of Mi 26:12 Mj transpose. 26:15 OK? 26:16 26:19 And now what we want to notice is 26:20 that if i is different from j, what's happening? 26:26 Well if i is different from j, let's look at say, M1 times XM2 26:34 transpose. 26:35 26:54 So what is the product between those two matrices? 26:56 27:04 AUDIENCE: It's a new entry and [INAUDIBLE] 27:09 PHILIPPE RIGOLLET: There's an entry? 27:11 AUDIENCE: Well, it's an entry. 
27:12 It's like a dot product in that form next to [? transpose. ?] 27:17 PHILIPPE RIGOLLET: You mean a dot product is just getting 27:19 [INAUDIBLE] number, right? 27:20 So I want-- this is going to be a matrix. 27:22 It's the product of two matrices, right? 27:24 This is a matrix times a matrix. 27:27 So this should be a matrix, right, of size d by d. 27:31 27:35 Yeah, I should see a lot of hands 27:37 that look like this, right? 27:39 Because look at this. 27:40 So let's multiply the first-- 27:42 let's look at what's going on in the first column here. 27:45 I'm multiplying this column with each of those rows. 27:48 The only nonzero coefficient is here, 27:50 and it only hits this column of 0's. 27:54 So every time, this is going to give you 0, 0, 0, 0. 27:57 And it's going to be the same for every single one of them. 28:00 So this matrix is just full of 0's, right? 28:04 They never hit each other when I do 28:06 the matrix-matrix multiplication. 28:08 There's no-- every non-zero hits a 0. 28:11 So what it means is-- and this, of course, 28:13 you can check for every i different from j. 28:16 So this means that Mi times Mj transpose is actually 28:22 equal to 0 when i is different from j, Right? 28:27 Everybody is OK with this? 28:29 So what that means is that when I do this double sum, really, 28:32 it's a simple sum. 28:33 There's only just the sum from i equal 1 28:37 to n of Mi Mi transpose. 28:41 Because this is the only terms in this double sum 28:44 that are not going to be 0 when [INAUDIBLE] [? M1 ?] with M1 28:48 itself. 28:50 Now, let's see what's going on when 28:51 I do M1 times M1 transpose. 28:53 Well, now, if I do Mi times and Mi transpose, 28:57 now this guy becomes [? X1 ?] [INAUDIBLE] it's here. 29:00 And so now, I really have X1 times X1 transpose. 29:03 So this is really just the sum from i 29:06 equal 1 to n of Xi Xi transpose, just because Mi Mi transpose 29:20 is Xi Xi transpose. 29:21 There's nothing else there. 29:22 29:26 So that's the good news, right? 29:28 This term here is really just X transpose X divided by n. 29:37 29:43 OK, I can use that guy again, I guess. 29:45 Well, no. 29:46 Let's just-- OK, so let me rewrite S. 30:08 All right, that's the definition we have. 30:10 And we know that this guy already is equal to 1 over n X 30:14 transpose X. x bar x bar transpose-- 30:20 we know that x bar-- we just proved that x bar-- 30:25 sorry, little x bar was equal to 1 30:31 over n X bar transpose times the all-ones vector. 30:36 So I'm just going to do that. 30:37 So that's just going to be minus. 30:39 I'm going to pull my two 1 over n's-- 30:40 one from this guy, one from this guy. 30:42 So I'm going to get 1 over n squared. 30:44 And then I'm going to get X bar-- 30:47 sorry, there's no X bar here. 30:48 It's just X. Yeah. 30:50 X transpose all ones times X transpose all ones transpose, 30:59 right? 31:00 31:04 And X transpose all ones transpose-- 31:07 31:11 right, the rule-- if I have A times B transpose, 31:14 it's B transpose times A transpose, right? 31:16 31:23 That's just the rule of transposition. 31:25 So this is 1 transpose X transpose. 31:31 And so when I put all these guys together, 31:34 this is actually equal to 1 over n X transpose X minus one 31:38 over n squared X transpose 1, 1 transpose X. Because X 31:47 transpose transposes X, OK? 31:50 31:53 So now, I can actually-- 31:55 I have something which is of the form, X transpose X-- 31:59 [INAUDIBLE] to the left, X transpose; to the right, X. 
32:01 Here, I have X transpose to the left, X to the right. 32:04 So it can factor out whatever's in there. 32:07 So I can write S as 1 over n-- 32:11 sorry, X transpose times 1 over n times the identity of Rd. 32:17 32:21 And then I have minus 1 over n, 1, 1 transpose X. 32:33 OK, because if you-- 32:34 I mean, you can distribute it back, right? 32:36 So here, I'm going to get what? 32:38 X transpose identity times X, the whole thing divided by n. 32:41 That's this term. 32:42 And then the second one is going to be-- sorry, 1 over n 32:45 squared. 32:46 And then I'm going to get 1 over n squared times X transpose 1, 32:50 1 transpose which is this guy, times X, 32:53 and that's the [? right ?] [? thing, ?] OK? 32:58 So, the way it's written, I factored out one of the 1 over 33:01 n's. 33:02 So I'm just going to do the same thing as on this slide. 33:05 So I'm just factoring out this 1 over n here. 33:08 So it's 1 over n times X transpose identity 33:16 of our d divided by n divided by 1 this time, 33:21 minus 1 over n 1, 1 transpose times X, OK? 33:26 So that's just what's on the slides. 33:28 33:31 What does the matrix, 1, 1 transpose, look like? 33:35 AUDIENCE: All 1's. 33:36 PHILIPPE RIGOLLET: It's just all 1's, right? 33:38 Because the entries are the products of the all-ones-- 33:41 of the coordinates of the all-ones vectors with 33:42 the coordinates of the all-ones vectors, so I only get 1's. 33:45 So it's a d by d matrix with only 1's. 33:49 So this matrix, I can actually write exactly, right? 33:52 H, this matrix that I called H which 33:55 is what's sandwiched in-between this X transpose and X. 33:59 By definition, I said this is the definition of H. Then 34:02 this thing, I can write its coordinates exactly. 34:06 34:18 We know it's identity divided by n minus-- 34:23 sorry, I don't know why I keep [INAUDIBLE].. 34:25 Minus 1 over n 1, 1 transpose-- 34:29 so it's this matrix with the only 1's 34:30 on the diagonals and 0's and elsewhere-- minus a matrix that 34:34 only has 1 over n everywhere. 34:36 34:41 OK, so the whole thing is 1 minus 1 over n on the diagonals 34:49 and then minus 1 over n here, OK? 34:57 And now I claim that this matrix is an orthogonal projector. 35:01 Now, I'm writing this, but it's completely useless. 35:05 This is just a way for you to see that it's actually very 35:08 convenient now to think about this problem 35:11 as being a matrix problem, because things 35:14 are much nicer when you think about the actual form 35:17 of your matrices, right? 35:18 They could tell you, here is the matrix. 35:21 I mean, imagine you're sitting at a midterm, 35:23 and I say, here's the matrix that has 1 minus 1 35:25 over n on the diagonals and minus 1 over n 35:28 on the [INAUDIBLE] diagonal. 35:30 Prove to me that it's a projector matrix. 35:32 You're going to have to basically 35:34 take this guy times itself. 35:35 It's going to be really complicated, right? 35:37 So we know it's symmetric. 35:38 That's for sure. 35:39 But the fact that it has this particular way 35:42 of writing it is going to make my life 35:44 super easy to check this. 35:45 That's the definition of a projector. 35:47 It has to be symmetric and it has 35:48 to square to itself because we just 35:51 said in the chapter on linear regression 35:54 that once you project, if you apply the projection again, 35:57 you're not moving because you're already there. 35:59 OK, so why is H squared equal to H? 36:04 Well let's just write H square. 
36:05 It's the identity minus 1 over n 1, 1 36:09 transpose times the identity minus 1 over n 1, 1 36:16 transpose, right? 36:19 Let's just expand this now. 36:22 This is equal to the identity minus-- 36:25 well, the identity times 1, 1 transpose is just the identity. 36:29 So it's 1, 1 transpose, sorry. 36:31 So 1 over n 1, 1 transpose minus 1 over n 1, 1 transpose. 36:38 And then there's going to be what 36:40 makes the deal is that I get this 1 over n 36:42 squared this time. 36:44 And then I get the product of 1 over n trans-- 36:46 oh, let's write it completely. 36:48 I get 1, 1 transpose times 1, 1 transpose, OK? 36:58 But this thing here-- 37:01 what is this? 37:03 n, right, is the end product of the all-ones vector 37:06 with the all-ones vector. 37:07 So I'm just summing n times 1 squared, which is n. 37:10 So this is equal to n. 37:11 So I pull it out, cancel one of the ends, 37:13 and I'm back to what I had before. 37:15 So I had identity minus 2 over n 1, 1 transpose plus 1 37:21 over n 1, 1 transpose which is equal to H. 37:27 Because one of the 1 over n's cancel, OK? 37:30 37:36 So it's a projection matrix. 37:37 It's projecting onto some linear space, right? 37:41 It's taking a matrix. 37:42 Sorry, it's taking a vector and it's 37:44 projecting onto a certain space of vectors. 37:46 37:49 What is this space? 37:50 37:53 Right, so, how do you-- so I'm only 37:54 asking the answer to this question in words, right? 37:57 So how would you describe the vectors 37:59 onto which this matrix is projecting? 38:02 Well, if you want to answer this question, 38:05 the way you would tackle it is first by saying, OK, 38:07 what does a vector which is of the form, H times something, 38:13 look like, right? 38:14 What can I say about this vector that's 38:16 going to be definitely giving me something 38:19 about the space on which it projects? 38:21 I need to know a little more to know that it projects exactly 38:24 onto this. 38:25 But one way we can do this is just 38:29 see how it acts on a vector. 38:30 What does it do to a vector to apply H, right? 38:32 So I take v. And let's see what taking v and applying H to it 38:44 looks like. 38:46 Well, it's the identity minus something. 38:48 So it takes v and it removes something 38:50 from v. What does it remove? 38:54 Well, it's 1 over n times v transpose 1 times 39:00 the all-ones vector, right? 39:03 Agreed? 39:04 I just wrote v transpose 1 instead of 1 transpose v, 39:13 which are the same thing. 39:16 What is this thing? 39:17 39:25 What should I call it in mathematical notation? 39:27 39:30 v bar, right? 39:31 I should all it v bar because this is exactly the average 39:35 of the entries of v, agreed? 39:38 This is summing the entries of v's, and this is dividing 39:41 by the number of those v's. 39:43 Sorry, now v is in our-- 39:44 39:49 sorry, why do I divide by-- 39:51 39:53 I'm just-- OK, I need to check what my dimensions are now. 39:59 No, it's in Rd, right? 40:00 So why do I divide by n? 40:02 40:05 So it's not really v bar. 40:07 It's the sum of the v's divided by-- 40:13 right, so it's v bar. 40:14 40:24 AUDIENCE: [INAUDIBLE] 40:25 [INTERPOSING VOICES] 40:25 AUDIENCE: Yeah, v has to be [INAUDIBLE] 40:27 PHILIPPE RIGOLLET: Oh, yeah. 40:29 OK, thank you. 40:31 So everywhere I wrote Hd, that was actually Hn. 40:34 Oh, man. 40:35 I wish I had a computer now. 40:37 All right. 40:37 So-- yeah, because the-- 40:43 yeah, right? 
40:43 So why it's not-- 40:45 well, why I thought it was this is because I was thinking 40:48 about the outer dimension of X, really 40:49 of X transpose, which is really the inner dimension, 40:51 didn't matter to me, right? 40:52 So the thing that I can sandwich between X transpose 40:55 and X has to be n by n. 40:56 So this was actually n by n. 40:58 And so that's actually n by n. 41:00 Everything is n by n. 41:03 Sorry about that. 41:04 41:08 So this is n. 41:09 This is n. 41:10 This is-- well, I didn't really tell you 41:12 what the all-ones vector was, but it's also in our n. 41:16 Yeah, OK. 41:18 41:22 Thank you. 41:23 And n-- actually, I used the fact that this was of size n 41:27 here already. 41:28 41:31 OK, and so that's indeed v bar. 41:33 41:38 So what is this projection doing to a vector? 41:40 41:47 It's removing its average on each coordinate, right? 41:51 And the effect of this is that v is a vector. 41:54 What is the average of Hv? 41:58 AUDIENCE: 0. 41:59 PHILIPPE RIGOLLET: Right, so it's 0. 42:00 It's the average of v, which is v bar, minus the average 42:04 of something that only has v bar's entry, which is v bar. 42:07 So this thing is actually 0. 42:08 42:11 So let me repeat my question. 42:12 Onto what subspace does H project? 42:15 42:22 Onto the subspace of vectors that have mean 0. 42:26 A vector that has mean 0 is a vector. 42:30 So if you want to talk more linear algebra, v bar-- 42:34 for a vector you have mean 0, it means 42:36 that v is orthogonal to the span of the all-ones vector. 42:43 That's it. 42:44 It projects to this space. 42:46 So in words, it projects onto the space 42:47 of vectors that have 0 mean. 42:49 In linear algebra, it says it projects 42:52 onto the hyperplane which is orthogonal 42:55 to the all-ones vector, OK? 42:58 So that's all. 43:01 Can you guys still see the screen? 43:04 Are you good over there? 43:05 OK. 43:07 All right, so now, what it means is that, well, I'm 43:12 doing this weird thing, right? 43:13 I'm taking the inner product-- 43:15 so S is taking X. And then it's removing its mean of each 43:20 of the columns of X, right? 43:21 When I take H times X, I'm basically applying this 43:24 projection which consists in removing the mean of all 43:26 the X's. 43:28 And then I multiply by H transpose. 43:31 But what's actually nice is that, remember, 43:33 H is a projector. 43:35 Sorry, I don't want to keep that. 43:38 Which means that when I look at X transpose HX, 43:47 it's the same as looking at X transpose H squared X. 43:52 But since H is equal to its transpose, 43:54 this is actually the same as looking at X transpose H 43:58 transpose HX, which is the same as looking at HX transpose 44:07 HX, OK? 44:11 So what it's doing, it's first applying this projection 44:14 matrix, H, which removes the mean of each of your columns, 44:18 and then looks at the inner products between those guys, 44:23 right? 44:23 Each entry of this guy is just the covariance 44:25 between those centered things. 44:27 That's all it's doing. 44:28 All right, so those are actually going to be the key statements. 44:35 So everything we've done so far is really 44:37 mainly linear algebra, right? 44:38 I mean, looking at expectations and covariances was just-- 44:41 we just used the fact that the expectation was linear. 44:44 We didn't do much. 44:45 But now there's a nice thing that's happening. 
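Since several matrix identities have just been assembled (X bar = (1/n) X^T 1, the two forms of S, and the matrix H sandwiched between X^T and X), here is a small numerical sanity check, a sketch assuming Python/NumPy with made-up data rather than anything from the course materials.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4
X = rng.normal(size=(n, d))                 # hypothetical n-by-d data matrix

H = np.eye(n) - np.ones((n, n)) / n         # H = I_n - (1/n) 1 1^T

# H is an orthogonal projector: symmetric, and H @ H equals H.
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)

# H v removes the average of v from every coordinate, so H v has mean 0.
v = rng.normal(size=n)
assert np.allclose(H @ v, v - v.mean())
assert np.isclose((H @ v).mean(), 0.0)

# Sample covariance: (1/n) X^T H X = (1/n) (HX)^T (HX), which matches the
# "center each column, then average the outer products" form.
S = X.T @ H @ X / n
Xc = X - X.mean(axis=0)                     # same effect as H @ X
assert np.allclose(S, Xc.T @ Xc / n)
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))  # NumPy's 1/n version
print(np.round(S, 3))
```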
44:47 And that's why we're going to switch 44:50 from the language of linear algebra 44:51 to more statistical, because what's happening 44:53 is that if I look at this quadratic form, right? 44:57 So I take sigma. 44:59 So I take a vector, u. 45:00 45:03 And I'm going to look at u-- so let's say, in Rd. 45:09 And I'm going to look at u transpose sigma u. 45:14 OK? 45:15 45:18 What is this doing? 45:19 Well, we know that u transpose sigma u is equal to what? 45:24 Well, sigma is the expectation of XX transpose 45:31 minus the expectation of X expectation of X transpose, 45:35 right? 45:36 45:39 So I just substitute in there. 45:40 45:46 Now, u is deterministic. 45:49 So in particular, I can push it inside the expectation 45:52 here, agreed? 45:55 And I can do the same from the right. 45:57 So here, when I push u transpose here, and u here, 46:00 what I'm left with is the expectation of u transpose X 46:06 times X transpose u. 46:09 OK? 46:11 And now, I can do the same thing for this guy. 46:14 And this tells me that this is the expectation of u transpose 46:17 X times the expectation of X transpose u. 46:21 46:24 Of course, u transpose X is equal to X transpose u. 46:29 And u-- yeah. 46:31 So what it means is that this is actually 46:33 equal to the expectation of u transpose X squared 46:43 minus the expectation of u transpose X, 46:48 the whole thing squared. 46:49 46:56 But this is something that should look familiar. 46:58 This is really just the variance of this particular random 47:01 variable which is of the form, u transpose X, 47:03 right? u transpose X is a number. 47:06 It involves a random vector, so it's a random variable. 47:10 And so it has a variance. 47:11 And this variance is exactly given by this formula. 47:15 So this is just the variance of u transpose X. 47:19 So what we've proved is that if I look at this guy, 47:21 this is really just the variance of u transpose X, OK? 47:29 47:37 I can do the same thing for the sample variance. 47:40 So let's do this. 47:41 47:48 And as you can see, spoiler alert, 47:52 this is going to be the sample variance. 47:56 47:59 OK, so remember, S is 1 over n, sum of Xi Xi transpose minus X 48:09 bar X bar transpose. 48:12 So when I do u transpose, Su, what 48:16 it gives me is 1 over n sum from i equal 1 48:19 to n of u transpose Xi times Xi transpose u, all right? 48:25 So those are two numbers that multiply each other 48:27 and that happen to be equal to each other, 48:30 minus u transpose X bar X bar transpose u, 48:36 which is also the product of two numbers that happen 48:38 to be equal to each other. 48:39 So I can rewrite this with squares. 48:41 48:55 So we're almost there. 48:57 All I need to know to check is that this thing is actually 49:00 the average of those guys, right? 49:02 So u transpose X bar. 49:04 What is it? 49:05 It's 1 over n sum from i equal 1 to n of u transpose Xi. 49:10 So it's really something that I can write as u transpose X bar, 49:17 right? 49:17 That's the average of those random variables 49:19 of the form, u transpose Xi. 49:21 49:23 So what it means is that u transpose Su, I can write as 1 49:29 over n sum from i equal 1 to n of u transpose Xi squared 49:38 minus u transpose X bar squared, which 49:46 is the empirical variance that we need noted by small 49:51 s squared, right? 49:54 So that's the empirical variance of u transpose X1 all the way 50:06 to u transpose Xn. 50:08 50:12 OK, and here, same thing. 50:13 I use exactly the same thing. 
50:15 I just use the fact that here, the only thing I use is really 50:17 the linearity of this guy, of 1 over n sum 50:20 or the linearity of expectation, that I can push things 50:24 in there, OK? 50:26 50:30 AUDIENCE: So what you have written 50:31 at the end of that sum for uT Su? 50:33 PHILIPPE RIGOLLET: This one? 50:35 AUDIENCE: Yeah. 50:35 PHILIPPE RIGOLLET: Yeah, I said it's equal to small s, 50:37 and I want to make a difference between the big S 50:39 that I'm using here. 50:40 So this is equal to small-- 50:42 I don't know, I'm trying to make it look 50:45 like a calligraphic s squared. 50:47 50:56 OK, so this is nice, right? 51:00 This covariance matrix-- so let's look at capital sigma 51:04 itself right now. 51:05 This covariance matrix, we know that if we 51:07 read its entries, what we get is the covariance 51:11 between the coordinates of the X's, right, 51:15 of the random vector, X. And the coordinates, well, 51:19 by definition, are attached to a coordinate system. 51:22 So I only know what the covariance 51:25 of X in of those two things are, or the covariance of those two 51:30 things are. 51:31 But what if I want to find coordinates between linear 51:33 combination of the X's? 51:35 Sorry, if I want to find covariances between linear 51:37 combination of those X's. 51:38 And that's exactly what this allows me to do. 51:40 It says, well, if I pre- and post-multiply by u, 51:44 this is actually telling me what the variance 51:47 of X along direction u is, OK? 51:51 So there's a lot of information in there, 51:53 and it's just really exploiting the fact 51:55 that there is some linearity going on in the covariance. 52:00 So, why variance? 52:02 Why is variance interesting for us, right? 52:03 Why? 52:04 I started by saying, here, we're going 52:05 to be interested in having something 52:07 to do dimension reduction. 52:08 We have-- think of your points as [? being in a ?] dimension 52:10 larger than 4, and we're going to try to reduce the dimension. 52:13 So let's just think for one second, 52:15 what do we want about a dimension reduction procedure? 52:19 If I have all my points that live in, say, three dimensions, 52:23 and I have one point here and one point here 52:25 and one point here and one point here and one point here, 52:28 and I decide to project them onto some plane-- 52:30 that I take a plane that's just like this, what's 52:32 going to happen is that those points are all going to project 52:34 to the same point, right? 52:36 I'm just going to not see anything. 52:38 However, if I take a plane which is like this, 52:40 they're all going to project into some nice line. 52:42 Maybe I can even project them onto a line 52:44 and they will still be far apart from each other. 52:47 So that's what you want. 52:48 You want to be able to say, when I take my points 52:51 and I say I project them onto lower dimensions, 52:54 I do not want them to collapse into one single point. 52:57 I want them to be spread as possible in the direction 53:00 on which I project. 53:02 And this is what we're going to try to do. 53:04 And of course, measuring spread between points 53:06 can be done in many ways, right? 53:08 I mean, you could look at, I don't know, 53:09 sum of pairwise distances between those guys. 53:12 You could look at some sort of energy. 53:14 You can look at many ways to measure 53:16 of spread in a direction. 53:18 But variance is a good way to measure 53:19 of spread between points. 
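To tie the quadratic form back to the spread of the projected points, here is one more small check, again a sketch with NumPy and made-up data: u^T S u coincides with the empirical variance of the n numbers u^T X_1, ..., u^T X_n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # hypothetical correlated data

# Sample covariance with the 1/n convention used in the lecture.
Xbar = X.mean(axis=0)
S = (X - Xbar).T @ (X - Xbar) / n

# A unit-norm direction u.
u = rng.normal(size=d)
u /= np.linalg.norm(u)

# u^T S u equals the empirical variance of the projections u^T X_i.
proj = X @ u                                    # the n numbers u^T X_1, ..., u^T X_n
emp_var = np.mean(proj ** 2) - np.mean(proj) ** 2
assert np.isclose(u @ S @ u, emp_var)
print(u @ S @ u, emp_var)
```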
53:21 If you have a lot of variance between your points, 53:23 then chances are they're going to be spread. 53:25 Now, this is not always the case, right? 53:27 If I have a direction in which all my points are clumped 53:30 onto one big point and one other big point, 53:33 it's going to choose this because that's 53:34 the direction that has a lot of variance. 53:37 But hopefully, the variance is going 53:39 to spread things out nicely. 53:41 So the idea of principal component analysis 53:47 is going to try to identify those variances-- 53:51 those directions along which we have a lot of variance. 53:55 Reciprocally, we're going to try to eliminate 53:57 the directions along which we do not have a lot of variance, OK? 54:01 And let's see why. 54:02 Well, if-- so here's the first claim. 54:08 If you transpose Su is equal to 0, what's happening? 54:14 Well, I know that an empirical variance is equal to 0. 54:17 What does it mean for an empirical variance 54:18 to be equal to 0? 54:22 So I give you a bunch of points, right? 54:23 So those points are those points-- u transpose 54:26 X1, u transpose-- those are a bunch of numbers. 54:29 What does it mean to have the empirical variance 54:31 of those points being equal to 0? 54:33 AUDIENCE: They're all the same. 54:34 PHILIPPE RIGOLLET: They're all the same. 54:36 So what it means is that when I have my points, right? 54:43 So, can you find a direction for those points in which they 54:46 project to all the same point? 54:48 54:51 No, right? 54:52 There's no such thing. 54:53 For this to happen, you have to have your points which 54:55 are perfectly aligned. 54:57 And then when you're going to project 54:59 onto the orthogonal of this guy, they're 55:01 going to all project to the same point 55:03 here, which means that the empirical variance is 55:06 going to be 0. 55:08 Now, this is an extreme case. 55:10 This will never happen in practice, 55:11 because if that happens, well, I mean, 55:13 you can basically figure that out very quickly. 55:16 So in the same way, it's very unlikely 55:21 that you're going to have u transpose sigma u, which 55:23 is equal to 0, which means that, essentially, all 55:26 your points are [INAUDIBLE] or let's say all of them 55:28 are orthogonal to u, right? 55:30 So it's exactly the same thing. 55:31 It just says that in the population case, 55:33 there's no probability that your points deviate from this guy 55:36 here. 55:37 This happens with zero probability, OK? 55:41 And that's just because if you look 55:42 at the variance of this guy, it's going to be 0. 55:46 And then that means that there's no deviation. 55:48 By the way, I'm using the name projection 55:51 when I talk about u transpose X, right? 55:55 So let's just be clear about this. 55:59 If you-- so let's say I have a bunch of points, 56:04 and u is a vector in this direction. 56:06 And let's say that u has the-- 56:08 so this is 0. 56:10 This is u. 56:10 And let's say that u has norm, 1, OK? 56:17 When I look, what is the coordinate of the projection? 56:21 So what is the length of this guy here? 56:23 Let's call this guy X1. 56:25 What is the length of this guy? 56:26 56:31 In terms of inner products? 56:32 56:35 This is exactly u transpose X1. 56:39 This length here, if this is X2, this 56:42 is exactly u transpose X2, OK? 56:46 So those-- u transpose X measure exactly the distance 56:52 to the origin of those-- 56:55 I mean, it's really-- 56:58 think of it as being just an x-axis thing. 57:00 You just have a bunch of points. 
57:02 You have an origin. 57:02 And it's really just telling you what 57:04 the coordinate on this axis is going to be, right? 57:07 So in particular, if the empirical variance is 0, 57:10 it means that all these points project 57:12 to the same point, which means that they have 57:14 to be orthogonal to this guy. 57:16 And you can think of it as being also maybe an entire plane 57:19 that's orthogonal to this line, OK? 57:23 So that's why I talk about projection, 57:26 because the inner products, u transpose X, 57:29 is really measuring the coordinates of X 57:36 when u becomes the x-axis. 57:39 Now, if u does not have norm 1, then you just 57:42 have a change of scale here. 57:44 You just have a change of unit, right? 57:46 So this is really u times X1. 57:51 The coordinates should really be divided by the norm of u. 57:54 57:59 OK, so now, just in the same way-- so 58:04 we're never going to have exactly 0. 58:07 But if we [INAUDIBLE] the other end, 58:08 if u transpose Su is large, what does it mean? 58:12 58:14 It means that when I look at my points 58:17 as projected onto the axis generated by u, 58:22 they're going to have a lot of variance. 58:23 They're going to be far away from each other in average, 58:25 right? 58:26 That's what large variance means, or at least 58:28 large empirical variance means. 58:31 And same thing for u. 58:34 So what we're going to try to find 58:36 is a u that maximizes this. 58:39 If I can find a u that maximizes this 58:42 so I can look in every direction, 58:44 and suddenly I find a direction in which the spread is massive, 58:48 then that's a point on which I'm basically 58:50 the less likely to have my points 58:52 project onto each other and collide, right? 58:54 At least I know they're going to project 58:56 at least onto two points. 58:59 So the idea now is to say, OK, let's try 59:02 to maximize this spread, right? 59:04 So we're going to try to find the maximum over all u's 59:09 of u transpose Su. 59:12 And that's going to be the direction that maximizes 59:15 the empirical variance. 59:15 Now of course, if I read it like that for all u's in Rd, 59:22 what is the value of this maximum? 59:23 59:28 It's infinity, right? 59:29 Because I can always multiply u by 10, 59:32 and this entire thing is going to multiplied by 100. 59:34 So I'm just going to take u as large as I want, 59:36 and this thing is going to be as large as I want, 59:38 and so I need to constrain u. 59:40 And as I said, I need to have u of size 1 59:42 to talk about coordinates in the system generated 59:45 by u like this. 59:47 So I'm just going to constrain u to have 59:50 Euclidean norm equal to 1, OK? 59:55 So that's going to be my goal-- trying 59:57 to find the largest possible u transpose Su, 60:01 or in other words, empirical variance of the points 60:03 projected onto the direction u when u is of norm 1, 60:07 which justifies to use the word, "direction," 60:11 and because there's no magnitude to this u. 60:12 60:17 OK, so how am I going to do this? 60:22 I could just fold and say, let's just optimize 60:25 this thing, right? 60:26 Let's just take this problem. 60:28 It says maximize a function onto some constraints. 60:32 Immediately, the constraint is sort of nasty. 60:34 I'm on a sphere, and I'm trying to move points on the sphere. 60:37 And I'm maximizing this thing which 60:38 actually happens to be convex. 60:40 And we know we know how to minimize convex functions, 60:42 but maximize them is a different question. 60:45 And so this problem might be super hard. 
60:47 So I can just say, OK, here's what 60:49 I want to do, and let me give that to an optimizer 60:52 and just hope that the optimizer can solve this problem for me. 60:56 That's one thing we can do. 60:57 Now as you can imagine, PCA is so well spread, right? 61:00 Principal component analysis is something 61:01 that people do constantly. 61:03 And so that means that we know how to do this fast. 61:06 So that's one thing. 61:07 The other thing that you should probably question about why-- 61:10 if this thing is actually difficult, why in the world 61:13 would you even choose the variance as a measure of spread 61:16 if there's so many measures of spread, right? 61:19 The variance is one measure of spread. 61:21 It's not guaranteed that everything 61:22 is going to project nicely far apart from each other. 61:26 So we could choose the variance, but we 61:27 could choose something else. 61:28 If the variance does not help, why choose it? 61:30 Turns out the variance helps. 61:32 So this is indeed a non-convex problem. 61:35 I'm maximizing, so it's actually the same. 61:38 I can make this constraint convex 61:41 because I'm maximizing a convex function, 61:43 so it's clear that the maximum is going 61:45 to be attained at the boundary. 61:47 So I can actually just fill this ball into some convex ball. 61:51 However, I'm still maximizing, so this 61:53 is a non-convex problem. 61:55 And this turns out to be the fanciest non-convex problem 61:57 we know how to solve. 61:59 And the reason why we know how to solve it 62:00 is not because of optimization or using gradient-type things 62:04 or anything of the algorithms that I mentioned 62:06 during the maximum likelihood. 62:09 It's because of linear algebra. 62:11 Linear algebra guarantees that we know how to solve this. 62:13 And to understand this, we need to go a little deeper 62:17 in linear algebra, and we need to understand the concept 62:22 of diagonalization of a matrix. 62:24 So who has ever seen the concept of an eigenvalue? 62:29 Oh, that's beautiful. 62:30 And if you're not raising your hand, 62:31 you're just playing "Candy Crush," right? 62:33 All right, so, OK. 62:35 62:44 This is great. 62:46 Everybody's seen it. 62:48 For my live audience of millions, maybe you have not, 62:51 so I will still go through it. 62:53 All right, so one of the basic facts-- 62:58 and I remember when I learned this in-- 63:02 I mean, when I was an undergrad, I 63:04 learned about the spectral decomposition 63:05 and this diagonalization of matrices. 63:07 And for me, it was just a structural property 63:09 of matrices, but it turns out that it's extremely useful, 63:11 and it's useful for algorithmic purposes. 63:13 And so what this theorem tells you 63:14 is that if you take a symmetric matrix-- 63:16 63:22 well, with real entries, but that 63:24 really does not matter so much. 63:28 And here, I'm going to actually-- 63:30 so I take a symmetric matrix, and actually S and sigma 63:33 are two such symmetric matrices, right? 63:36 Then there exists P and D, which are both-- 63:44 so let's say d by d. 63:47 Which are both d by d such that P is orthogonal. 63:55 63:58 That means that P transpose P is equal to PP transpose 64:02 is equal to the identity. 64:06 And D is diagonal. 64:07 64:11 And sigma, let's say, is equal to PDP transpose, OK? 64:20 So it's a diagonalization because it's 64:22 finding a nice transformation. 64:23 P has some nice properties. 64:25 It's really just the change of coordinates in which 64:28 your matrix is diagonal, right? 
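For concreteness, here is what the decomposition in this theorem looks like numerically, a sketch assuming Python/NumPy and a made-up symmetric matrix; np.linalg.eigh is one standard routine that produces the eigenvalues and an orthogonal P for symmetric inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
A = rng.normal(size=(d, d))
Sigma = A @ A.T                          # a symmetric, positive semidefinite matrix

# Spectral theorem: Sigma = P D P^T with P orthogonal and D diagonal.
# eigh returns eigenvalues in ascending order and eigenvectors as columns of P.
lam, P = np.linalg.eigh(Sigma)
D = np.diag(lam)

assert np.allclose(P @ P.T, np.eye(d))   # P orthogonal: P P^T = I
assert np.allclose(P.T @ P, np.eye(d))   # and P^T P = I
assert np.allclose(P @ D @ P.T, Sigma)   # the decomposition reconstructs Sigma
assert np.all(lam >= -1e-10)             # eigenvalues nonnegative: Sigma is PSD
print(np.round(lam, 3))
```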
64:31 And the way you want to see this-- 64:32 and I think it sort of helps to think about this problem 64:35 as being-- 64:36 sigma being a covariance matrix. 64:38 What does a covariance matrix tell you? 64:39 Think of a multivariate Gaussian. 64:41 Can everybody visualize a three-dimensional Gaussian 64:43 density? 64:45 Right, so it's going to be some sort of a bell-shaped curve, 64:48 but it might be more elongated in one direction than another. 64:51 And then going to chop it like that, all right? 64:54 So I'm going to chop it off. 64:56 And I'm going to look at how it bleeds, all right? 65:00 So I'm just going to look at where the blood is. 65:02 And what it's going to look at-- 65:03 it's going to look like some sort of ellipsoid, right? 65:08 In high dimension, it's just going to be an olive. 65:11 And that is just going to be bigger and bigger. 65:13 And then I chop it off a little lower, 65:16 and I get something a little bigger like this. 65:20 And so it turns out that sigma is capturing exactly this, 65:23 right? 65:23 The matrix sigma-- so the center of your covariance matrix 65:27 of your Gaussian is going to be this thing. 65:29 And sigma is going to tell you which direction it's elongated. 65:33 And so in particular, if you look, if you knew an ellipse, 65:36 you know there's something called principal axis, right? 65:38 So you could actually define something 65:39 that looks like this, which is this axis, the one along which 65:43 it's the most elongated. 65:44 Then the axis along which is orthogonal to it, 65:47 along which it's slightly less elongated, 65:49 and you go again and again along the orthogonal ones. 65:52 It turns out that those things here 65:56 is the new coordinate system in which this transformation, P 65:59 and P transpose, is putting you into. 66:03 And D has entries on the diagonal 66:06 which are exactly this length and this length, right? 66:09 So that's just what it's doing. 66:11 It's just telling you, well, if you 66:12 think of having this Gaussian or this high-dimensional 66:16 ellipsoid, it's elongated along certain directions. 66:19 And these directions are actually maybe not well aligned 66:23 with your original coordinate system, which might just 66:25 be the usual one, right-- 66:27 north, south, and east, west. 66:29 Maybe I need to turn it. 66:30 And that's exactly what this orthogonal transformation is 66:33 doing for you, all right? 66:36 So, in a way, this is actually telling you even more. 66:39 It's telling you that any matrix that's symmetric, 66:41 you can actually turn it somewhere. 66:45 And that'll start to dilate things in the directions 66:47 that you have, and then turn it back 66:49 to what you originally had. 66:50 And that's actually exactly the effect 66:53 of applying a symmetric matrix through a vector, right? 66:57 And it's pretty impressive. 66:58 It says if I take sigma times v. Any sigma that's 67:04 of this form, what I'm doing is-- that's symmetric. 67:07 What I'm really doing to v is I'm 67:09 changing its coordinate system, so I'm rotating it. 67:12 Then I'm changing-- I'm multiplying its coordinates, 67:14 and then I'm rotating it back. 67:16 That's all it's doing, and that's 67:18 what all symmetric matrices do, which 67:21 means that this is doing a lot. 67:24 All right, so OK. 67:27 So, what do I know? 67:29 So I'm not going to prove that this is 67:30 the so-called spectral theorem. 67:32 67:39 And the diagonal entries of D is of the form, lambda 1, 67:45 lambda 2, lambda d, 0, 0. 
67:49 And the lambda j's are called eigenvalues of D. 68:01 Now in general, those numbers can be positive, negative, 68:05 or equal to 0. 68:06 But here, I know that sigma and S are-- 68:12 well, they're symmetric for sure, 68:15 but they are positive semidefinite. 68:17 68:23 What does it mean? 68:25 It means that when I take u transpose sigma u for example, 68:30 this number is always non-negative. 68:33 68:35 Why is this true? 68:36 68:42 What is this number? 68:43 68:47 It's the variance of-- and actually, I don't even 68:49 need to finish this sentence. 68:51 As soon as I say that this is a variance, well, 68:53 it has to be non-negative. 68:55 We know that a variance is not negative. 68:57 And so, that's also a nice way you can use that. 69:00 So it's just to say, well, OK, this thing 69:02 is positive semidefinite because it's a covariance matrix. 69:04 So I know it's a variance, OK? 69:06 So I get this. 69:08 Now, if I had some negative numbers-- 69:10 so the effect of that is that when I draw this picture, 69:15 those axes are always positive, which is kind of a weird thing 69:19 to say. 69:19 But what it means is that when I take a vector, v, I rotate it, 69:23 and then I stretch it in the directions of the coordinate, 69:28 I cannot flip it. 69:30 I can only stretch or shrink, but I cannot flip its sign, 69:34 all right? 69:34 But in general, for any symmetric matrices, 69:37 I could do this. 69:38 But when it's positive symmetric definite, 69:40 actually what turns out is that all the lambda 69:43 j's are non-negative. 69:48 I cannot flip it, OK? 69:51 So all the eigenvalues are non-negative. 69:53 69:56 That's a property of positive semidef. 69:58 So when it's symmetric, you have the eigenvalues. 70:00 They can be any number. 70:01 And when it's positive semidefinite, in particular 70:03 that's the case of the covariance matrix 70:05 and the empirical covariance matrix, right? 70:07 Because the empirical covariance matrix 70:08 is an empirical variance, which itself is non-negative. 70:12 And so I get that the eigenvalues are non-negative. 70:17 All right, so principal component analysis is saying, 70:23 OK, I want to find the direction, u, 70:32 that maximizes u transpose Su, all right? 70:38 I've just introduced in one slide 70:40 something about eigenvalues. 70:41 So hopefully, they should help. 70:44 So what is it that I'm going to be getting? 70:47 Well, let's just see what happens. 70:51 Oh, I forgot to mention that-- and I will use this. 70:53 So the lambda j's are called eigenvectors. 70:56 And then the matrix, P, has columns v1 to vd, OK? 71:08 The fact that it's orthogonal-- that P transpose P is equal 71:13 to the identity-- 71:15 means that those guys satisfied that vi transpose 71:20 vj is equal to 0 if i is different from j. 71:27 And vi transpose vi is actually equal to 1, 71:31 right, because the entries of PP transpose 71:33 are exactly going to be of the form, vi transpose vj, OK? 71:38 So those v's are called eigenvectors. 71:40 71:46 And v1 is attached to lambda 1, and v2 is attached to lambda 2, 71:52 OK? 71:53 So let's see what's happening with those things. 71:56 What happens if I take sigma-- 71:58 so if you know eigenvalues, you know exactly what's 72:00 going to happen. 72:01 If I look at, say, sigma times v1, well, what is sigma? 72:06 We know that sigma is PDP transpose v1. 72:15 What is P transpose times v1? 72:17 Well, P transpose has rows v1 transpose, 72:21 v2 transpose, all the way to vd transpose. 
72:26 So when I multiply this by v1, what 72:30 I'm left with is the first coordinate 72:32 is going to be equal to 1 and the second coordinate is 72:38 going to be equal to 0, right? 72:40 Because they're orthogonal to each other-- 72:42 0 all the way to the end. 72:45 So that's when I do P transpose v1. 72:48 Now I multiply by D. Well, I'm just 72:55 multiplying this guy by lambda 1, this guy by lambda 2, 72:58 and this guy by lambda d, so this is really just lambda 1. 73:02 73:04 And now I need to post-multiply by P. 73:12 So what is P times this guy? 73:14 Well, P is v1 all the way to vd. 73:19 And now I multiply by a vector that 73:21 only has 0's except lambda 1 on the first guy. 73:24 So this is just lambda 1 times v1. 73:26 73:29 So what we've proved is that sigma times v1 is lambda 1 v1, 73:34 and that's probably the notion of eigenvalue you're 73:37 most comfortable with, right? 73:39 So just when I multiply by v1, I get 73:41 v1 back multiplied by something, which is the eigenvalue. 73:45 So in particular, if I look at v1, transpose sigma v1, 73:54 what do I get? 73:55 Well, I get lambda 1 v1 transpose v1, 73:58 which is 1, right? 74:00 So this is actually lambda 1 v1 transpose v1, 74:04 which is lambda 1, OK? 74:08 And if I do the same with v2, clearly I'm 74:10 going to get v2 transpose sigma. 74:13 v2 is equal to lambda 2. 74:16 So for each of the vj's, I know that if I 74:19 look at the variance along the vj, 74:21 it's actually exactly given by those eigenvalues, all right? 74:27 Which proves this, because the variance along the eigenvectors 74:38 is actually equal to the eigenvalues. 74:40 So since they're variances, they have to be non-negative. 74:43 So now, I'm looking for the one direction that 74:47 has the most variance, right? 74:50 But that's not only among the eigenvectors. 74:53 That's also among the other directions 74:55 that are in-between the eigenvectors. 74:57 If I were to look only at the eigenvectors, 74:59 it would just tell me, well, just pick the eigenvector, vj, 75:02 that's associated to the largest of the lambda j's. 75:05 But it turns out that that's also true for any vector-- 75:09 that the maximum direction is actually one direction which 75:11 is among the eigenvectors. 75:13 And among the eigenvectors, we know that the one that's 75:16 the largest-- 75:17 that carries the largest variance is 75:18 the one that's associated to the largest eigenvalue, all right? 75:23 And so this is what PCA is going to try to do for me. 75:26 So in practice, that's what I mentioned already, right? 75:29 We're trying to project the point cloud 75:31 onto a low-dimensional space, D prime, 75:34 by keeping as much information as possible. 75:36 And by "as much information," I mean we do not 75:39 want points to collide. 75:41 And so what PCA is going to do is just 75:45 going to try to project [? on two ?] directions. 75:48 So there's going to be a u, and then 75:49 there's going to be something orthogonal to u, and then 75:52 the third one, et cetera, so that once we project on those, 75:55 we're keeping as much of the covariance as possible, OK? 75:59 And in particular, those directions 76:02 that we're going to pick are actually 76:04 a subset of the vj's that are associated to the largest 76:06 eigenvalues. 76:08 So I'm going to stop here for today. 76:11 We'll finish this on Tuesday. 76:15 But basically, the idea is it's just the following. 76:18 You're just going to-- well, let me skip one more. 76:22 Yeah, this is the idea. 
76:24 You're first going to pick the eigenvector associated 76:27 to the largest eigenvalue. 76:30 Then you're going to pick the direction that orthogonal 76:33 to the vector that you've picked, 76:37 and that's carrying the most variance. 76:38 And that's actually the second largest-- 76:40 the eigenvector associated to the second largest eigenvalue. 76:44 And you're going to go all the way to the number of them 76:46 that you actually want to pick, which is in this case, d, OK? 76:50 And wherever you choose to chop this process, 76:53 not going all the way to d, is going to actually give you 76:56 a lower-dimensional representation 76:57 in the coordinate system that's given by v1, v2, v3, et 77:01 cetera, OK? 77:02 So we'll see that in more details on Tuesday. 77:04 But I don't want to get into it now. 77:06 We don't have enough time. 77:07 Are there any questions? 77:10
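Since the lecture stops just before assembling the full procedure, here is a compact sketch of the idea as stated so far (my own Python/NumPy illustration with made-up data, not the course's reference implementation): center the points, form S, diagonalize it, keep the eigenvectors attached to the largest eigenvalues, and read off the low-dimensional coordinates.

```python
import numpy as np

def pca_directions(X, k):
    """Sketch of the procedure described above: return the variances along,
    and the coordinates on, the top-k eigenvectors of the sample covariance."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)              # center: same effect as applying H
    S = Xc.T @ Xc / n                    # sample covariance, d-by-d
    lam, V = np.linalg.eigh(S)           # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(lam)[::-1]        # sort directions by decreasing variance
    lam, V = lam[order], V[:, order]
    Vk = V[:, :k]                        # v_1, ..., v_k: the principal directions
    return lam[:k], Vk, Xc @ Vk          # variances, directions, k-dim coordinates

# Made-up example: 3-dimensional points that mostly vary along one direction.
rng = np.random.default_rng(5)
n = 300
t = rng.normal(size=(n, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(n, 3))

var_top, V2, coords = pca_directions(X, k=2)
print(np.round(var_top, 3))              # the first eigenvalue dominates
print(np.round(V2[:, 0], 3))             # up to sign, close to (2, 1, 0.5) normalized
print(coords.shape)                      # (300, 2): the lower-dimensional representation
```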