https://www.youtube.com/watch?v=a1ZCeFpeW0o&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=18

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: We keep on talking about principal component analysis, which we essentially introduced as a way to work with a bunch of data. So the data that's given to us when we want to do PCA is a bunch of vectors X1 to Xn. So they are random vectors in Rd. And what we mentioned is that we're going to be using linear algebra-- in particular, the spectral theorem-- that guarantees to us that if I look at the covariance matrix of this guy, or its empirical covariance matrix, since they're symmetric real matrices and they are positive semidefinite, there exists a diagonalization into non-negative eigenvalues.

And so here, those things live in Rd, so it's a really large space. And what we want to do is to map it down into a space that we can visualize, hopefully a space of size 2 or 3. Or if not, then we're just going to take more and start looking at subspaces altogether. So think of the case where d is large but not larger than n. So let's say you have a large number of points. The question is, is it possible to project those things onto a lower dimensional space, d prime, which is much less than d-- so think of d prime equals, say, 2 or 3-- so that you keep as much information as possible about the cloud of points that you had originally.

So again, the example that we could have is that X1 to Xn-- say Xi for patient i-- is recording a bunch of body measurements, and maybe blood pressure, some symptoms, et cetera. And then we have a cloud of n patients. And we're trying to visualize, maybe to see if-- if I could see, for example, that there are two groups of patients, maybe I would know that I have two groups of different diseases, or maybe two groups of different patients that respond differently to a particular disease or drug, et cetera. So visualizing is going to give us quite a bit of insight about what the spatial arrangement of those vectors is.

And so PCA says, well, here, of course, in this question, one thing that's not defined is what information is. And we said that one thing we might want to do when we project is that points do not collide with each other. And so that means we're trying to find directions where, after I project, the points are still pretty spread out, so I can see what's going on. And PCA says-- OK, there's many ways to answer this question-- PCA says, let's just find a subspace of dimension d prime that keeps as much covariance structure as possible. And the reason is that those directions are the ones that maximize the variance, which is a proxy for the spread. There's many, many ways to do this.
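For concreteness, here is a minimal NumPy sketch of the objects just introduced-- the empirical covariance matrix of a cloud of points and its spectral decomposition. The data, shapes, and variable names are illustrative assumptions, not anything from the lecture or the slides.

```python
import numpy as np

# Sketch of the setup: n random points in R^d, their empirical covariance,
# and its spectral decomposition. Synthetic data; names and shapes are illustrative.
rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # rows are X_1, ..., X_n in R^d

Xbar = X.mean(axis=0)
S = (X - Xbar).T @ (X - Xbar) / n          # empirical covariance matrix, d x d

# S is symmetric positive semidefinite, so the spectral theorem applies:
# S = P D P^T with P orthogonal and non-negative eigenvalues on the diagonal of D.
eigvals, P = np.linalg.eigh(S)             # eigh returns eigenvalues in increasing order
eigvals, P = eigvals[::-1], P[:, ::-1]     # reorder so that lambda_1 >= ... >= lambda_d
assert np.allclose(S, S.T)
assert np.all(eigvals >= -1e-10)           # non-negative, up to numerical error
assert np.allclose(P @ np.diag(eigvals) @ P.T, S)
```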
There's actually a Google video that was released maybe last week about the data visualization team of Google that shows you something called t-SNE, which is essentially something that tries to do that. It takes points in very high dimensions and tries to map them into lower dimensions, so that you can actually visualize them. And t-SNE is some alternative to PCA that gives another definition for the word information. I'll talk about this towards the end, how you can actually somewhat automatically extend everything we've said for PCA to an infinite family of procedures.

So how do we do this? Well, the way we do this is as follows. So remember, given those guys, we can form something which is called S, which is the sample, or the empirical, covariance matrix. And since, from a couple of slides ago, we know that S has an eigenvalue decomposition, S is equal to P D P transpose, where P is orthogonal. So that's where we use our linear algebra results. So that means that P transpose P is equal to P P transpose, which is the identity. So remember, S is a d by d matrix, and so P is also d by d. And D is diagonal.

And I'm actually going to take, without loss of generality-- I'm going to assume that D-- so it's going to be diagonal-- is going to have something that looks like lambda 1 to lambda d. Those are called the eigenvalues of S. What we know is that the lambda j's are non-negative. And actually, what I'm going to assume, without loss of generality, is that lambda 1 is larger than lambda 2, which is larger than, all the way down to, lambda d. Because in particular, this decomposition-- the spectral decomposition-- is not entirely unique. I could permute the columns of P, and I would still have an orthogonal matrix. And to balance that, I would also have to permute the entries of D. So there's as many decompositions as there are permutations. So there's actually quite a bit. But the bag of eigenvalues is unique. The set of eigenvalues is unique. The ordering is certainly not unique.

So here, I'm just going to pick-- I'm going to nail down one particular permutation-- actually, maybe two in case I have equalities. But let's say I pick one that satisfies this. And the reason why I do this is really not very important. It's just to say, I'm going to want to talk about the largest of those eigenvalues. So this is just going to be easier for me, to say that this one is lambda 1, rather than say it's lambda 7. So this is just to say that the largest eigenvalue of S is lambda 1. If I didn't do that, I would just call it maybe lambda max, and you would just know which one I'm talking about.

So what's happening now is that if I look at D, then it turns out that if I start-- so if I do P transpose Xi, I am actually projecting my Xi's-- I'm basically changing the basis for my Xi's. And now, D is the empirical covariance matrix of those guys. So let's check that.
So what it means is that if I look at-- so what I claim is that P transpose Xi-- that's a new vector, let's call it Yi, it's also in Rd-- and what I claim is that the covariance matrix of this guy is actually now this diagonal matrix, which means in particular that if they were Gaussian, then they would be independent. But I also know now that there's no correlation across coordinates of Yi.

So to prove this, let me assume that X bar is equal to 0. And the reason why I do this is because it's just annoying to carry out all this centering constantly when I talk about S. So when X bar is equal to 0, that implies that S has a very simple form. It's of the form sum from i equals 1 to n of Xi Xi transpose. So that's my S. But what I want is the S of Y. So, OK, that implies also that P transpose times X bar, which is equal to Y bar, is also equal to 0. So that means that Y has mean 0, if this is 0. So if I look at the sample covariance matrix of Y, it's just going to be something that looks like the sum of the outer products, of the Yi Yi transpose. And again, the reason why I make this assumption is so that I don't have to write minus X bar X bar transpose. But you can do it, and it's going to work exactly the same.

So now, I look at this S prime. And so what is this S prime? Well, I'm just going to replace Yi with P Xi. So it's the sum from i equals 1 to n of P Xi times P Xi transpose, which is equal to the sum from-- sorry, there's a 1/n. So it's equal to 1/n times the sum from i equals 1 to n of P Xi Xi transpose P transpose. Agree? I just said that the transpose of AB is the transpose of B times the transpose of A. And so now, I can push the sum in. P does not depend on i. So this thing here is equal to P S P transpose, because the sum of the Xi Xi transpose divided by n is S.

But what is P S P transpose? Well, we know that S is equal to-- sorry, that's P transpose. So this was with a P transpose. I'm sorry, I made an important mistake here. So Yi is P transpose Xi. So this is P transpose and P transpose here, which means that this is P transpose, and this is a double transpose, which is just nothing, and that's a transpose and nothing. So now, I write S as P D P transpose. That's the spectral decomposition that I had before. That's my eigenvalue decomposition, which means that now, if I look at S prime, it's P transpose times P D P transpose times P. But now, P transpose P is the identity, P transpose P is the identity. So this is actually just equal to D.

And again, you can check that this also works if you have to center all those guys as you go. But if you think about it, this is the same thing as saying that I just replaced Xi by Xi minus X bar. And then it's true that Yi minus Y bar is also P transpose times Xi minus X bar. So now, we have that D is the empirical covariance matrix of those guys-- the Yi's, which are P transpose Xi's. And so in particular, what it means is that if I look at the covariance of Yj, Yk-- so that's the covariance of the j-th coordinate of Y and the k-th coordinate of Y-- I'm just not putting an index.
But maybe, let's say the first one or something like this-- any of them, they're IID. Then what is this covariance? It's actually 0 if j is different from k. And the covariance between Yj and Yj, which is just the variance of Yj, is equal to lambda j-- the j-th largest eigenvalue. So the eigenvalues capture the variance of my observations in this new coordinate system. And they're completely orthogonal.

So what does that mean? Well, again, remember, if I chop off the head of my Gaussian in multiple dimensions, we said that what we started from was something that looked like this. And we said, well, there's one direction that's important-- that's this guy-- and one important one that's this guy. When I apply the transformation P transpose, what I'm doing is that I'm realigning this thing with the new axes. Or in a way, rather, to be fair, I'm not actually realigning the ellipses with the axes. I'm really realigning the axes with the ellipses. So really, what I'm doing is I'm saying, after I apply P transpose, I'm just rotating this coordinate system. So now, it becomes this guy. And now, my ellipses actually completely align. And what happens here is that this coordinate is independent of that coordinate. And that's what we write here, if they are Gaussian. I didn't really tell this-- I'm only making statements about covariances. If they are Gaussians, those imply statements about independence.

So as I said, the variance now-- lambda 1 is actually the variance of P transpose Xi. But if I look now at the-- so this is a vector, so I need to look at the first coordinate of this guy. So it turns out that doing this is actually the same thing as looking at the variance of what? Well, the first column of P times Xi. So that's the variance of-- I'm going to call it v1 transpose Xi, where the v1 to vd in Rd are eigenvectors. And each vi is associated to lambda i. So that's what we saw when we talked about this eigendecomposition a couple of slides back. That's the one here. So if I call the columns of P v1 to vd, this is what's happening.

So when I look at lambda 1, it's just the variance of Xi inner product with v1. And we made this picture when we said, well, let's say v1 is here and then Xi is here. And if v1 has unit norm, then the inner product between Xi and v1 is just the length of this guy here. So that's the length of Xi-- this is 0-- when I project it onto the direction spanned by v1. If v1 has length 2, this is really just twice this length. If v1 has length 3, it's three times this. But it turns out that since P satisfies P transpose P is equal to the identity-- that's an orthogonal matrix, that's right here-- then this is actually saying the same thing as: vj transpose vj, which is really the norm squared of vj, is equal to 1, and vj transpose vk is equal to 0 if j is different from k. The eigenvectors are orthogonal to each other, and they're actually all of norm 1. So now, I know that this is indeed a direction.
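Before moving on, the claim just derived-- that the Yi = P transpose Xi have empirical covariance exactly D-- is easy to check numerically. Here is a sketch on synthetic data, assuming NumPy; nothing in it comes from the lecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)                       # center, so X bar = 0 as in the lecture

S = X.T @ X / n                              # empirical covariance of the X_i
lam, P = np.linalg.eigh(S)
lam, P = lam[::-1], P[:, ::-1]               # lambda_1 >= ... >= lambda_d

Y = X @ P                                    # rows are Y_i^T, where Y_i = P^T X_i
S_prime = Y.T @ Y / n                        # empirical covariance of the Y_i

# S' = P^T S P = D: diagonal, with the eigenvalues on the diagonal
assert np.allclose(S_prime, np.diag(lam), atol=1e-8)
# and the empirical variance of the j-th coordinate of Y is lambda_j
assert np.allclose(Y.var(axis=0), lam, atol=1e-8)
```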
And so when I look at v1 transpose Xi, I'm really measuring exactly this length. And what is this length? It's the length of the projection of Xi onto this line. That's the line that's spanned by v1. So if I had a very high dimensional problem and I started to look at the direction v1-- let's say v1 now is not an eigenvector, it's any direction-- then if I want to do this lower dimensional projection, then I have to understand how those Xi's project onto the line that's spanned by v1, because this is all that I'm going to be keeping at the end of the day about the Xi's.

So what we want is to find the direction where those Xi's, those projections, have a lot of variance. And we know that the variance of Xi in this direction is actually exactly given by lambda 1. Sorry, that's the empirical var-- yeah, I should call it variance hat. That's the empirical variance. Everything is empirical here. We're talking about the empirical covariance matrix. And so I also have that lambda 2 is the empirical variance when I project Xi onto v2, which is the second one, just for exactly this reason.

Any questions?

So the lambda j's are going to be important for us. Lambda j measures the spread of the points when I project them onto a line, which is a one dimensional space. And so I'm going to have-- let's say I want to pick only one-- I'm going to have to find the one dimensional space that carries the most variance. And I claim that v1 is the one that actually maximizes the spread.

So the claim-- so for any direction u in Rd-- and by direction, I really just mean that the norm of u is equal to 1. I need to play fair-- I'm going to compare myself to other things of length one, so I need to play fair and look at directions of length 1. Now, if I'm interested in the empirical variance of X1 transpose-- sorry, u transpose X1, ..., u transpose Xn-- then this thing is maximized for u equals v1, where v1 is the eigenvector associated to lambda 1, and lambda 1 is not just any eigenvalue, it's the largest of all those. So it's the largest eigenvalue.

So why is that true? Well, there's also a claim that for any direction u-- so that's 1 and 2-- the variance of u transpose X-- now, this is just a random variable, and I'm talking about the true variance-- this is maximized for u equals, let's call it w1, where w1 is the eigenvector of sigma-- now, I'm talking about the true variance, whereas here, I was talking about the empirical variance-- so w1 is the eigenvector of the true sigma associated to the largest eigenvalue of sigma. So I did not give it a name. Here, that was lambda 1 for the empirical one. For the true one, you can give it another name, mu 1 if you want. But that's just the same thing. All it's saying is, wherever I see empirical, I can remove it.

So why is this claim true? Well, let's look at the second one, for example. So what is the variance of u transpose X? So that's what I want to know. So that's the expectation-- so let's assume that X is 0, again, for the same reasons as before.
So what is the variance? It's just the expectation of the square. I don't need to remove the expectation. And the expectation of the square is just the expectation of u transpose X, and then I'm going to write the other one, X transpose u. And we know that this is deterministic. So I'm just going to take that this is just u transpose, expectation of X X transpose, u. And what is this guy? That's the covariance, sigma. That's just what sigma is. So the variance I can write as u transpose sigma u. We've made this computation before.

And now what I want to claim is that this thing is actually less than the largest eigenvalue, which I actually called lambda 1 here-- I should probably not-- and the P is-- well, OK, let's just pretend everything is not empirical. So now, I'm going to write sigma as P, times the diagonal matrix lambda 1 to lambda d, times P transpose. That's just the eigendecomposition, where I admittedly reuse the same notation as I did for S. So I should really put some primes everywhere, so you know those are things that are actually different in practice. So this is just the decomposition of sigma.

You seem confused, Helen. You have a question? Yeah?

AUDIENCE: What is-- when you talked about the empirical data and--

PHILIPPE RIGOLLET: So OK-- so I can make everything I'm saying-- I can talk about either the variance or the empirical variance. And you can just add the word empirical in front of it whenever you want. The same thing works. But just for the sake of removing the confusion, let's just do it again with S. So I'm just going to do everything with S. So I'm going to assume that X bar is equal to 0. And here, I'm going to talk about the empirical variance, which is just 1/n times the sum from i equals 1 to n of u transpose Xi squared. So it's the same thing. Everywhere you see an expectation, you just put in an average. And then I get 1/n times the sum from i equals 1 to n of Xi Xi transpose. And now, I'm going to call this guy S, because that's what it is. So this is u transpose S u. But just to see that I could just replace the expectation by averages everywhere-- you can tell that the thing is going to work for either one or the other.

So now, this thing was actually-- so now, I don't have any problem with my notation. This is actually the decomposition of S. That's just the spectral decomposition into its eigenvalues. And so now, what I have is that when I look at u transpose S u, this is actually equal to P transpose u, transposed, times-- sorry, that's not S, that's D, that's this diagonal matrix-- times P transpose u. There's a transpose somewhere. That's this guy, and that's this guy.

Let's look at this thing. And let's call P transpose u-- let's call it b. So that's also a vector in Rd. What is it? It's just, I take a unit vector, and then I apply P transpose to it. So that's basically what happens to a unit vector when I apply the same change of basis that I did. So I'm just changing my orthogonal system the same way I did for the other ones. So what's happening when I write this?
Well, now I have that u transpose S u is b transpose D b. But now, doing b transpose D b, when D is diagonal and b is a vector, is a very simple thing. I can expand it. This is what? This is just the sum from j equals 1 to d of lambda j bj squared. So that's just like matrix-vector multiplication. And in particular, I know that the largest of those guys is lambda 1, and those guys are all non-negative. So this thing is actually less than lambda 1 times the sum from j equals 1 to d of lambda j squared-- sorry, bj squared. And this is just the norm of b squared.

So if I want to prove what's on the slide, all I need to check is that b has norm, which is--

AUDIENCE: 1.

PHILIPPE RIGOLLET: At most, 1. It's going to be at most 1. Why? Well, because b is really just a change of basis for u. And so if I take a vector, I'm just changing its basis. I'm certainly not changing its length-- think of a rotation, and I can also flip it, but think of a rotation-- well, actually, for a vector, it's just going to be a rotation. And so now, I just have to check that the norm of b squared is equal to what? Well, it's equal to the norm of P transpose u squared, which is equal to u transpose P P transpose u. But P is orthogonal. So this thing is actually just the identity. So that's just u transpose u, which is equal to the norm of u squared, which is equal to 1, because I took u to have norm 1 in the first place. And so this-- you're right-- was actually of norm equal to 1. I just needed to have it less, but it's equal.

And so what I'm left with is that this thing is actually equal to lambda 1. So I know that for every u that I pick-- that has norm-- so I'm just reminding you that u here has norm squared equal to 1-- for every u that I pick, this u transpose S u is at most lambda 1. So u transpose S u is at most lambda 1. And we know that that's the variance-- that's the empirical variance-- when I project my points onto the direction spanned by u.

So now, I have an empirical variance which is at most lambda 1. But I also know that if I take u to be something very specific-- I mean, it was on the previous board-- if I take u to be equal to v1, then this thing is actually not an inequality, this is an equality. And the reason is, when I actually take u to be v1, all of these bj's are going to be 0, except for the one that's b1, which is itself equal to 1.

So I mean, we can briefly check this. But if I take v-- if u is equal to v1, what I have is that u transpose S u is equal to P transpose v1, transposed, times D times P transpose v1. But what is P transpose v1? Well, remember, P transpose is just the matrix that has vectors v1 transpose here, v2 transpose here, all the way to vd transpose here. And we know that when I take vj transpose vk, I get 0 if j is different from k. And if j is equal to k, I get 1. So P transpose v1 is equal to what? Take v1 here and multiply it. So the first coordinate is going to be v1 transpose v1, which is 1. The second coordinate is going to be v2 transpose v1, which is 0.
And so I get 0's all the way, right? So that means that this thing here is really just the vector 1, 0, ..., 0. And here, this is just the vector 1, 0, ..., 0. So when I multiply it with this guy, I am only picking up the top left element of D, which is lambda 1. So for every u, it's at most lambda 1. And for v1, it's equal to lambda 1, which means that it's maximized for u equals v1.

And that's where I said that this is the fanciest non-convex problem we know how to solve. This was a problem that was definitely non-convex. We were maximizing a convex function over a sphere. But we know that v1-- which is something-- I mean, of course, you still have to believe me that you can compute the spectral decomposition efficiently-- but essentially, if you've taken linear algebra, you know that you can diagonalize a matrix. And so you get that v1 is just the maximum. So you can find your maximum just by looking at the spectral decomposition. You don't have to do any optimization or anything like this.

So let's recap. Where are we? We've established that if I start with my empirical covariance matrix, I can diagonalize it into P D P transpose. And then if I take the eigenvector associated to the largest eigenvalue-- so if I permute the columns of P and of D in such a way that they are ordered from the largest to the smallest when I look at the diagonal elements of D-- then if I pick the first column of P, it's v1. And v1 is the direction on which, if I project my points, they are going to carry the most empirical variance. Well, that's a good way-- if I told you, pick one direction along which, if you were to project your points, they would be as spread out as possible, that's probably the one you would pick. And so that's exactly what PCA is doing for us. It says, OK, if you ask me to take d prime equal to 1, I will take v1. I will just take the direction that's spanned by v1.

And that's just, when I come back to this picture that was here before, this is v1. Of course, here, I only have two of them. So v2 has to be this guy, or this guy-- or, I mean, this thing. I mean, I only know them up to sign. But then if I have three-- think of like an olive in three dimensions-- then maybe I have one direction that's slightly more elongated than the other one. And so I'm going to pick the second one.

And so the procedure is to say, well, first, I'm going to pick v1 the same way I picked v1 in the first place. So the first direction I am taking is the leading eigenvector. And then I'm looking for a direction. Well, if I found one-- the one I'm going to want to find-- if you say you can take d prime equal to 2, you're going to need a basis for this guy. So the second one has to be orthogonal to the first one you've already picked. And so the second one you pick is the one that, among all those that are orthogonal to v1, maximizes the empirical variance when you project onto it. And it turns out that this is actually exactly v2. You don't have to redo anything again. Your eigendecomposition-- this is just the second column of P. Clearly, v2 is orthogonal to v1.
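As an aside, the maximization claim proved on the last few boards-- u transpose S u is at most lambda 1 for every unit vector u, with equality at u equals v1-- can be sanity-checked in a few lines. Again, this is a sketch on synthetic data with NumPy; the setup is an assumption of mine, not the lecture's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)
S = X.T @ X / n

lam, P = np.linalg.eigh(S)
lam1, v1 = lam[-1], P[:, -1]                 # largest eigenvalue and its eigenvector

# For random unit directions u, the empirical variance u^T S u never exceeds lambda_1 ...
for _ in range(1000):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                   # play fair: norm 1
    assert u @ S @ u <= lam1 + 1e-10

# ... and it is attained at u = v1.
assert np.isclose(v1 @ S @ v1, lam1)
```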
We just used it here. This 0 here just says that v2 is orthogonal to v1. So they're like this. And now, what I said-- what this slide tells you extra-- is that v2, among all those directions that are orthogonal-- I mean, there's still d minus 1 of them-- this is the one that maximizes the, say, residual empirical variance-- the one that was not explained by the first direction that you picked. And you can check that. I mean, it's becoming a bit more cumbersome to write down, but you can check that. If you're not convinced, please raise your concern.

I mean, basically, one way to view this-- I mean, you're not really dropping a coordinate, because v1 is not a coordinate. But let's assume actually, for simplicity, that v1 was actually equal to e1-- that the direction that carries the most variance is the one that just says, just look at the first coordinate of X. So if that was the case, then clearly the orthogonal directions are the ones that comprise only the coordinates 2 to d. So you could actually just drop the first coordinate and do the same thing on a slightly shorter vector of length d minus 1. And then you would just look at the largest eigenvector of these guys, et cetera, et cetera. So in a way, that's what's happening, except that you rotate it before you actually do this. And that's exactly what's happening.

So what we put together here is essentially three things. One was statistics. Statistics says, if you want spread, if you want information, you should be looking at variance. The second one was optimization. Optimization said, well, if you want to maximize spread, well, you have to maximize variance in a certain direction. And that means maximizing over the sphere of vectors that have unit norm. And that's an optimization problem, which actually turned out to be difficult. But then the third thing that we used to solve this problem was linear algebra. Linear algebra said, well, it looks like it's a difficult optimization problem, but it turns out that the answer comes in almost-- I mean, it's not a closed form, but those things are so commonly used that it's almost a closed form-- it says, just pick the eigenvectors in order of their associated eigenvalues, from largest to smallest.

And that's why principal component analysis has been so popular and has gained a huge amount of traction since we've had computers that allow us to compute eigenvalues and eigenvectors for matrices of gigantic sizes. You can actually do that. If I give you-- I don't know, this Google video, for example, is talking about words. They want to do, say, principal component analysis of words. So I give you all the words in the dictionary. And-- sorry, well, you would have to have a representation for words, so it's a little more difficult. But how do I do this? Let's say, for example, pages of a book. I want to understand the pages of a book. And I need to turn it into a number. And a page of a book is basically the word count. So I just count the number of times "the" shows up, the number of times "and" shows up, the number of times "dog" shows up.
And so that gives me a vector. It's in pretty high dimensions. It's as many dimensions as there are words in the dictionary. And now, I want to visualize how those pages get together-- are two pages very similar or not? And so what you would do is essentially just compute the largest eigenvector of this matrix-- maybe the two largest-- and then project this onto a plane. Yeah?

AUDIENCE: Can we assume the number of points was far larger than the dimension?

PHILIPPE RIGOLLET: Yeah, but there are many pages in the world. There are probably more pages in the world than there are words in the dictionary. Yeah, so of course, if you are in high dimensions and you don't have enough points, it's going to be clearly an issue. If you have two points, then the leading eigenvector is going to be just the line that goes through those two points, regardless of what the dimension is. And clearly, you're not learning anything.

So you have to pick, say, the k largest ones. If you go all the way, you're just reordering your thing, and you're not actually gaining anything. You start from d and you go to d. So at some point, this procedure has to stop. And let's say it stops at k. Now, of course, you should ask me a question, which is, how do you choose k? So that's, of course, a natural question. Probably the basic answer is just pick k equals 3, because you can actually visualize it. But what happens if I take k equal to 4? If I take k equal to 4, I'm not going to be able to plot points in four dimensions. Well, I could-- I could add color, or I could try to be a little smart about it. But it's actually quite difficult.

And so what people tend to do, if you have four dimensions, they actually do a bunch of two dimensional plots. And that's what a computer-- a computer is not very good-- I mean, by default, they don't spit out three dimensional plots. So let's say they want to plot only two dimensional things. So they're going to take the first directions of, say, v1, v2. Let's say you have three, but you want to have only two dimensional plots. And then it's going to do v1, v3; and then v2, v3. So really, you take all three of them, but it's really just showing you all choices of pairs of those guys. So if you were to keep k equal to 5, you would have 5 choose 2 different plots.

So this is the actual principal component algorithm, how it's implemented. And it's actually fairly simple. I mean, it looks like there's lots of steps, but really, there's only one that's important. So the first one is the input. I give you a bunch of points, x1 to xn, in d dimensions. And step two is, well, compute their empirical covariance matrix S. The points themselves, we don't really care about. We care about their empirical covariance matrix. So it's a d by d matrix. Now, I'm going to feed that-- and that's where the actual computation starts happening-- I'm going to feed that to something that knows how to diagonalize this matrix.
And you have to trust me-- if I want to compute the k largest eigenvalues and my matrix is d by d, it's going to take me about k times d squared operations. So if I want only three, it's 3 times d squared, which is about-- d squared is the time it takes me to just even read the matrix sigma. So that's not too bad. So what it's going to spit out, of course, is the diagonal matrix D. And those are nice, because they allow me to tell what the order is in which I should be taking the columns of P. But what's really important to me is v1 to vd, because those are going to be the ones I'm going to be using to draw those plots.

And now, I'm going to say, OK, I need to actually choose some k. And I'm going to basically truncate and look only at the first k columns of P. Once I have those columns, what I want to do is to project onto the linear span of those columns. And there's actually a simple way to do this, which is just take this matrix Pk, which is really the matrix of projection onto the linear span of those k columns, and you just take Pk transpose. And then you apply this to every single one of your points.

Now, Pk transpose-- what is the size of the matrix Pk? Yeah, [INAUDIBLE]?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So Pk is just this matrix. I take the v1 and I stop at vk-- well--

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: d by k, right? Each of the columns is an eigenvector. It's of dimension d. I mean, that's a vector in the original space. So I have this d by k matrix. So all it is, is-- if I had my-- well, I'm going to talk in a second about Pk transpose. Pk transpose is just this guy, where I stop at the k-th vector. So Pk transpose is k by d. So now, when I take Yi, which is Pk transpose Xi, I end up with a point which is in k dimensions. I have only k coordinates. So I took every single one of my original points Xi, which had d coordinates, and I turned it into a point that has only k coordinates. In particular, I could have k equal to 2.

This matrix is exactly the one that projects. If you think about it for one second, this is just the matrix that says-- well, we actually did that several times. The matrix-- so that was this P transpose u that showed up somewhere. And so that's just the matrix that takes your point X in, say, three dimensions, and then just projects it down to two dimensions. And it just goes to the closest point in the subspace. Now, here, the floor is flat. But we can pick any subspace we want, depending on what the lambdas are. So the lambdas were important for us to be able to identify which columns to pick. The fact that we assumed that they were ordered tells us that we can pick the first ones. If they were not ordered, it would be just a subset of the columns, depending on what the size of the eigenvalue is. So each column is labeled.

And so then, of course, we still have this question of, how do I pick k? So there's definitely the matter of convenience. Maybe 2 is convenient. If it works for 2, you don't have to go any farther.
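Putting the six steps together, a compact sketch of the algorithm as just described-- center, form S, diagonalize, keep the top k eigenvectors, project-- might look like this in NumPy. The function name, the synthetic data, and the choice k = 2 are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n points in R^d) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                      # steps 1-2: input, then center
    n, d = Xc.shape
    S = Xc.T @ Xc / n                            # empirical covariance, d x d
    lam, P = np.linalg.eigh(S)                   # step 3: diagonalize S = P D P^T
    order = np.argsort(lam)[::-1]                # sort so that lambda_1 >= ... >= lambda_d
    lam, P = lam[order], P[:, order]
    Pk = P[:, :k]                                # steps 4-5: keep the first k columns v_1, ..., v_k
    Y = Xc @ Pk                                  # step 6: Y_i = P_k^T X_i, now in R^k
    return Y, lam, Pk

# usage on synthetic data: reduce 50-dimensional points to k = 2 coordinates for plotting
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 50))
Y, lam, Pk = pca_project(X, k=2)
print(Y.shape)                                   # (1000, 2)
```

The rows of Y are exactly the Yi = Pk transpose Xi from the slide, one per original observation.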
But you might want to say, well-- originally, I did that to actually keep as much information as possible. I know that the ultimate thing is to keep as much information as possible, which would be to take k equal to d-- that's as much information as you want. But it's essentially the same question as, well, if I want to compress a JPEG image, how much information should I keep so it's still visible? And so there are some rules for that. But none of them is actually really a science. So it's really a matter of what you think is actually tolerable. And we're just going to start replacing this choice by maybe another parameter. So here, we're going to basically replace k by alpha, and so we just do stuff.

So the first one that people do that is probably the most popular one-- OK, the most popular one is definitely take k equal to 2 or 3, because it's just convenient to visualize. The second most popular one is the scree plot. So the scree plot-- remember, I have my values, the lambda j's. And I've chosen the lambda j's to decrease. So the indices are chosen in such a way that lambda is a decreasing function. So I have lambda 1, and let's say it's this guy here. And then I have lambda 2, and let's say it's this guy here. And then I have lambda 3, and let's say it's this guy here, lambda 4, lambda 5, lambda 6. And all I care about is that this thing decreases. The scree plot says something like this-- if there's an inflection point, meaning that you can sort of do something like this and then something like this, you should stop at 3. That's what the scree plot tells you. What it's saying, in a way, is that the percentage of the marginal increment of explained variance that you get starts to decrease after you pass this inflection point.

So let's see why I say this. Well, here, what I have-- so this ratio that you see there is actually the percentage of explained variance. So what it means is that, if I look at lambda 1 plus all the way to lambda k, and then I divide by lambda 1 plus all the way to lambda d, well, what is this? Well, this lambda 1 plus all the way to lambda d is the total amount of variance that I get in my points. That's the trace of sigma. So that's the variance in the first direction, plus the variance in the second direction, plus the variance in the third direction. That's basically all the variance that I have possible. Now, this is the variance that I kept in the first direction, this is the variance that I kept in the second direction, all the way to the variance that I kept in the k-th direction. So I know that this number is always less than or equal to 1. And it's larger than 0. And this is just the proportion, say, of variance explained by v1 to vk, or simply, the proportion of explained variance of my PCA, say.

So now, what this thing is telling me-- it says, well, if I look at this thing and I start seeing this inflection point, it's saying, oh, here, you're gaining a lot and a lot of variance. And then at some point, you stop gaining a lot in your proportion of explained variance.
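To make the quantity being discussed concrete, here is a short sketch that computes the proportion of explained variance, lambda 1 plus all the way to lambda k over lambda 1 plus all the way to lambda d, for every k; plotting the eigenvalues (or this cumulative ratio) against k is the scree plot. The data is synthetic and its spectrum is rigged to have a few dominant directions-- an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 20
X = rng.normal(size=(n, d)) @ np.diag(np.linspace(3.0, 0.2, d))  # a few dominant directions
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

lam = np.linalg.eigvalsh(S)[::-1]            # lambda_1 >= ... >= lambda_d
explained = np.cumsum(lam) / lam.sum()       # (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d)

for k, frac in enumerate(explained, start=1):
    print(f"k = {k:2d}: proportion of explained variance = {frac:.3f}")
# The scree plot is lam (or this cumulative ratio) against k; you stop around
# the inflection point, where the marginal gain starts to flatten out.
```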
So this will translate into something where, when I look at this ratio-- lambda 1 plus all the way to lambda k, divided by lambda 1 plus all the way to lambda d-- this would translate into a function that would look like this. And what it's telling you, it says, well, maybe you should stop here, because here, every time you add one, you don't get as much as you did before. You actually get smaller marginal returns. So the explained variance is the numerator of this ratio. And the total variance is the denominator. Those are pretty straightforward terms that you would want to use for this.

So if your goal is to do data visualization-- so why would you take k larger than 2? Let's say, if you take k larger than 6, you can start to imagine that you're going to have 6 choose 2 plots, which starts to be annoying. And if you have k equal to 10-- because you could start in dimension 50,000, and then k equal to 10 would be the place where you have this thing-- that's a lot of plots that you would have to show. So it's not always for data visualization. Once I've actually done this, I've actually effectively reduced the dimension of my problem. And what I could do with what I have is do a regression on those guys.

The v1-- so I forgot to tell you-- why is that called principal component analysis? Well, the vj's that I keep, v1 to vk, are called principal components. And they effectively act as the summary of my Xi's. When I mentioned image compression, I started with a point Xi that was d numbers-- let's say 50,000 numbers. And now, I'm saying, actually, you can throw out those 50,000 numbers. If you actually know only the k numbers that you need-- the 6 numbers that you need-- you're going to have something that was pretty close to getting what information you had. So in a way, there is some form of compression that's going on here.

And what you can do is that those principal components, you can actually use now for regression. If I want to regress Y onto X that's very high dimensional, before I do this, if I don't have enough points, maybe what I can actually do is to do principal component analysis on my X's first, replace them by those compressed versions, and do linear regression on those guys. And that's called principal component regression, not surprisingly. And that's something that's pretty popular. And you can do it with k equal to 10, for example.

So for data visualization, I did not find a Thanksgiving themed picture. But I found one that has turkey in it. Get it? So this is actually a gene data set that was-- so when you see something like this, you can imagine that someone has been preprocessing the hell out of this thing. This is not like, oh, I collect data on 23andMe and I'm just going to run PCA on this. It just doesn't happen like that. And so what happened is that-- so let's assume that this was a bunch of preprocessed data, which are gene expression levels-- so 500,000 genes among 1,400 Europeans. So here, I actually have fewer observations than I have dimensions.
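The principal component regression idea mentioned a moment ago-- compress the X's with PCA first, then run ordinary least squares on the compressed versions-- can be sketched as follows. This is a toy illustration on synthetic data with more dimensions than observations; it is not the gene dataset and not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 200, 1000, 10                       # fewer observations than dimensions
X = rng.normal(size=(n, d))
beta = np.zeros(d); beta[:5] = 1.0            # hypothetical true signal for the toy example
y = X @ beta + 0.1 * rng.normal(size=n)

# Step 1: PCA on the X's -- keep the top-k principal components.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n
lam, P = np.linalg.eigh(S)
Pk = P[:, np.argsort(lam)[::-1][:k]]          # d x k
Z = Xc @ Pk                                   # compressed design, n x k

# Step 2: ordinary least squares of y on the k compressed coordinates.
gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
y_hat = y.mean() + Z @ gamma                  # principal component regression fit
print(np.round(gamma, 2))
```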
And that's when you use principal component regression most of the time, so it doesn't stop you. And then what you do is you say, OK, you have those 500,000 genes among-- so here, that means that there's 1,400 points here. And I actually take those 500,000 directions. So each person has a vector of, say, 500,000 genes that are attached to them. And I project them onto two dimensions, which should be extremely lossy. I lose a lot of information. And indeed, I do, because I'm one of these guys. And I'm pretty sure I'm very different from this guy, even though probably, from an American perspective, we're all the same. But I think we have slightly different genomes.

And so the thing is, now we have this-- so you see there's lots of Swiss that participated in this. But actually, those two principal components recover sort of the map of Europe. I mean, OK, again, this is actually maybe fine-grained for you guys. But right here, there's Portugal and Spain, which are those colors. So here it is, color-coded. And here is Turkey, of course, which we know has very different genomes. So the Turks are right at the boundary. So you can see all the greens-- they stay very far apart from everything else. And then the rest here is pretty mixed. But it sort of recovers-- if you look at the colors, it sort of recovers that. So in a way, those two principal components are just the geographic feature. So if you insist on compressing all the genomic information of these people into two numbers, what you're actually going to get is longitude and latitude, which is somewhat surprising, but not so much if you think that it's been preprocessed.

So what do you do beyond practice? Well, you could try to actually study those things. If you think about it for a second, we did not do any statistics. I talked to you about IID observations, but we never used the fact that they were independent. The way we typically use independence is to have a central limit theorem, maybe. I mentioned the fact that zero covariance, if the data were Gaussian, would actually give me something which is independent. We didn't care. This was a data analysis, data mining process that we did. I give you points, and you just put them through the crank. There was an algorithm in six steps. And you just put it through, and that's what you got.

Now, of course, there's some work which says, OK, if my data is actually generated from some process-- maybe my points are multivariate Gaussian with some structure on the covariance-- how well am I recovering the covariance structure? And that's where statistics kicks in. And that's where we stop. So this is actually a bit more difficult to study. But in a way, it's not entirely satisfactory, because we could work for a couple of boards, and I would just basically sort of reverse engineer this and find some models under which it's a good idea to do that. And what are those models? Well, those are the models that sort of give you sort of prominent directions that you want to find.
And it will say, yes, if you have enough observations, you will find those directions along which your data is elongated. So that's essentially what you want to do. So that's exactly what this thing is telling you.

So where does the statistics lie? Well, everything, remember-- so actually, that's where Alana was confused-- the idea was to say, well, if I have a true covariance matrix sigma, and I never really have access to it, I'm just running PCA on the empirical covariance matrix, how do those results relate? And this is something that you can study. So for example, if n goes to infinity and the dimension of your points is fixed, then S goes to sigma in any sense you want. Maybe each entry is going to each entry of sigma, for example. So S is a good estimator. We know that the empirical covariance is a consistent estimator. And if d is fixed, this is actually not an issue. So in particular, if you run PCA on the sample covariance matrix-- you look at, say, v1-- then v1 is going to converge to the largest eigenvector of sigma as n goes to infinity, but for d fixed. And that's a story that we've known since the '60s.

More recently, people have started challenging this. Because what's happening when you fix the dimension and let the sample size go to infinity-- you're certainly not allowing for this. It's certainly not explaining to you anything about the case when d is equal to 500,000 and n is equal to 1,400. Because when d is fixed and n goes to infinity, in particular, n is much larger than d, which is not the case here. And so when n is much larger than d, things go well. But if d is not much smaller than n, it's not clear what happens. And particularly, if d is of the order of n, what's happening?

So there's an entire theory in mathematics that's called random matrix theory that studies the behavior of exactly this question-- what is the behavior of the spectrum-- the eigenvalues and eigenvectors-- of a matrix in which I put random numbers, and I let-- so the matrix I'm interested in here is the matrix of X's. When I stack all my X's next to each other, so that's a matrix of size, say, d by n-- so each column is of size d, it's one person-- so I put them there. And when I let the matrix go to infinity, I let both d and n go to infinity. But I want the aspect ratio, d/n, to go to some constant. That's what they do. And what's nice is that in the end, you have this constant-- let's call it gamma-- that shows up in all the asymptotics. And then you can replace it by d/n. And you know that you still have a handle on both the dimension and the sample size. Whereas usually, the dimension goes away, as you let n go to infinity without having the dimension go to infinity.

And so now, when this happens, as soon as d/n goes to a constant, you can show that essentially there's an angle between the largest eigenvector of sigma and the largest eigenvector of S, as n and d go to infinity. There is always an angle-- you can actually write it explicitly. And it's an angle that depends on this ratio, gamma-- the asymptotic ratio of d/n.
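A small simulation makes this angle visible. The sketch below uses a "spiked" covariance of my own choosing-- the identity plus one dominant direction-- so it is only an empirical illustration of the phenomenon described, not the explicit random matrix theory formula: the top eigenvector of S stays close to the true one when n is much larger than d, and tilts away as d/n grows.

```python
import numpy as np

rng = np.random.default_rng(5)

def angle_of_top_eigenvector(n, d, spike=2.0):
    """Angle (degrees) between the top eigenvectors of the true sigma and of S."""
    w1 = np.zeros(d); w1[0] = 1.0                      # true leading direction
    Sigma = np.eye(d) + spike * np.outer(w1, w1)       # spiked covariance
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    v1 = np.linalg.eigh(S)[1][:, -1]                   # top empirical eigenvector
    cos = abs(v1 @ w1)                                 # compare up to sign
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

for d in (10, 200, 1000):                              # n fixed, so gamma = d/n grows
    print(f"d/n = {d/1000:.2f}: angle ~ {angle_of_top_eigenvector(n=1000, d=d):.1f} degrees")
```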
And so there's been a lot of work on understanding how to correct for this, how to pay attention to this. This creates some biases that were sort of overlooked before. In particular, when I do this, this is not the proportion of explained variance when n and d are similar. This is an estimated number computed from S. This is computed from S. All these guys are computed from S. So those are actually not exactly where you want them to be. And there's some nice work that allows you to recalibrate what this ratio should be, how this ratio should be computed, so it's a better representative of what the proportion of explained variance actually is.

So then, of course, there's the question of-- so that's when d/n goes to some constant. So the best case-- so that was the '60s-- is d is fixed and n is much larger than d. And then random matrix theory tells you, well, d and n are sort of the same order of magnitude. When they go to infinity, the ratio goes to some constant. Think of it as being order 1. To be fair, if d is 100 times larger than n, it still works. And it depends on what you think the infinity is at this point. But I think the random matrix theory results are very useful.

But then even in this case, I told you that the leading eigenvector of S is actually at an angle from the leading eigenvector of sigma. So what's happening is that-- so let's say that d/n goes to some gamma. And what I claim is that, if you look at-- so that's v1, that's the v1 of S. And then there's the v1 of-- so this should be of size 1-- so that's the v1 of sigma. Then those things are going to have an angle, which is some function of gamma. It's complicated, but there's a function of gamma that you can see there. And there's some models-- when gamma goes to infinity, which means that d is now much larger than n, this angle is 90 degrees, which means that you're getting nothing. Yeah?

AUDIENCE: If d is of lower order, so like gamma is 0, is there still an angle?

PHILIPPE RIGOLLET: No, but that's consistent-- the fact that it's consistent when-- so the angle is a function--

AUDIENCE: d is not a constant [INAUDIBLE]?

PHILIPPE RIGOLLET: d is not a constant? So if d is little o of n? Then gamma goes to 0, and f of gamma goes to 0. So f of gamma is a function that-- so for example, if f of gamma-- this is the sine of the angle, for example-- then it's a function that starts at 0, and that goes like this. But as soon as gamma is positive, it goes away from 0. So now, when gamma goes to infinity, then this thing goes to a right angle, which means I'm getting just junk. So this is not my leading eigenvector.

So how do you do this? Well, just like everywhere in statistics, you have to just make more assumptions. You have to assume that you're not looking for the leading eigenvector, or the direction that carries the most variance. But you're looking, maybe, for a special direction. And that's what sparse PCA is doing. Sparse PCA is saying, I'm not looking for any direction u that carries the most variance. I'm only looking for a direction u that is sparse.
Think of it, for example, as having 10 non-zero coordinates. So that's still a lot of directions to look for. But once you do this, then you actually have-- not only-- there are a few things that you actually get from doing this. The first one is you actually essentially replace d by k, which means that n now just-- I'm sorry, let's say s non-zero coefficients. You replace d by s, which means that n only has to be much larger than s for this thing to actually work. Now, of course, you've set yourself a weaker goal. Your goal is not to find any direction, only a sparse direction. But there's something very valuable about sparse directions, which is that they actually are interpretable.

When I found the v-- let's say that the v that I found before was 0.2, and then 0.9, and then 1.1, minus 3, et cetera. So those were the coordinates of my leading eigenvector in the original coordinate system. What does it mean? Well, it means that if I see a large number, that means that this v is very close-- so that's my original coordinate system. Let's call it e1 and e2. So that's just 1, 0; and then 0, 1. Then clearly, from the coordinates of v, I can tell if my v is like this, or it's like this, or it's like this. Well, I mean, they should all be of the same size. So I can tell if it's here, or here, or here, depending on-- like here, that means I'm going to see something where the Y-coordinate is much larger than the X-coordinate. Here, I'm going to see something where the X-coordinate is much larger than the Y-coordinate. And here, I'm going to see something where the X-coordinate is about the same size as the Y-coordinate.

So when things start to be bigger, you're going to have to make choices. What does it mean to be bigger-- when d is 100,000, I mean, the sum of the squares of those guys has to be equal to 1. So they're all very small numbers. And so it's hard for you to tell which one is a big number and which one is a small number. Why would you want to know this? Because it's actually telling you that if v is very close to e1, then that means that e1-- in the case of the gene example, that would mean that e1 is the gene that's very important. Maybe there are actually just two genes that explain those two things. And those are the genes that have been picked up. There are two genes that encode geographic location, and that's it. And so it's very important for you to be able to interpret what v means. Where it has large values, it means that maybe it has large values for e1, e2, and e3. And it means that it's a combination of e1, e2, and e3. And now, you can interpret, because you have only three variables to find.

And so sparse PCA builds that in. Sparse PCA says, listen, I'm going to want to have at most 10 non-zero coefficients. And the rest, I want to be 0. I want it to be a combination of at most 10 of my original variables. And now, I can do interpretation. So the problem with sparse PCA is that it becomes very difficult numerically to solve this problem. I can write it.
64:59 So the problem is simply: maximize the variance u 65:05 transpose S u, subject to-- well, 65:09 I want to have the 2-norm of u equal to 1. 65:12 So that's the original PCA. 65:14 But now, I also want that the sum 65:16 of the indicators of the uj that are not equal to 0 65:19 is at most, say, 10. 65:23 This constraint is very non-convex. 65:26 So I can relax it to a convex one 65:28 like we did for linear regression. 65:31 But now, I've totally messed up the fact 65:33 that I could use linear algebra to solve this problem. 65:37 And so now, you have to go through much more complicated 65:40 optimization techniques, which are called 65:42 semidefinite programs, which do not 65:44 scale well in high dimensions. 65:46 And so you have to do a bunch of tricks-- 65:48 numerical tricks. 65:49 But there are some packages that implement some heuristics 65:52 or some other things-- 65:55 iterative thresholding, all sorts 65:56 of various numerical tricks that you can do. 65:58 But the problem they are trying to solve is exactly this. 66:01 Among all directions that have norm 1, of course, 66:03 because they're directions, and that have at most, say, 66:06 10 non-zero coordinates, I want to find the one that maximizes 66:09 the empirical variance. 66:10 66:23 Actually, let me just show you this. 66:27 66:41 I wanted to show you an output of PCA 66:47 where people are actually trying to do this directly-- 66:50 66:56 maybe-- there you go. 67:05 67:20 So right here, you see this is SPSS. 67:26 That's a statistical software. 67:29 And this is an output that was preprocessed 67:33 by a professional-- 67:34 not preprocessed, post-processed. 67:36 So that's something where they ran PCA. 67:38 So what is the data? 67:39 This is data where you ask doctors 67:43 what they think of the behavior of a particular sales 67:47 representative for pharmaceutical companies. 67:49 So pharmaceutical companies are trying 67:51 to improve their sales force. 67:52 And they're asking doctors how they would 67:56 rate-- what they value about their interaction 67:58 with a sales representative. 68:01 So basically, there's a bunch of questions. 68:04 One is "offers a credible point of view on trends, 68:10 provides valuable networking opportunities." 68:12 This is one question. 68:13 Rate this on a scale from 1 to 5. 68:15 That was the question. 68:16 And they had a bunch of questions like this. 68:18 And then they asked 1,000 doctors to make those ratings. 68:22 And what they want-- so each doctor now 68:24 is a vector of ratings. 68:25 And they want to know if there are different groups of doctors, 68:28 what doctors respond to. 68:30 If there are different groups, then 68:31 maybe they know that they can actually address them 68:33 separately, et cetera. 68:35 And so to do that, of course, there are lots of questions. 68:37 And so what you want is to just first project 68:39 into lower dimensions, so you can actually 68:41 visualize what's going on. 68:42 And this is what was done for this. 68:44 So these are the first three principal components 68:47 that came out. 68:49 And even though we ordered the values of the lambdas, 68:52 there's no reason why the entries of v should be ordered. 68:56 And if you look at the values of v here, 68:57 they look like they're pretty much ordered. 68:59 It starts at 0.784, and then you're at 0.3 around here. 69:04 There's something that goes up again, and then you go down. 69:06 Actually, it's marked in red every time it goes up again.
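To make the optimization problem above concrete, here is a minimal sketch in Python of the kind of iterative-thresholding heuristic mentioned: a truncated power iteration for the constrained problem "maximize u'Su over unit vectors u with at most k non-zero entries." This is a heuristic sketch, not the semidefinite relaxation; the function name sparse_pc and the toy covariance are illustrative assumptions, not from the lecture.

    import numpy as np

    def sparse_pc(S, k, n_iter=100):
        """Truncated power iteration: heuristic for max u'Su s.t. ||u||_2 = 1, at most k non-zeros."""
        u = np.linalg.eigh(S)[1][:, -1]          # warm start from the ordinary leading eigenvector
        for _ in range(n_iter):
            w = S @ u                            # one power-iteration step
            w[np.argsort(np.abs(w))[:-k]] = 0.0  # zero out all but the k largest entries in magnitude
            u = w / np.linalg.norm(w)            # project back onto the unit sphere
        return u

    # Toy usage: a covariance whose leading structure lives on the first 10 coordinates.
    rng = np.random.default_rng(1)
    d, k = 100, 10
    v = np.zeros(d)
    v[:k] = 1.0 / np.sqrt(k)
    E = rng.standard_normal((d, d))
    S = np.eye(d) + 5.0 * np.outer(v, v) + 0.05 * (E + E.T) / 2   # small symmetric perturbation
    u = sparse_pc(S, k)
    print("non-zero coordinates found:", np.flatnonzero(u))
    print("empirical variance u'Su   :", u @ S @ u)

Here k is exactly the knob of 10 discussed below: rerunning with k = 9 or k = 20 just changes the truncation level.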
69:11 And so now, what they did is they said, 69:13 OK, I need to interpret those guys. 69:16 I need to tell you what this is. 69:18 If you tell me, we found the principal component 69:21 that really discriminates the doctors into two groups, 69:24 the drug company is going to come back to you 69:26 and say, OK, what is this characteristic? 69:29 And you say, oh, it's actually a linear combination 69:31 of 40 characteristics. 69:33 And they say, well, we don't need you to do that. 69:35 I mean, it cannot be a linear combination of anything you 69:38 didn't ask. 69:39 And so for that, first of all, there's 69:41 a post-processing of PCA, which says, OK, once I have, 69:44 say, found three principal components, 69:46 that means that I found the dimension-three space onto which 69:51 I want to project my points. 69:52 In this space, I can pick any direction I want. 69:55 So the first thing is that you do 69:57 some sort of local rearrangement, so that those things 69:59 look like they are increasing and then decreasing. 70:01 So you just change, you rotate your coordinate system 70:06 in this three-dimensional space that you've actually isolated. 70:09 And so once you do this, the reason 70:11 to do that is that it sort of creates 70:13 big, sharp differences between large and small values 70:16 of the coordinates of the thing you had. 70:18 And why do you want this? 70:19 Because now, you can say, well, I'm 70:21 going to start looking at the ones that have large values. 70:23 And what do they say? 70:24 They say in-depth knowledge, in-depth knowledge, 70:26 in-depth knowledge, knowledge about. 70:28 This thing is clearly something that 70:30 actually characterizes the knowledge of my sales 70:34 representative. 70:35 And so that's something that doctors are sensitive to. 70:38 That's something that really discriminates 70:40 the doctors in a way. 70:40 There's lots of variance along those things, 70:43 or at least a lot of variance-- 70:45 I mean, doctors are separated in terms of their experience 70:47 with respect to this. 70:49 And so what they did is they said, OK, 70:51 all these guys, some of them have large values, 70:53 but I don't know how to interpret them. 70:55 And so I'm just going to put the first block together, 70:56 and I'm going to call it medical knowledge, 70:58 because all those things are knowledge about medical stuff. 71:01 Then here, I didn't know how to interpret those guys. 71:03 But those guys, there's a big clump of large coordinates, 71:06 and they're about respectful of my time, listens, friendly 71:10 but courteous. 71:12 This is all about the quality of the interaction. 71:14 So this block was actually called quality of interaction. 71:17 And then there was a third block, 71:18 which you can tell starts to be spreading a little thin. 71:21 There are just many fewer of them. 71:22 But this thing was actually called 71:24 fair and critical opinion. 71:26 And so now, you have three discriminating directions. 71:30 And you can actually give them a name. 71:31 Wouldn't it be beautiful if all the numbers in the gray box 71:34 came out non-zero and all the other numbers 71:36 came out zero-- then there would be no ad hoc choice. 71:38 I mean, this is probably an afternoon of work 71:40 to scratch out all these numbers 71:42 and put in all these color codes, et cetera. 71:44 Whereas, you could just have something that tells you, 71:47 OK, here are the non-zeros.
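The "rotate your coordinate system inside the three-dimensional space you isolated" step described above is in the spirit of a varimax rotation. Below is a standard textbook-style sketch in Python -- not necessarily what was done for this SPSS output -- which rotates the retained loadings so that each rotated direction has a few large coordinates and many near-zero ones, while spanning the same subspace.

    import numpy as np

    def varimax(L, n_iter=100, tol=1e-6):
        """Varimax rotation of a d x k loading matrix L (retained eigenvectors as columns)."""
        d, k = L.shape
        R = np.eye(k)
        obj = 0.0
        for _ in range(n_iter):
            Lr = L @ R
            # Standard varimax update via an SVD (gamma = 1 variant).
            U, s, Vt = np.linalg.svd(L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / d))
            R = U @ Vt
            if s.sum() < obj * (1 + tol):
                break
            obj = s.sum()
        return L @ R                                   # rotated loadings; same subspace as L

    # Usage sketch: rotate the three leading eigenvectors of a covariance matrix S.
    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 40))
    X[:, :5] += 2.0 * rng.standard_normal((200, 1))    # a block of correlated variables
    S = np.cov(X, rowvar=False)
    L = np.linalg.eigh(S)[1][:, -3:]                   # loadings of the 3 leading directions
    L_rot = varimax(L)
    print(np.round(L_rot[:8], 2))                      # rotated loadings of the first 8 variables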
71:49 If you can actually make a story around why this group of things 71:52 actually makes sense, such as it being medical knowledge, 71:54 then good for you. 71:55 Otherwise, you could just say, I can't. 71:57 And that's what sparse PCA does for you. 71:59 Sparse PCA outputs something where all those numbers would 72:02 be zero. 72:03 And there would be exactly, say, 10 non-zero coordinates. 72:06 And you can turn this knob of 10. 72:08 You can make it 9. 72:09 72:11 Depending on what your major is, maybe 72:13 you can actually go on with 20 of them 72:15 and have the ability to tell the story about 20 72:18 different variables and how they fit in the same group. 72:20 And depending on how you feel, it's 72:22 easy to rerun the PCA depending on the value 72:25 that you want here. 72:26 And so you could actually just come up with the one 72:28 you prefer. 72:30 And so that's the sparse PCA thing 72:32 which I'm trying to promote. 72:33 I mean, this is not super widespread. 72:35 It's a fairly new idea, maybe at most 10 years old. 72:39 And it's not completely widespread 72:40 in statistical packages. 72:42 But that's clearly what people are 72:44 trying to emulate currently. 72:46 Yes? 72:47 AUDIENCE: So what exactly does it 72:48 mean that the doctors have a lot of variance 72:50 in medical knowledge, quality of interaction, 72:53 and fair and critical opinion? 72:55 Like, it was saying that these are like the main things 73:00 that doctors vary on, that doctors care about. 73:02 Like we could sort of characterize a doctor by, oh, 73:05 he cares this much about medical knowledge, this much 73:08 about the quality of interaction, 73:09 and this much about critical opinion. 73:11 And that says most of the story about what this doctor wants 73:14 from a drug representative? 73:17 PHILIPPE RIGOLLET: Not really. 73:20 I mean, OK, let's say you pick only one. 73:22 So that means that you would take all your doctors, 73:31 and you would have one direction, which 73:33 is quality of interaction. 73:36 And there would be just spread-out points here. 73:38 73:42 So there are two things that can happen. 73:44 The first one is that there's a clump here, 73:46 and then there's a clump here. 73:49 That still represents a lot of variance. 73:50 And if this happens, you probably 73:52 want to go back in your data and see, 73:55 were these people visited by a different group 73:58 than these people, or maybe these people 74:00 have a different specialty. 74:02 74:05 I mean, you have to look back at your data 74:07 and try to understand why you would have 74:08 different groups of people. 74:09 And if it's completely evenly spread out, 74:13 then all it's saying is that, if you 74:15 want to have uniform quality of interaction, 74:18 you need to take measures on this. 74:20 You need this to not be a discriminating factor. 74:24 But I think where it really becomes interesting is not 74:26 when it's completely spread out. 74:27 It's when there's a big group here. 74:29 And then there's almost no one here, 74:30 and then there's a big group here. 74:32 And then maybe there's something you can do. 74:34 And so those two things actually give you a lot of variance. 74:40 So actually, maybe I'll talk about this. 74:47 Here, this is sort of a mixture. 74:49 You have a mixture of two different populations 74:51 of doctors.
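Before picking up the mixture point below: "turning the knob" can also be tried directly in software. scikit-learn's SparsePCA solves an l1-penalized variant of the problem written earlier rather than the exact "at most k non-zeros" constraint, but its alpha parameter plays the same role -- larger alpha tends to give fewer non-zero loadings. A small sketch on synthetic data (all numbers illustrative):

    import numpy as np
    from sklearn.decomposition import SparsePCA

    rng = np.random.default_rng(3)
    X = rng.standard_normal((1000, 40))                # 1,000 "doctors", 40 "questions" (synthetic)
    X[:, :5] += 2.0 * rng.standard_normal((1000, 1))   # five questions that move together

    for alpha in [0.1, 1.0, 5.0]:                      # the sparsity knob
        spca = SparsePCA(n_components=1, alpha=alpha, random_state=0)
        spca.fit(X)
        v = spca.components_[0]
        print(f"alpha = {alpha:4.1f}   non-zero loadings: {np.count_nonzero(v)} out of {v.size}")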
74:53 And it turns out that principal component analysis-- 74:56 so a mixture is when you have different populations-- 74:59 think of, like, two Gaussians that are just 75:02 centered at two different points, 75:03 and maybe they're in high dimensions. 75:05 And those are clusters of people, 75:07 and you want to be able to differentiate those guys. 75:09 If you're in very high dimensions, 75:10 it's going to be very difficult. But one 75:12 of the first preprocessing steps that people do is PCA. 75:14 Because if you have one big group here and one big group 75:18 here, it means that there's a lot of variance 75:19 along the direction that goes through the centers 75:21 of those groups. 75:22 And that's essentially what happened here. 75:24 You could think of this as being two blobs in high dimensions. 75:27 But you're really just projecting them 75:29 into one dimension. 75:30 And this direction, hopefully, goes through the centers. 75:33 And so as preprocessing-- so I'm going to stop here. 75:37 But PCA is not just made for dimension reduction. 75:42 It's used for mixtures, for example. 75:44 It's also used when you have graph data. 75:47 What is the idea of PCA? 75:48 It just says, if you have a matrix that seems to have low 75:53 rank-- meaning that there are a lot of those lambda i's that 75:56 are very small-- 75:57 and then I see that plus noise, then 76:00 it's a good idea to do PCA on this thing. 76:02 And in particular, people use that in networks a lot. 76:05 So you take the adjacency matrix of a graph-- 76:08 well, you sort of preprocess it a little bit, so it looks nice. 76:11 And then if you have, for example, two communities 76:13 in there, it should look like something that 76:15 is low rank plus some noise. 76:18 And low rank means that there are just very few non-zero-- 76:22 well, low rank means this. 76:24 Low rank means that if you do the scree plot, 76:26 you will see something like this, 76:27 which means that if you throw out all the smaller ones, 76:29 it should not really matter for the overall structure. 76:33 And so you can use all-- 76:35 these techniques are used everywhere these days, not 76:39 just in PCA. 76:39 So we call it PCA as statisticians. 76:41 But people call it spectral methods or SVD. 76:46 So everyone-- 76:49
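A minimal sketch of the mixture point above, on synthetic data: two Gaussian blobs in high dimension, projected onto the leading principal component of the empirical covariance. The direction through the two centers carries a lot of the variance, so the one-dimensional projection already shows the two clumps. All choices below (dimension, separation, sample size) are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    d, n = 200, 400
    mu = np.zeros(d)
    mu[0] = 4.0                                        # the two centers sit at +mu/2 and -mu/2
    labels = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, d)) + np.where(labels[:, None] == 1, mu / 2, -mu / 2)

    Xc = X - X.mean(axis=0)                            # center the cloud of points
    S = Xc.T @ Xc / n                                  # empirical covariance matrix
    v1 = np.linalg.eigh(S)[1][:, -1]                   # leading eigenvector of S
    proj = Xc @ v1                                     # project every point onto that direction

    pred = (proj > 0).astype(int)                      # split the 1-d projections at 0
    agree = max((pred == labels).mean(), ((1 - pred) == labels).mean())
    print(f"fraction of points falling on the correct side: {agree:.2f}")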