https://www.youtube.com/watch?v=a1ZCeFpeW0o&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=18

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: We keep on talking about principal component analysis, which we essentially introduced as a way to work with a bunch of data. So the data that's given to us when we want to do PCA is a bunch of vectors X1 to Xn. So they are random vectors in Rd. And what we mentioned is that we're going to be using linear algebra-- in particular, the spectral theorem-- that guarantees to us that if I look at the covariance matrix of this guy, or its empirical covariance matrix, since they're symmetric real matrices and they are positive semidefinite, there exists a diagonalization into non-negative eigenvalues.

And so here, those things live in Rd, so it's a really large space. And what we want to do is to map it down into a space that we can visualize, hopefully a space of size 2 or 3. Or if not, then we're just going to take more and start looking at subspaces altogether. So think of the case where d is large but not larger than n. So let's say you have a large number of points. The question is, is it possible to project those things onto a lower dimensional space, d prime, which is much less than d-- so think of d prime equals, say, 2 or 3-- so that you keep as much information as possible about the cloud of points that you had originally.

So again, the example that we could have is that X1 to Xn-- say Xi for patient i-- is recording a bunch of body measurements, and maybe blood pressure, some symptoms, et cetera. And then we have a cloud of n patients. And we're trying to visualize, maybe to see if-- if I could see, for example, that there are two groups of patients, maybe I would know that I have two groups of different diseases, or maybe two groups of different patients that respond differently to a particular disease or drug, et cetera. So visualizing is going to give us quite a bit of insight about what the spatial arrangement of those vectors is.

And so PCA says, well, here, of course, in this question, one thing that's not defined is what information is. And we said that one thing we might want to do when we project is that points do not collide with each other. And so that means we're trying to find directions where, after I project, the points are still pretty spread out, so I can see what's going on. And PCA says-- OK, there's many ways to answer this question-- PCA says, let's just find a subspace of dimension d prime that keeps as much covariance structure as possible. And the reason is that those directions are the ones that maximize the variance, which is a proxy for the spread. There's many, many ways to do this.
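For concreteness, here is a minimal NumPy sketch of the objects just introduced-- the empirical covariance matrix of a cloud of points and its spectral decomposition. The data, shapes, and variable names are illustrative assumptions, not anything from the lecture or the slides.

```python
import numpy as np

# Sketch of the setup: n random points in R^d, their empirical covariance,
# and its spectral decomposition. Synthetic data; names and shapes are illustrative.
rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # rows are X_1, ..., X_n in R^d

Xbar = X.mean(axis=0)
S = (X - Xbar).T @ (X - Xbar) / n          # empirical covariance matrix, d x d

# S is symmetric positive semidefinite, so the spectral theorem applies:
# S = P D P^T with P orthogonal and non-negative eigenvalues on the diagonal of D.
eigvals, P = np.linalg.eigh(S)             # eigh returns eigenvalues in increasing order
eigvals, P = eigvals[::-1], P[:, ::-1]     # reorder so that lambda_1 >= ... >= lambda_d
assert np.allclose(S, S.T)
assert np.all(eigvals >= -1e-10)           # non-negative, up to numerical error
assert np.allclose(P @ np.diag(eigvals) @ P.T, S)
```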
There's actually a Google video that was released maybe last week about the data visualization team of Google that shows you something called t-SNE, which is essentially something that tries to do that. It takes points in very high dimensions and tries to map them into lower dimensions, so that you can actually visualize them. And t-SNE is some alternative to PCA that gives another definition for the word information. I'll talk about this towards the end, how you can actually somewhat automatically extend everything we've said for PCA to an infinite family of procedures.

So how do we do this? Well, the way we do this is as follows. So remember, given those guys, we can form something which is called S, which is the sample, or the empirical, covariance matrix. And since, from a couple of slides ago, we know that S has an eigenvalue decomposition, S is equal to P D P transpose, where P is orthogonal. So that's where we use our linear algebra results. So that means that P transpose P is equal to P P transpose, which is the identity. So remember, S is a d by d matrix, and so P is also d by d. And D is diagonal.

And I'm actually going to take, without loss of generality-- I'm going to assume that D-- so it's going to be diagonal-- is going to have something that looks like lambda 1 to lambda d. Those are called the eigenvalues of S. What we know is that the lambda j's are non-negative. And actually, what I'm going to assume, without loss of generality, is that lambda 1 is larger than lambda 2, which is larger than, all the way down to, lambda d. Because in particular, this decomposition-- the spectral decomposition-- is not entirely unique. I could permute the columns of P, and I would still have an orthogonal matrix. And to balance that, I would also have to permute the entries of D. So there's as many decompositions as there are permutations. So there's actually quite a bit. But the bag of eigenvalues is unique. The set of eigenvalues is unique. The ordering is certainly not unique.

So here, I'm just going to pick-- I'm going to nail down one particular permutation-- actually, maybe two in case I have equalities. But let's say I pick one that satisfies this. And the reason why I do this is really not very important. It's just to say, I'm going to want to talk about the largest of those eigenvalues. So this is just going to be easier for me, to say that this one is lambda 1, rather than say it's lambda 7. So this is just to say that the largest eigenvalue of S is lambda 1. If I didn't do that, I would just call it maybe lambda max, and you would just know which one I'm talking about.

So what's happening now is that if I look at D, then it turns out that if I start-- so if I do P transpose Xi, I am actually projecting my Xi's-- I'm basically changing the basis for my Xi's. And now, D is the empirical covariance matrix of those guys. So let's check that.
So what it means is that if I look at-- so what I claim is that P transpose Xi-- that's a new vector, let's call it Yi, it's also in Rd-- and what I claim is that the covariance matrix of this guy is actually now this diagonal matrix, which means in particular that if they were Gaussian, then they would be independent. But I also know now that there's no correlation across coordinates of Yi.

So to prove this, let me assume that X bar is equal to 0. And the reason why I do this is because it's just annoying to carry out all this centering constantly when I talk about S. So when X bar is equal to 0, that implies that S has a very simple form. It's of the form sum from i equals 1 to n of Xi Xi transpose. So that's my S. But what I want is the S of Y. So, OK, that implies also that P transpose times X bar, which is equal to Y bar, is also equal to 0. So that means that Y has mean 0, if this is 0. So if I look at the sample covariance matrix of Y, it's just going to be something that looks like the sum of the outer products, of the Yi Yi transpose. And again, the reason why I make this assumption is so that I don't have to write minus X bar X bar transpose. But you can do it, and it's going to work exactly the same.

So now, I look at this S prime. And so what is this S prime? Well, I'm just going to replace Yi with P Xi. So it's the sum from i equals 1 to n of P Xi times P Xi transpose, which is equal to the sum from-- sorry, there's a 1/n. So it's equal to 1/n times the sum from i equals 1 to n of P Xi Xi transpose P transpose. Agree? I just said that the transpose of AB is the transpose of B times the transpose of A. And so now, I can push the sum in. P does not depend on i. So this thing here is equal to P S P transpose, because the sum of the Xi Xi transpose divided by n is S.

But what is P S P transpose? Well, we know that S is equal to-- sorry, that's P transpose. So this was with a P transpose. I'm sorry, I made an important mistake here. So Yi is P transpose Xi. So this is P transpose and P transpose here, which means that this is P transpose, and this is a double transpose, which is just nothing, and that's a transpose and nothing. So now, I write S as P D P transpose. That's the spectral decomposition that I had before. That's my eigenvalue decomposition, which means that now, if I look at S prime, it's P transpose times P D P transpose times P. But now, P transpose P is the identity, P transpose P is the identity. So this is actually just equal to D.

And again, you can check that this also works if you have to center all those guys as you go. But if you think about it, this is the same thing as saying that I just replaced Xi by Xi minus X bar. And then it's true that Yi minus Y bar is also P transpose times Xi minus X bar. So now, we have that D is the empirical covariance matrix of those guys-- the Yi's, which are P transpose Xi's. And so in particular, what it means is that if I look at the covariance of Yj, Yk-- so that's the covariance of the j-th coordinate of Y and the k-th coordinate of Y-- I'm just not putting an index.
But maybe, let's say the first one or something like this-- any of them, they're IID. Then what is this covariance? It's actually 0 if j is different from k. And the covariance between Yj and Yj, which is just the variance of Yj, is equal to lambda j-- the j-th largest eigenvalue. So the eigenvalues capture the variance of my observations in this new coordinate system. And they're completely orthogonal.

So what does that mean? Well, again, remember, if I chop off the head of my Gaussian in multiple dimensions, we said that what we started from was something that looked like this. And we said, well, there's one direction that's important-- that's this guy-- and one important one that's this guy. When I apply the transformation P transpose, what I'm doing is that I'm realigning this thing with the new axes. Or in a way, rather, to be fair, I'm not actually realigning the ellipses with the axes. I'm really realigning the axes with the ellipses. So really, what I'm doing is I'm saying, after I apply P transpose, I'm just rotating this coordinate system. So now, it becomes this guy. And now, my ellipses actually completely align. And what happens here is that this coordinate is independent of that coordinate. And that's what we write here, if they are Gaussian. I didn't really tell this-- I'm only making statements about covariances. If they are Gaussians, those imply statements about independence.

So as I said, the variance now-- lambda 1 is actually the variance of P transpose Xi. But if I look now at the-- so this is a vector, so I need to look at the first coordinate of this guy. So it turns out that doing this is actually the same thing as looking at the variance of what? Well, the first column of P times Xi. So that's the variance of-- I'm going to call it v1 transpose Xi, where the v1 to vd in Rd are eigenvectors. And each vi is associated to lambda i. So that's what we saw when we talked about this eigendecomposition a couple of slides back. That's the one here. So if I call the columns of P v1 to vd, this is what's happening.

So when I look at lambda 1, it's just the variance of Xi inner product with v1. And we made this picture when we said, well, let's say v1 is here and then Xi is here. And if v1 has unit norm, then the inner product between Xi and v1 is just the length of this guy here. So that's the length of Xi-- this is 0-- when I project it onto the direction spanned by v1. If v1 has length 2, this is really just twice this length. If v1 has length 3, it's three times this. But it turns out that since P satisfies P transpose P is equal to the identity-- that's an orthogonal matrix, that's right here-- then this is actually saying the same thing as: vj transpose vj, which is really the norm squared of vj, is equal to 1, and vj transpose vk is equal to 0 if j is different from k. The eigenvectors are orthogonal to each other, and they're actually all of norm 1. So now, I know that this is indeed a direction.
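Before moving on, the claim just derived-- that the Yi = P transpose Xi have empirical covariance exactly D-- is easy to check numerically. Here is a sketch on synthetic data, assuming NumPy; nothing in it comes from the lecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)                       # center, so X bar = 0 as in the lecture

S = X.T @ X / n                              # empirical covariance of the X_i
lam, P = np.linalg.eigh(S)
lam, P = lam[::-1], P[:, ::-1]               # lambda_1 >= ... >= lambda_d

Y = X @ P                                    # rows are Y_i^T, where Y_i = P^T X_i
S_prime = Y.T @ Y / n                        # empirical covariance of the Y_i

# S' = P^T S P = D: diagonal, with the eigenvalues on the diagonal
assert np.allclose(S_prime, np.diag(lam), atol=1e-8)
# and the empirical variance of the j-th coordinate of Y is lambda_j
assert np.allclose(Y.var(axis=0), lam, atol=1e-8)
```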
And so when I look at v1 transpose Xi, I'm really measuring exactly this length. And what is this length? It's the length of the projection of Xi onto this line. That's the line that's spanned by v1. So if I had a very high dimensional problem and I started to look at the direction v1-- let's say v1 now is not an eigenvector, it's any direction-- then if I want to do this lower dimensional projection, then I have to understand how those Xi's project onto the line that's spanned by v1, because this is all that I'm going to be keeping at the end of the day about the Xi's.

So what we want is to find the direction where those Xi's, those projections, have a lot of variance. And we know that the variance of Xi in this direction is actually exactly given by lambda 1. Sorry, that's the empirical var-- yeah, I should call it variance hat. That's the empirical variance. Everything is empirical here. We're talking about the empirical covariance matrix. And so I also have that lambda 2 is the empirical variance when I project Xi onto v2, which is the second one, just for exactly this reason.

Any questions?

So the lambda j's are going to be important for us. Lambda j measures the spread of the points when I project them onto a line, which is a one dimensional space. And so I'm going to have-- let's say I want to pick only one-- I'm going to have to find the one dimensional space that carries the most variance. And I claim that v1 is the one that actually maximizes the spread.

So the claim-- so for any direction u in Rd-- and by direction, I really just mean that the norm of u is equal to 1. I need to play fair-- I'm going to compare myself to other things of length one, so I need to play fair and look at directions of length 1. Now, if I'm interested in the empirical variance of X1 transpose-- sorry, u transpose X1, ..., u transpose Xn-- then this thing is maximized for u equals v1, where v1 is the eigenvector associated to lambda 1, and lambda 1 is not just any eigenvalue, it's the largest of all those. So it's the largest eigenvalue.

So why is that true? Well, there's also a claim that for any direction u-- so that's 1 and 2-- the variance of u transpose X-- now, this is just a random variable, and I'm talking about the true variance-- this is maximized for u equals, let's call it w1, where w1 is the eigenvector of sigma-- now, I'm talking about the true variance, whereas here, I was talking about the empirical variance-- so w1 is the eigenvector of the true sigma associated to the largest eigenvalue of sigma. So I did not give it a name. Here, that was lambda 1 for the empirical one. For the true one, you can give it another name, mu 1 if you want. But that's just the same thing. All it's saying is, wherever I see empirical, I can remove it.

So why is this claim true? Well, let's look at the second one, for example. So what is the variance of u transpose X? So that's what I want to know. So that's the expectation-- so let's assume that X is 0, again, for the same reasons as before.
So what is the variance? It's just the expectation of the square. I don't need to remove the expectation. And the expectation of the square is just the expectation of u transpose X, and then I'm going to write the other one, X transpose u. And we know that this is deterministic. So I'm just going to take that this is just u transpose, expectation of X X transpose, u. And what is this guy? That's the covariance, sigma. That's just what sigma is. So the variance I can write as u transpose sigma u. We've made this computation before.

And now what I want to claim is that this thing is actually less than the largest eigenvalue, which I actually called lambda 1 here-- I should probably not-- and the P is-- well, OK, let's just pretend everything is not empirical. So now, I'm going to write sigma as P, times the diagonal matrix lambda 1 to lambda d, times P transpose. That's just the eigendecomposition, where I admittedly reuse the same notation as I did for S. So I should really put some primes everywhere, so you know those are things that are actually different in practice. So this is just the decomposition of sigma.

You seem confused, Helen. You have a question? Yeah?

AUDIENCE: What is-- when you talked about the empirical data and--

PHILIPPE RIGOLLET: So OK-- so I can make everything I'm saying-- I can talk about either the variance or the empirical variance. And you can just add the word empirical in front of it whenever you want. The same thing works. But just for the sake of removing the confusion, let's just do it again with S. So I'm just going to do everything with S. So I'm going to assume that X bar is equal to 0. And here, I'm going to talk about the empirical variance, which is just 1/n times the sum from i equals 1 to n of u transpose Xi squared. So it's the same thing. Everywhere you see an expectation, you just put in an average. And then I get 1/n times the sum from i equals 1 to n of Xi Xi transpose. And now, I'm going to call this guy S, because that's what it is. So this is u transpose S u. But just to see that I could just replace the expectation by averages everywhere-- you can tell that the thing is going to work for either one or the other.

So now, this thing was actually-- so now, I don't have any problem with my notation. This is actually the decomposition of S. That's just the spectral decomposition into its eigenvalues. And so now, what I have is that when I look at u transpose S u, this is actually equal to P transpose u, transposed, times-- sorry, that's not S, that's D, that's this diagonal matrix-- times P transpose u. There's a transpose somewhere. That's this guy, and that's this guy.

Let's look at this thing. And let's call P transpose u-- let's call it b. So that's also a vector in Rd. What is it? It's just, I take a unit vector, and then I apply P transpose to it. So that's basically what happens to a unit vector when I apply the same change of basis that I did. So I'm just changing my orthogonal system the same way I did for the other ones. So what's happening when I write this?
Well, now I have that u transpose S u is b transpose D b. But now, doing b transpose D b, when D is diagonal and b is a vector, is a very simple thing. I can expand it. This is what? This is just the sum from j equals 1 to d of lambda j bj squared. So that's just like matrix-vector multiplication. And in particular, I know that the largest of those guys is lambda 1, and those guys are all non-negative. So this thing is actually less than lambda 1 times the sum from j equals 1 to d of lambda j squared-- sorry, bj squared. And this is just the norm of b squared.

So if I want to prove what's on the slide, all I need to check is that b has norm, which is--

AUDIENCE: 1.

PHILIPPE RIGOLLET: At most, 1. It's going to be at most 1. Why? Well, because b is really just a change of basis for u. And so if I take a vector, I'm just changing its basis. I'm certainly not changing its length-- think of a rotation, and I can also flip it, but think of a rotation-- well, actually, for a vector, it's just going to be a rotation. And so now, I just have to check that the norm of b squared is equal to what? Well, it's equal to the norm of P transpose u squared, which is equal to u transpose P P transpose u. But P is orthogonal. So this thing is actually just the identity. So that's just u transpose u, which is equal to the norm of u squared, which is equal to 1, because I took u to have norm 1 in the first place. And so this-- you're right-- was actually of norm equal to 1. I just needed to have it less, but it's equal.

And so what I'm left with is that this thing is actually equal to lambda 1. So I know that for every u that I pick-- that has norm-- so I'm just reminding you that u here has norm squared equal to 1-- for every u that I pick, this u transpose S u is at most lambda 1. So u transpose S u is at most lambda 1. And we know that that's the variance-- that's the empirical variance-- when I project my points onto the direction spanned by u.

So now, I have an empirical variance which is at most lambda 1. But I also know that if I take u to be something very specific-- I mean, it was on the previous board-- if I take u to be equal to v1, then this thing is actually not an inequality, this is an equality. And the reason is, when I actually take u to be v1, all of these bj's are going to be 0, except for the one that's b1, which is itself equal to 1.

So I mean, we can briefly check this. But if I take v-- if u is equal to v1, what I have is that u transpose S u is equal to P transpose v1, transposed, times D times P transpose v1. But what is P transpose v1? Well, remember, P transpose is just the matrix that has vectors v1 transpose here, v2 transpose here, all the way to vd transpose here. And we know that when I take vj transpose vk, I get 0 if j is different from k. And if j is equal to k, I get 1. So P transpose v1 is equal to what? Take v1 here and multiply it. So the first coordinate is going to be v1 transpose v1, which is 1. The second coordinate is going to be v2 transpose v1, which is 0.
And so I get 0's all the way, right? So that means that this thing here is really just the vector 1, 0, ..., 0. And here, this is just the vector 1, 0, ..., 0. So when I multiply it with this guy, I am only picking up the top left element of D, which is lambda 1. So for every u, it's at most lambda 1. And for v1, it's equal to lambda 1, which means that it's maximized for u equals v1.

And that's where I said that this is the fanciest non-convex problem we know how to solve. This was a problem that was definitely non-convex. We were maximizing a convex function over a sphere. But we know that v1-- which is something-- I mean, of course, you still have to believe me that you can compute the spectral decomposition efficiently-- but essentially, if you've taken linear algebra, you know that you can diagonalize a matrix. And so you get that v1 is just the maximum. So you can find your maximum just by looking at the spectral decomposition. You don't have to do any optimization or anything like this.

So let's recap. Where are we? We've established that if I start with my empirical covariance matrix, I can diagonalize it into P D P transpose. And then if I take the eigenvector associated to the largest eigenvalue-- so if I permute the columns of P and of D in such a way that they are ordered from the largest to the smallest when I look at the diagonal elements of D-- then if I pick the first column of P, it's v1. And v1 is the direction on which, if I project my points, they are going to carry the most empirical variance. Well, that's a good way-- if I told you, pick one direction along which, if you were to project your points, they would be as spread out as possible, that's probably the one you would pick. And so that's exactly what PCA is doing for us. It says, OK, if you ask me to take d prime equal to 1, I will take v1. I will just take the direction that's spanned by v1.

And that's just, when I come back to this picture that was here before, this is v1. Of course, here, I only have two of them. So v2 has to be this guy, or this guy-- or, I mean, this thing. I mean, I only know them up to sign. But then if I have three-- think of like an olive in three dimensions-- then maybe I have one direction that's slightly more elongated than the other one. And so I'm going to pick the second one.

And so the procedure is to say, well, first, I'm going to pick v1 the same way I picked v1 in the first place. So the first direction I am taking is the leading eigenvector. And then I'm looking for a direction. Well, if I found one-- the one I'm going to want to find-- if you say you can take d prime equal to 2, you're going to need a basis for this guy. So the second one has to be orthogonal to the first one you've already picked. And so the second one you pick is the one that, among all those that are orthogonal to v1, maximizes the empirical variance when you project onto it. And it turns out that this is actually exactly v2. You don't have to redo anything again. Your eigendecomposition-- this is just the second column of P. Clearly, v2 is orthogonal to v1.
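As an aside, the maximization claim proved on the last few boards-- u transpose S u is at most lambda 1 for every unit vector u, with equality at u equals v1-- can be sanity-checked in a few lines. Again, this is a sketch on synthetic data with NumPy; the setup is an assumption of mine, not the lecture's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)
S = X.T @ X / n

lam, P = np.linalg.eigh(S)
lam1, v1 = lam[-1], P[:, -1]                 # largest eigenvalue and its eigenvector

# For random unit directions u, the empirical variance u^T S u never exceeds lambda_1 ...
for _ in range(1000):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                   # play fair: norm 1
    assert u @ S @ u <= lam1 + 1e-10

# ... and it is attained at u = v1.
assert np.isclose(v1 @ S @ v1, lam1)
```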
We just used it here. This 0 here just says that v2 is orthogonal to v1. So they're like this. And now, what I said-- what this slide tells you extra-- is that v2, among all those directions that are orthogonal-- I mean, there's still d minus 1 of them-- this is the one that maximizes the, say, residual empirical variance-- the one that was not explained by the first direction that you picked. And you can check that. I mean, it's becoming a bit more cumbersome to write down, but you can check that. If you're not convinced, please raise your concern.

I mean, basically, one way to view this-- I mean, you're not really dropping a coordinate, because v1 is not a coordinate. But let's assume actually, for simplicity, that v1 was actually equal to e1-- that the direction that carries the most variance is the one that just says, just look at the first coordinate of X. So if that was the case, then clearly the orthogonal directions are the ones that comprise only the coordinates 2 to d. So you could actually just drop the first coordinate and do the same thing on a slightly shorter vector of length d minus 1. And then you would just look at the largest eigenvector of these guys, et cetera, et cetera. So in a way, that's what's happening, except that you rotate it before you actually do this. And that's exactly what's happening.

So what we put together here is essentially three things. One was statistics. Statistics says, if you want spread, if you want information, you should be looking at variance. The second one was optimization. Optimization said, well, if you want to maximize spread, well, you have to maximize variance in a certain direction. And that means maximizing over the sphere of vectors that have unit norm. And that's an optimization problem, which actually turned out to be difficult. But then the third thing that we used to solve this problem was linear algebra. Linear algebra said, well, it looks like it's a difficult optimization problem, but it turns out that the answer comes in almost-- I mean, it's not a closed form, but those things are so commonly used that it's almost a closed form-- it says, just pick the eigenvectors in order of their associated eigenvalues, from largest to smallest.

And that's why principal component analysis has been so popular and has gained a huge amount of traction since we've had computers that allow us to compute eigenvalues and eigenvectors for matrices of gigantic sizes. You can actually do that. If I give you-- I don't know, this Google video, for example, is talking about words. They want to do, say, principal component analysis of words. So I give you all the words in the dictionary. And-- sorry, well, you would have to have a representation for words, so it's a little more difficult. But how do I do this? Let's say, for example, pages of a book. I want to understand the pages of a book. And I need to turn it into a number. And a page of a book is basically the word count. So I just count the number of times "the" shows up, the number of times "and" shows up, the number of times "dog" shows up.
And so that gives me a vector. It's in pretty high dimensions. It's as many dimensions as there are words in the dictionary. And now, I want to visualize how those pages get together-- are two pages very similar or not? And so what you would do is essentially just compute the largest eigenvector of this matrix-- maybe the two largest-- and then project this onto a plane. Yeah?

AUDIENCE: Can we assume the number of points was far larger than the dimension?

PHILIPPE RIGOLLET: Yeah, but there are many pages in the world. There are probably more pages in the world than there are words in the dictionary. Yeah, so of course, if you are in high dimensions and you don't have enough points, it's going to be clearly an issue. If you have two points, then the leading eigenvector is going to be just the line that goes through those two points, regardless of what the dimension is. And clearly, you're not learning anything.

So you have to pick, say, the k largest ones. If you go all the way, you're just reordering your thing, and you're not actually gaining anything. You start from d and you go to d. So at some point, this procedure has to stop. And let's say it stops at k. Now, of course, you should ask me a question, which is, how do you choose k? So that's, of course, a natural question. Probably the basic answer is just pick k equals 3, because you can actually visualize it. But what happens if I take k equal to 4? If I take k equal to 4, I'm not going to be able to plot points in four dimensions. Well, I could-- I could add color, or I could try to be a little smart about it. But it's actually quite difficult.

And so what people tend to do, if you have four dimensions, they actually do a bunch of two dimensional plots. And that's what a computer-- a computer is not very good-- I mean, by default, they don't spit out three dimensional plots. So let's say they want to plot only two dimensional things. So they're going to take the first directions of, say, v1, v2. Let's say you have three, but you want to have only two dimensional plots. And then it's going to do v1, v3; and then v2, v3. So really, you take all three of them, but it's really just showing you all choices of pairs of those guys. So if you were to keep k equal to 5, you would have 5 choose 2 different plots.

So this is the actual principal component algorithm, how it's implemented. And it's actually fairly simple. I mean, it looks like there's lots of steps, but really, there's only one that's important. So the first one is the input. I give you a bunch of points, x1 to xn, in d dimensions. And step two is, well, compute their empirical covariance matrix S. The points themselves, we don't really care about. We care about their empirical covariance matrix. So it's a d by d matrix. Now, I'm going to feed that-- and that's where the actual computation starts happening-- I'm going to feed that to something that knows how to diagonalize this matrix.
And you have to trust me-- if I want to compute the k largest eigenvalues and my matrix is d by d, it's going to take me about k times d squared operations. So if I want only three, it's 3 times d squared, which is about-- d squared is the time it takes me to just even read the matrix sigma. So that's not too bad. So what it's going to spit out, of course, is the diagonal matrix D. And those are nice, because they allow me to tell what the order is in which I should be taking the columns of P. But what's really important to me is v1 to vd, because those are going to be the ones I'm going to be using to draw those plots.

And now, I'm going to say, OK, I need to actually choose some k. And I'm going to basically truncate and look only at the first k columns of P. Once I have those columns, what I want to do is to project onto the linear span of those columns. And there's actually a simple way to do this, which is just take this matrix Pk, which is really the matrix of projection onto the linear span of those k columns, and you just take Pk transpose. And then you apply this to every single one of your points.

Now, Pk transpose-- what is the size of the matrix Pk? Yeah, [INAUDIBLE]?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So Pk is just this matrix. I take the v1 and I stop at vk-- well--

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: d by k, right? Each of the columns is an eigenvector. It's of dimension d. I mean, that's a vector in the original space. So I have this d by k matrix. So all it is, is-- if I had my-- well, I'm going to talk in a second about Pk transpose. Pk transpose is just this guy, where I stop at the k-th vector. So Pk transpose is k by d. So now, when I take Yi, which is Pk transpose Xi, I end up with a point which is in k dimensions. I have only k coordinates. So I took every single one of my original points Xi, which had d coordinates, and I turned it into a point that has only k coordinates. In particular, I could have k equal to 2.

This matrix is exactly the one that projects. If you think about it for one second, this is just the matrix that says-- well, we actually did that several times. The matrix-- so that was this P transpose u that showed up somewhere. And so that's just the matrix that takes your point X in, say, three dimensions, and then just projects it down to two dimensions. And it just goes to the closest point in the subspace. Now, here, the floor is flat. But we can pick any subspace we want, depending on what the lambdas are. So the lambdas were important for us to be able to identify which columns to pick. The fact that we assumed that they were ordered tells us that we can pick the first ones. If they were not ordered, it would be just a subset of the columns, depending on what the size of the eigenvalue is. So each column is labeled.

And so then, of course, we still have this question of, how do I pick k? So there's definitely the matter of convenience. Maybe 2 is convenient. If it works for 2, you don't have to go any farther.
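Putting the six steps together, a compact sketch of the algorithm as just described-- center, form S, diagonalize, keep the top k eigenvectors, project-- might look like this in NumPy. The function name, the synthetic data, and the choice k = 2 are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n points in R^d) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                      # steps 1-2: input, then center
    n, d = Xc.shape
    S = Xc.T @ Xc / n                            # empirical covariance, d x d
    lam, P = np.linalg.eigh(S)                   # step 3: diagonalize S = P D P^T
    order = np.argsort(lam)[::-1]                # sort so that lambda_1 >= ... >= lambda_d
    lam, P = lam[order], P[:, order]
    Pk = P[:, :k]                                # steps 4-5: keep the first k columns v_1, ..., v_k
    Y = Xc @ Pk                                  # step 6: Y_i = P_k^T X_i, now in R^k
    return Y, lam, Pk

# usage on synthetic data: reduce 50-dimensional points to k = 2 coordinates for plotting
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 50))
Y, lam, Pk = pca_project(X, k=2)
print(Y.shape)                                   # (1000, 2)
```

The rows of Y are exactly the Yi = Pk transpose Xi from the slide, one per original observation.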
But you might want to say, well-- originally, I did that to actually keep as much information as possible. I know that the ultimate thing is to keep as much information as possible, which would be to take k equal to d-- that's as much information as you want. But it's essentially the same question as, well, if I want to compress a JPEG image, how much information should I keep so it's still visible? And so there are some rules for that. But none of them is actually really a science. So it's really a matter of what you think is actually tolerable. And we're just going to start replacing this choice by maybe another parameter. So here, we're going to basically replace k by alpha, and so we just do stuff.

So the first one that people do that is probably the most popular one-- OK, the most popular one is definitely take k equal to 2 or 3, because it's just convenient to visualize. The second most popular one is the scree plot. So the scree plot-- remember, I have my values, the lambda j's. And I've chosen the lambda j's to decrease. So the indices are chosen in such a way that lambda is a decreasing function. So I have lambda 1, and let's say it's this guy here. And then I have lambda 2, and let's say it's this guy here. And then I have lambda 3, and let's say it's this guy here, lambda 4, lambda 5, lambda 6. And all I care about is that this thing decreases. The scree plot says something like this-- if there's an inflection point, meaning that you can sort of do something like this and then something like this, you should stop at 3. That's what the scree plot tells you. What it's saying, in a way, is that the percentage of the marginal increment of explained variance that you get starts to decrease after you pass this inflection point.

So let's see why I say this. Well, here, what I have-- so this ratio that you see there is actually the percentage of explained variance. So what it means is that, if I look at lambda 1 plus all the way to lambda k, and then I divide by lambda 1 plus all the way to lambda d, well, what is this? Well, this lambda 1 plus all the way to lambda d is the total amount of variance that I get in my points. That's the trace of sigma. So that's the variance in the first direction, plus the variance in the second direction, plus the variance in the third direction. That's basically all the variance that I have possible. Now, this is the variance that I kept in the first direction, this is the variance that I kept in the second direction, all the way to the variance that I kept in the k-th direction. So I know that this number is always less than or equal to 1. And it's larger than 0. And this is just the proportion, say, of variance explained by v1 to vk, or simply, the proportion of explained variance of my PCA, say.

So now, what this thing is telling me-- it says, well, if I look at this thing and I start seeing this inflection point, it's saying, oh, here, you're gaining a lot and a lot of variance. And then at some point, you stop gaining a lot in your proportion of explained variance.
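To make the quantity being discussed concrete, here is a short sketch that computes the proportion of explained variance, lambda 1 plus all the way to lambda k over lambda 1 plus all the way to lambda d, for every k; plotting the eigenvalues (or this cumulative ratio) against k is the scree plot. The data is synthetic and its spectrum is rigged to have a few dominant directions-- an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 20
X = rng.normal(size=(n, d)) @ np.diag(np.linspace(3.0, 0.2, d))  # a few dominant directions
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

lam = np.linalg.eigvalsh(S)[::-1]            # lambda_1 >= ... >= lambda_d
explained = np.cumsum(lam) / lam.sum()       # (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d)

for k, frac in enumerate(explained, start=1):
    print(f"k = {k:2d}: proportion of explained variance = {frac:.3f}")
# The scree plot is lam (or this cumulative ratio) against k; you stop around
# the inflection point, where the marginal gain starts to flatten out.
```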
So this will translate into something where, when I look at this ratio-- lambda 1 plus all the way to lambda k, divided by lambda 1 plus all the way to lambda d-- this would translate into a function that would look like this. And what it's telling you, it says, well, maybe you should stop here, because here, every time you add one, you don't get as much as you did before. You actually get smaller marginal returns. So the explained variance is the numerator of this ratio. And the total variance is the denominator. Those are pretty straightforward terms that you would want to use for this.

So if your goal is to do data visualization-- so why would you take k larger than 2? Let's say, if you take k larger than 6, you can start to imagine that you're going to have 6 choose 2 plots, which starts to be annoying. And if you have k equal to 10-- because you could start in dimension 50,000, and then k equal to 10 would be the place where you have this thing-- that's a lot of plots that you would have to show. So it's not always for data visualization. Once I've actually done this, I've actually effectively reduced the dimension of my problem. And what I could do with what I have is do a regression on those guys.

The v1-- so I forgot to tell you-- why is that called principal component analysis? Well, the vj's that I keep, v1 to vk, are called principal components. And they effectively act as the summary of my Xi's. When I mentioned image compression, I started with a point Xi that was d numbers-- let's say 50,000 numbers. And now, I'm saying, actually, you can throw out those 50,000 numbers. If you actually know only the k numbers that you need-- the 6 numbers that you need-- you're going to have something that was pretty close to getting what information you had. So in a way, there is some form of compression that's going on here.

And what you can do is that those principal components, you can actually use now for regression. If I want to regress Y onto X that's very high dimensional, before I do this, if I don't have enough points, maybe what I can actually do is to do principal component analysis on my X's first, replace them by those compressed versions, and do linear regression on those guys. And that's called principal component regression, not surprisingly. And that's something that's pretty popular. And you can do it with k equal to 10, for example.

So for data visualization, I did not find a Thanksgiving themed picture. But I found one that has turkey in it. Get it? So this is actually a gene data set that was-- so when you see something like this, you can imagine that someone has been preprocessing the hell out of this thing. This is not like, oh, I collect data on 23andMe and I'm just going to run PCA on this. It just doesn't happen like that. And so what happened is that-- so let's assume that this was a bunch of preprocessed data, which are gene expression levels-- so 500,000 genes among 1,400 Europeans. So here, I actually have fewer observations than I have dimensions.
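The principal component regression idea mentioned a moment ago-- compress the X's with PCA first, then run ordinary least squares on the compressed versions-- can be sketched as follows. This is a toy illustration on synthetic data with more dimensions than observations; it is not the gene dataset and not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 200, 1000, 10                       # fewer observations than dimensions
X = rng.normal(size=(n, d))
beta = np.zeros(d); beta[:5] = 1.0            # hypothetical true signal for the toy example
y = X @ beta + 0.1 * rng.normal(size=n)

# Step 1: PCA on the X's -- keep the top-k principal components.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n
lam, P = np.linalg.eigh(S)
Pk = P[:, np.argsort(lam)[::-1][:k]]          # d x k
Z = Xc @ Pk                                   # compressed design, n x k

# Step 2: ordinary least squares of y on the k compressed coordinates.
gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
y_hat = y.mean() + Z @ gamma                  # principal component regression fit
print(np.round(gamma, 2))
```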
And that's when you use principal component regression most of the time, so it doesn't stop you. And then what you do is you say, OK, you have those 500,000 genes among-- so here, that means that there's 1,400 points here. And I actually take those 500,000 directions. So each person has a vector of, say, 500,000 genes that are attached to them. And I project them onto two dimensions, which should be extremely lossy. I lose a lot of information. And indeed, I do, because I'm one of these guys. And I'm pretty sure I'm very different from this guy, even though probably, from an American perspective, we're all the same. But I think we have slightly different genomes.

And so the thing is, now we have this-- so you see there's lots of Swiss that participated in this. But actually, those two principal components recover sort of the map of Europe. I mean, OK, again, this is actually maybe fine-grained for you guys. But right here, there's Portugal and Spain, which are those colors. So here it is, color-coded. And here is Turkey, of course, which we know has very different genomes. So the Turks are right at the boundary. So you can see all the greens-- they stay very far apart from everything else. And then the rest here is pretty mixed. But it sort of recovers-- if you look at the colors, it sort of recovers that. So in a way, those two principal components are just the geographic feature. So if you insist on compressing all the genomic information of these people into two numbers, what you're actually going to get is longitude and latitude, which is somewhat surprising, but not so much if you think that it's been preprocessed.

So what do you do beyond practice? Well, you could try to actually study those things. If you think about it for a second, we did not do any statistics. I talked to you about IID observations, but we never used the fact that they were independent. The way we typically use independence is to have a central limit theorem, maybe. I mentioned the fact that zero covariance, if the data were Gaussian, would actually give me something which is independent. We didn't care. This was a data analysis, data mining process that we did. I give you points, and you just put them through the crank. There was an algorithm in six steps. And you just put it through, and that's what you got.

Now, of course, there's some work which says, OK, if my data is actually generated from some process-- maybe my points are multivariate Gaussian with some structure on the covariance-- how well am I recovering the covariance structure? And that's where statistics kicks in. And that's where we stop. So this is actually a bit more difficult to study. But in a way, it's not entirely satisfactory, because we could work for a couple of boards, and I would just basically sort of reverse engineer this and find some models under which it's a good idea to do that. And what are those models? Well, those are the models that sort of give you sort of prominent directions that you want to find.
And it will say, yes, if you have enough observations, you will find those directions along which your data is elongated. So that's essentially what you want to do. So that's exactly what this thing is telling you.

So where does the statistics lie? Well, everything, remember-- so actually, that's where Alana was confused-- the idea was to say, well, if I have a true covariance matrix sigma, and I never really have access to it, I'm just running PCA on the empirical covariance matrix, how do those results relate? And this is something that you can study. So for example, if n goes to infinity and the dimension of your points is fixed, then S goes to sigma in any sense you want. Maybe each entry is going to each entry of sigma, for example. So S is a good estimator. We know that the empirical covariance is a consistent estimator. And if d is fixed, this is actually not an issue. So in particular, if you run PCA on the sample covariance matrix-- you look at, say, v1-- then v1 is going to converge to the largest eigenvector of sigma as n goes to infinity, but for d fixed. And that's a story that we've known since the '60s.

More recently, people have started challenging this. Because what's happening when you fix the dimension and let the sample size go to infinity-- you're certainly not allowing for this. It's certainly not explaining to you anything about the case when d is equal to 500,000 and n is equal to 1,400. Because when d is fixed and n goes to infinity, in particular, n is much larger than d, which is not the case here. And so when n is much larger than d, things go well. But if d is not much smaller than n, it's not clear what happens. And particularly, if d is of the order of n, what's happening?

So there's an entire theory in mathematics that's called random matrix theory that studies the behavior of exactly this question-- what is the behavior of the spectrum-- the eigenvalues and eigenvectors-- of a matrix in which I put random numbers, and I let-- so the matrix I'm interested in here is the matrix of X's. When I stack all my X's next to each other, so that's a matrix of size, say, d by n-- so each column is of size d, it's one person-- so I put them there. And when I let the matrix go to infinity, I let both d and n go to infinity. But I want the aspect ratio, d/n, to go to some constant. That's what they do. And what's nice is that in the end, you have this constant-- let's call it gamma-- that shows up in all the asymptotics. And then you can replace it by d/n. And you know that you still have a handle on both the dimension and the sample size. Whereas usually, the dimension goes away, as you let n go to infinity without having the dimension go to infinity.

And so now, when this happens, as soon as d/n goes to a constant, you can show that essentially there's an angle between the largest eigenvector of sigma and the largest eigenvector of S, as n and d go to infinity. There is always an angle-- you can actually write it explicitly. And it's an angle that depends on this ratio, gamma-- the asymptotic ratio of d/n.
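A small simulation makes this angle visible. The sketch below uses a "spiked" covariance of my own choosing-- the identity plus one dominant direction-- so it is only an empirical illustration of the phenomenon described, not the explicit random matrix theory formula: the top eigenvector of S stays close to the true one when n is much larger than d, and tilts away as d/n grows.

```python
import numpy as np

rng = np.random.default_rng(5)

def angle_of_top_eigenvector(n, d, spike=2.0):
    """Angle (degrees) between the top eigenvectors of the true sigma and of S."""
    w1 = np.zeros(d); w1[0] = 1.0                      # true leading direction
    Sigma = np.eye(d) + spike * np.outer(w1, w1)       # spiked covariance
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    v1 = np.linalg.eigh(S)[1][:, -1]                   # top empirical eigenvector
    cos = abs(v1 @ w1)                                 # compare up to sign
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

for d in (10, 200, 1000):                              # n fixed, so gamma = d/n grows
    print(f"d/n = {d/1000:.2f}: angle ~ {angle_of_top_eigenvector(n=1000, d=d):.1f} degrees")
```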
And so there's been a lot of work on understanding how to correct for this, how to pay attention to this. This creates some biases that were sort of overlooked before. In particular, when I do this, this is not the proportion of explained variance when n and d are similar. This is an estimated number computed from S. This is computed from S. All these guys are computed from S. So those are actually not exactly where you want them to be. And there's some nice work that allows you to recalibrate what this ratio should be, how this ratio should be computed, so it's a better representative of what the proportion of explained variance actually is.

So then, of course, there's the question of-- so that's when d/n goes to some constant. So the best case-- so that was the '60s-- is d is fixed and n is much larger than d. And then random matrix theory tells you, well, d and n are sort of the same order of magnitude. When they go to infinity, the ratio goes to some constant. Think of it as being order 1. To be fair, if d is 100 times larger than n, it still works. And it depends on what you think the infinity is at this point. But I think the random matrix theory results are very useful.

But then even in this case, I told you that the leading eigenvector of S is actually at an angle from the leading eigenvector of sigma. So what's happening is that-- so let's say that d/n goes to some gamma. And what I claim is that, if you look at-- so that's v1, that's the v1 of S. And then there's the v1 of-- so this should be of size 1-- so that's the v1 of sigma. Then those things are going to have an angle, which is some function of gamma. It's complicated, but there's a function of gamma that you can see there. And there's some models-- when gamma goes to infinity, which means that d is now much larger than n, this angle is 90 degrees, which means that you're getting nothing. Yeah?

AUDIENCE: If d is of lower order, so like gamma is 0, is there still an angle?

PHILIPPE RIGOLLET: No, but that's consistent-- the fact that it's consistent when-- so the angle is a function--

AUDIENCE: d is not a constant [INAUDIBLE]?

PHILIPPE RIGOLLET: d is not a constant? So if d is little o of n? Then gamma goes to 0, and f of gamma goes to 0. So f of gamma is a function that-- so for example, if f of gamma-- this is the sine of the angle, for example-- then it's a function that starts at 0, and that goes like this. But as soon as gamma is positive, it goes away from 0. So now, when gamma goes to infinity, then this thing goes to a right angle, which means I'm getting just junk. So this is not my leading eigenvector.

So how do you do this? Well, just like everywhere in statistics, you have to just make more assumptions. You have to assume that you're not looking for the leading eigenvector, or the direction that carries the most variance. But you're looking, maybe, for a special direction. And that's what sparse PCA is doing. Sparse PCA is saying, I'm not looking for any direction u that carries the most variance. I'm only looking for a direction u that is sparse.
Think of it, for example, as having 10 non-zero coordinates. So that's still a lot of directions to look for. But once you do this, then you actually have-- not only-- there are a few things that you actually get from doing this. The first one is you actually essentially replace d by k, which means that n now just-- I'm sorry, let's say s non-zero coefficients. You replace d by s, which means that n only has to be much larger than s for this thing to actually work. Now, of course, you've set yourself a weaker goal. Your goal is not to find any direction, only a sparse direction. But there's something very valuable about sparse directions, which is that they actually are interpretable.

When I found the v-- let's say that the v that I found before was 0.2, and then 0.9, and then 1.1, minus 3, et cetera. So those were the coordinates of my leading eigenvector in the original coordinate system. What does it mean? Well, it means that if I see a large number, that means that this v is very close-- so that's my original coordinate system. Let's call it e1 and e2. So that's just 1, 0; and then 0, 1. Then clearly, from the coordinates of v, I can tell if my v is like this, or it's like this, or it's like this. Well, I mean, they should all be of the same size. So I can tell if it's here, or here, or here, depending on-- like here, that means I'm going to see something where the Y-coordinate is much larger than the X-coordinate. Here, I'm going to see something where the X-coordinate is much larger than the Y-coordinate. And here, I'm going to see something where the X-coordinate is about the same size as the Y-coordinate.

So when things start to be bigger, you're going to have to make choices. What does it mean to be bigger-- when d is 100,000, I mean, the sum of the squares of those guys has to be equal to 1. So they're all very small numbers. And so it's hard for you to tell which one is a big number and which one is a small number. Why would you want to know this? Because it's actually telling you that if v is very close to e1, then that means that e1-- in the case of the gene example, that would mean that e1 is the gene that's very important. Maybe there are actually just two genes that explain those two things. And those are the genes that have been picked up. There are two genes that encode geographic location, and that's it. And so it's very important for you to be able to interpret what v means. Where it has large values, it means that maybe it has large values for e1, e2, and e3. And it means that it's a combination of e1, e2, and e3. And now, you can interpret, because you have only three variables to find.

And so sparse PCA builds that in. Sparse PCA says, listen, I'm going to want to have at most 10 non-zero coefficients. And the rest, I want to be 0. I want it to be a combination of at most 10 of my original variables. And now, I can do interpretation. So the problem with sparse PCA is that it becomes very difficult numerically to solve this problem. I can write it.
64:59 So the problem is simply: maximize the variance u 65:05 transpose S u, subject to-- well, 65:09 I want to have the 2-norm of u equal to 1. 65:12 So that's the original PCA. 65:14 But now, I also want that the sum 65:16 of the indicators of the uj that are not equal to 0 65:19 is at most, say, 10. 65:23 This constraint is very non-convex. 65:26 So I can relax it to a convex one 65:28 like we did for linear regression. 65:31 But now, I've totally messed up the fact 65:33 that I could use linear algebra to solve this problem. 65:37 And so now, you have to go through much more complicated 65:40 optimization techniques, which are called 65:42 semidefinite programs, which do not 65:44 scale well in high dimensions. 65:46 And so you have to do a bunch of tricks-- 65:48 numerical tricks. 65:49 But there are some packages that implement some heuristics 65:52 or some other things-- 65:55 iterative thresholding, all sorts 65:56 of various numerical tricks that you can do. 65:58 But the problem they are trying to solve is exactly this. 66:01 Among all directions that have norm 1, of course, 66:03 because they're directions, and that have at most, say, 66:06 10 non-zero coordinates, I want to find the one that maximizes 66:09 the empirical variance. 66:10 66:23 Actually, let me just show you this. 66:27 66:41 I wanted to show you an output of PCA 66:47 where people are actually trying to do this directly-- 66:50 66:56 maybe-- there you go. 67:05 67:20 So right here, you see this is SPSS. 67:26 That's a statistical software. 67:29 And this is an output that was preprocessed 67:33 by a professional-- 67:34 not preprocessed, post-processed. 67:36 So that's something where they ran PCA. 67:38 So what is the data? 67:39 This is data where you ask doctors 67:43 what they think of the behavior of a particular sales 67:47 representative for pharmaceutical companies. 67:49 So pharmaceutical companies are trying 67:51 to improve their sales force. 67:52 And they're asking doctors how they would 67:56 rate-- what they value about their interaction 67:58 with a sales representative. 68:01 So basically, there's a bunch of questions. 68:04 One is "offers a credible point of view on trends, 68:10 provides valuable networking opportunities." 68:12 This is one question. 68:13 Rate this on a scale from 1 to 5. 68:15 That was the question. 68:16 And they had a bunch of questions like this. 68:18 And then they asked 1,000 doctors to make those ratings. 68:22 And what they want-- so each doctor now 68:24 is a vector of ratings. 68:25 And they want to know if there are different groups of doctors, 68:28 what doctors respond to. 68:30 If there are different groups, then 68:31 maybe they know that they can actually address them 68:33 separately, et cetera. 68:35 And so to do that, of course, there are lots of questions. 68:37 And so what you want is to just first project 68:39 into lower dimensions, so you can actually 68:41 visualize what's going on. 68:42 And this is what was done for this. 68:44 So these are the first three principal components 68:47 that came out. 68:49 And even though we ordered the values of the lambdas, 68:52 there's no reason why the entries of v should be ordered. 68:56 And if you look at the values of v here, 68:57 they look like they're pretty much ordered. 68:59 It starts at 0.784, and then you're at 0.3 around here. 69:04 There's something that goes up again, and then you go down. 69:06 Actually, it's marked in red every time it goes up again.
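To make the optimization problem above concrete, here is a minimal sketch in Python of the kind of iterative-thresholding heuristic mentioned: a truncated power iteration for the constrained problem "maximize u'Su over unit vectors u with at most k non-zero entries." This is a heuristic sketch, not the semidefinite relaxation; the function name sparse_pc and the toy covariance are illustrative assumptions, not from the lecture.

    import numpy as np

    def sparse_pc(S, k, n_iter=100):
        """Truncated power iteration: heuristic for max u'Su s.t. ||u||_2 = 1, at most k non-zeros."""
        u = np.linalg.eigh(S)[1][:, -1]          # warm start from the ordinary leading eigenvector
        for _ in range(n_iter):
            w = S @ u                            # one power-iteration step
            w[np.argsort(np.abs(w))[:-k]] = 0.0  # zero out all but the k largest entries in magnitude
            u = w / np.linalg.norm(w)            # project back onto the unit sphere
        return u

    # Toy usage: a covariance whose leading structure lives on the first 10 coordinates.
    rng = np.random.default_rng(1)
    d, k = 100, 10
    v = np.zeros(d)
    v[:k] = 1.0 / np.sqrt(k)
    E = rng.standard_normal((d, d))
    S = np.eye(d) + 5.0 * np.outer(v, v) + 0.05 * (E + E.T) / 2   # small symmetric perturbation
    u = sparse_pc(S, k)
    print("non-zero coordinates found:", np.flatnonzero(u))
    print("empirical variance u'Su   :", u @ S @ u)

Here k is exactly the knob of 10 discussed below: rerunning with k = 9 or k = 20 just changes the truncation level.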
69:11 And so now, what they did is they said, 69:13 OK, I need to interpret those guys. 69:16 I need to tell you what this is. 69:18 If you tell me, we found the principal component 69:21 that really discriminates the doctors into two groups, 69:24 the drug company is going to come back to you 69:26 and say, OK, what is this characteristic? 69:29 And you say, oh, it's actually a linear combination 69:31 of 40 characteristics. 69:33 And they say, well, we don't need you to do that. 69:35 I mean, it cannot be a linear combination of anything you 69:38 didn't ask. 69:39 And so for that, first of all, there's 69:41 a post-processing of PCA, which says, OK, once I have, 69:44 say, found three principal components, 69:46 that means that I found the dimension-three space onto which 69:51 I want to project my points. 69:52 In this space, I can pick any direction I want. 69:55 So the first thing is that you do 69:57 some sort of local rearrangement, so that those things 69:59 look like they are increasing and then decreasing. 70:01 So you just change, you rotate your coordinate system 70:06 in this three-dimensional space that you've actually isolated. 70:09 And so once you do this, the reason 70:11 to do that is that it sort of creates 70:13 big, sharp differences between large and small values 70:16 of the coordinates of the thing you had. 70:18 And why do you want this? 70:19 Because now, you can say, well, I'm 70:21 going to start looking at the ones that have large values. 70:23 And what do they say? 70:24 They say in-depth knowledge, in-depth knowledge, 70:26 in-depth knowledge, knowledge about. 70:28 This thing is clearly something that 70:30 actually characterizes the knowledge of my sales 70:34 representative. 70:35 And so that's something that doctors are sensitive to. 70:38 That's something that really discriminates 70:40 the doctors in a way. 70:40 There's lots of variance along those things, 70:43 or at least a lot of variance-- 70:45 I mean, doctors are separated in terms of their experience 70:47 with respect to this. 70:49 And so what they did is they said, OK, 70:51 all these guys, some of them have large values, 70:53 but I don't know how to interpret them. 70:55 And so I'm just going to put the first block together, 70:56 and I'm going to call it medical knowledge, 70:58 because all those things are knowledge about medical stuff. 71:01 Then here, I didn't know how to interpret those guys. 71:03 But those guys, there's a big clump of large coordinates, 71:06 and they're about respectful of my time, listens, friendly 71:10 but courteous. 71:12 This is all about the quality of the interaction. 71:14 So this block was actually called quality of interaction. 71:17 And then there was a third block, 71:18 which you can tell starts to be spreading a little thin. 71:21 There are just many fewer of them. 71:22 But this thing was actually called 71:24 fair and critical opinion. 71:26 And so now, you have three discriminating directions. 71:30 And you can actually give them a name. 71:31 Wouldn't it be beautiful if all the numbers in the gray box 71:34 came out non-zero and all the other numbers 71:36 came out zero-- then there would be no ad hoc choice. 71:38 I mean, this is probably an afternoon of work 71:40 to scratch out all these numbers 71:42 and put in all these color codes, et cetera. 71:44 Whereas, you could just have something that tells you, 71:47 OK, here are the non-zeros.
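The "rotate your coordinate system inside the three-dimensional space you isolated" step described above is in the spirit of a varimax rotation. Below is a standard textbook-style sketch in Python -- not necessarily what was done for this SPSS output -- which rotates the retained loadings so that each rotated direction has a few large coordinates and many near-zero ones, while spanning the same subspace.

    import numpy as np

    def varimax(L, n_iter=100, tol=1e-6):
        """Varimax rotation of a d x k loading matrix L (retained eigenvectors as columns)."""
        d, k = L.shape
        R = np.eye(k)
        obj = 0.0
        for _ in range(n_iter):
            Lr = L @ R
            # Standard varimax update via an SVD (gamma = 1 variant).
            U, s, Vt = np.linalg.svd(L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / d))
            R = U @ Vt
            if s.sum() < obj * (1 + tol):
                break
            obj = s.sum()
        return L @ R                                   # rotated loadings; same subspace as L

    # Usage sketch: rotate the three leading eigenvectors of a covariance matrix S.
    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 40))
    X[:, :5] += 2.0 * rng.standard_normal((200, 1))    # a block of correlated variables
    S = np.cov(X, rowvar=False)
    L = np.linalg.eigh(S)[1][:, -3:]                   # loadings of the 3 leading directions
    L_rot = varimax(L)
    print(np.round(L_rot[:8], 2))                      # rotated loadings of the first 8 variables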
71:49 If you can actually make a story around why this group of things 71:52 actually makes sense, such as it being medical knowledge, 71:54 then good for you. 71:55 Otherwise, you could just say, I can't. 71:57 And that's what sparse PCA does for you. 71:59 Sparse PCA outputs something where all those numbers would 72:02 be zero. 72:03 And there would be exactly, say, 10 non-zero coordinates. 72:06 And you can turn this knob of 10. 72:08 You can make it 9. 72:09 72:11 Depending on what your major is, maybe 72:13 you can actually go on with 20 of them 72:15 and have the ability to tell the story about 20 72:18 different variables and how they fit in the same group. 72:20 And depending on how you feel, it's 72:22 easy to rerun the PCA depending on the value 72:25 that you want here. 72:26 And so you could actually just come up with the one 72:28 you prefer. 72:30 And so that's the sparse PCA thing 72:32 which I'm trying to promote. 72:33 I mean, this is not super widespread. 72:35 It's a fairly new idea, maybe at most 10 years old. 72:39 And it's not completely widespread 72:40 in statistical packages. 72:42 But that's clearly what people are 72:44 trying to emulate currently. 72:46 Yes? 72:47 AUDIENCE: So what exactly does it 72:48 mean that the doctors have a lot of variance 72:50 in medical knowledge, quality of interaction, 72:53 and fair and critical opinion? 72:55 Like, it was saying that these are like the main things 73:00 that doctors vary on, that doctors care about. 73:02 Like we could sort of characterize a doctor by, oh, 73:05 he cares this much about medical knowledge, this much 73:08 about the quality of interaction, 73:09 and this much about critical opinion. 73:11 And that says most of the story about what this doctor wants 73:14 from a drug representative? 73:17 PHILIPPE RIGOLLET: Not really. 73:20 I mean, OK, let's say you pick only one. 73:22 So that means that you would take all your doctors, 73:31 and you would have one direction, which 73:33 is quality of interaction. 73:36 And there would be just spread-out points here. 73:38 73:42 So there are two things that can happen. 73:44 The first one is that there's a clump here, 73:46 and then there's a clump here. 73:49 That still represents a lot of variance. 73:50 And if this happens, you probably 73:52 want to go back in your data and see, 73:55 were these people visited by a different group 73:58 than these people, or maybe these people 74:00 have a different specialty. 74:02 74:05 I mean, you have to look back at your data 74:07 and try to understand why you would have 74:08 different groups of people. 74:09 And if it's completely evenly spread out, 74:13 then all it's saying is that, if you 74:15 want to have uniform quality of interaction, 74:18 you need to take measures on this. 74:20 You need this to not be a discriminating factor. 74:24 But I think where it really becomes interesting is not 74:26 when it's completely spread out. 74:27 It's when there's a big group here. 74:29 And then there's almost no one here, 74:30 and then there's a big group here. 74:32 And then maybe there's something you can do. 74:34 And so those two things actually give you a lot of variance. 74:40 So actually, maybe I'll talk about this. 74:47 Here, this is sort of a mixture. 74:49 You have a mixture of two different populations 74:51 of doctors.
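Before picking up the mixture point below: "turning the knob" can also be tried directly in software. scikit-learn's SparsePCA solves an l1-penalized variant of the problem written earlier rather than the exact "at most k non-zeros" constraint, but its alpha parameter plays the same role -- larger alpha tends to give fewer non-zero loadings. A small sketch on synthetic data (all numbers illustrative):

    import numpy as np
    from sklearn.decomposition import SparsePCA

    rng = np.random.default_rng(3)
    X = rng.standard_normal((1000, 40))                # 1,000 "doctors", 40 "questions" (synthetic)
    X[:, :5] += 2.0 * rng.standard_normal((1000, 1))   # five questions that move together

    for alpha in [0.1, 1.0, 5.0]:                      # the sparsity knob
        spca = SparsePCA(n_components=1, alpha=alpha, random_state=0)
        spca.fit(X)
        v = spca.components_[0]
        print(f"alpha = {alpha:4.1f}   non-zero loadings: {np.count_nonzero(v)} out of {v.size}")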
74:53 And it turns out that principal component analysis-- 74:56 so a mixture is when you have different populations-- 74:59 think of, like, two Gaussians that are just 75:02 centered at two different points, 75:03 and maybe they're in high dimensions. 75:05 And those are clusters of people, 75:07 and you want to be able to differentiate those guys. 75:09 If you're in very high dimensions, 75:10 it's going to be very difficult. But one 75:12 of the first preprocessing steps that people do is PCA. 75:14 Because if you have one big group here and one big group 75:18 here, it means that there's a lot of variance 75:19 along the direction that goes through the centers 75:21 of those groups. 75:22 And that's essentially what happened here. 75:24 You could think of this as being two blobs in high dimensions. 75:27 But you're really just projecting them 75:29 into one dimension. 75:30 And this direction, hopefully, goes through the centers. 75:33 And so as preprocessing-- so I'm going to stop here. 75:37 But PCA is not just made for dimension reduction. 75:42 It's used for mixtures, for example. 75:44 It's also used when you have graph data. 75:47 What is the idea of PCA? 75:48 It just says, if you have a matrix that seems to have low 75:53 rank-- meaning that there are a lot of those lambda i's that 75:56 are very small-- 75:57 and then I see that plus noise, then 76:00 it's a good idea to do PCA on this thing. 76:02 And in particular, people use that in networks a lot. 76:05 So you take the adjacency matrix of a graph-- 76:08 well, you sort of preprocess it a little bit, so it looks nice. 76:11 And then if you have, for example, two communities 76:13 in there, it should look like something that 76:15 is low rank plus some noise. 76:18 And low rank means that there are just very few non-zero-- 76:22 well, low rank means this. 76:24 Low rank means that if you do the scree plot, 76:26 you will see something like this, 76:27 which means that if you throw out all the smaller ones, 76:29 it should not really matter for the overall structure. 76:33 And so you can use all-- 76:35 these techniques are used everywhere these days, not 76:39 just in PCA. 76:39 So we call it PCA as statisticians. 76:41 But people call it spectral methods or SVD. 76:46 So everyone-- 76:49
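A minimal sketch of the mixture point above, on synthetic data: two Gaussian blobs in high dimension, projected onto the leading principal component of the empirical covariance. The direction through the two centers carries a lot of the variance, so the one-dimensional projection already shows the two clumps. All choices below (dimension, separation, sample size) are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    d, n = 200, 400
    mu = np.zeros(d)
    mu[0] = 4.0                                        # the two centers sit at +mu/2 and -mu/2
    labels = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, d)) + np.where(labels[:, None] == 1, mu / 2, -mu / 2)

    Xc = X - X.mean(axis=0)                            # center the cloud of points
    S = Xc.T @ Xc / n                                  # empirical covariance matrix
    v1 = np.linalg.eigh(S)[1][:, -1]                   # leading eigenvector of S
    proj = Xc @ v1                                     # project every point onto that direction

    pred = (proj > 0).astype(int)                      # split the 1-d projections at 0
    agree = max((pred == labels).mean(), ((1 - pred) == labels).mean())
    print(f"fraction of points falling on the correct side: {agree:.2f}")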