https://www.youtube.com/watch?v=WW3ZJHPwvyg&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=17 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 01:14 PHILIPPE RIGOLLET: --bunch of x's and a bunch of y's. 01:17 The y's were univariate, just one real 01:20 valued random variable. 01:21 And the x's were vectors that described a bunch of attributes 01:24 for each of our individuals or each of our observations. 01:27 Let's assume now that we're given essentially only the x's. 01:30 This is sometimes referred to as unsupervised learning. 01:33 There is just the x's. 01:35 Usually, supervision is done by the y's. 01:38 And so what you're trying to do is to make sense of this data. 01:41 You're going to try to understand this data, 01:43 represent this data, visualize this data, 01:47 try to understand something, right? 01:48 So, if I give you a d-dimensional random vector, 01:52 and you're going to have n independent copies 01:54 of this individual-- of this random vector, OK? 01:57 So you will see that I'm going to have-- 01:59 I'm going to very quickly run into some limitations 02:02 about what I can actually draw on the board 02:04 because I'm using [? boldface ?] here. 02:05 I'm also going to use the blackboard [? boldface. ?] 02:08 So it's going to be a bit difficult. 02:09 So tell me if you're actually a little confused by what 02:15 is a vector, what is a number, and what is a matrix. 02:17 But we'll get there. 02:19 So I have X in Rd, and that's a random vector. 02:22 02:26 And I have X1 to Xn that are IID. 02:30 They're independent copies of X. OK, 02:37 so you can think of those as being-- 02:40 the realizations of these guys are 02:41 going to be a cloud of n points in R to the d. 02:51 And we're going to think of d as being fairly large. 02:54 And for this to start to make sense, 02:55 we're going to think of d as being at least 4, OK? 02:59 And meaning that you're going to have a hard time 03:01 visualizing those things. 03:03 If it was 3 or 2, you would be able to draw these points. 03:06 And that's pretty much as much sense as 03:08 you're going to be making about those guys, 03:09 just looking at the [INAUDIBLE] 03:12 All right, so I'm going to write each of those X's, right? 03:16 So this vector, X, has d coordinates. 03:20 And I'm going to write them as X1 to Xd. 03:25 03:30 And I'm going to stack them into a matrix, OK? 03:34 So once I have those guys, I'm going to have a matrix. 03:38 But here, I'm going to use the double bar. 03:40 And it's X1 transpose, Xn transpose. 03:47 So what it means is that the coordinates of this guy, 03:51 of course, are X1,1. 03:53 Here, I have-- 03:54 I'm of size d, so I have X1d. 03:57 And here, I have Xn1. 04:01 Xnd. 04:02 And so the i-th, j-th-- 04:06 the entry in the i-th row and j-th column of the matrix is Xij, right-- 04:10 the j-th coordinate of the i-th observation. 04:12 04:23 OK, so each-- so the rows here are the observations. 04:28 And the columns are the covariates, or attributes. 04:32 OK? 04:32 So this is an n by d matrix. 04:34 04:39 All right, this is really just some bookkeeping. 04:41 How do we store this data somehow?
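If it helps to see this bookkeeping concretely, here is a minimal sketch in Python/NumPy (my choice; the lecture itself only mentions MATLAB later, and the sample numbers and names below are made up): n observations of a d-dimensional vector stacked as the rows of an n-by-d array.

```python
import numpy as np

# Hypothetical example: n = 5 observations of a d = 3 dimensional random vector.
# Rows are the observations X_1, ..., X_n; columns are the covariates (attributes).
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))   # the n-by-d data matrix ("double-bar X" on the board)

print(X.shape)   # (5, 3)
print(X[0])      # first observation X_1, a vector in R^d
print(X[:, 2])   # third covariate across all n observations
```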
04:43 And the fact that we use a matrix just like for regression 04:46 is going to be convenient because we're going to able 04:48 to talk about projections-- 04:50 going to be able to talk about things like this. 04:53 All right, so everything I'm going to say now 04:56 is about variances or covariances 04:59 of those things, which means that I need two moments, OK? 05:01 If the variance does not exist, there's 05:03 nothing I can say about this problem. 05:05 So I'm going to assume that the variance exists. 05:07 And one way to just put it to say 05:09 that the two norm of those guys is 05:12 finite, which is another way to say that each of them 05:15 is finite. 05:15 I mean, you can think of it the way you want. 05:18 All right, so now, the mean of X, right? 05:21 So I have a random vector. 05:22 So I can talk about the expectation of X. 05:26 That's a vector that's in Rd. 05:29 And that's just taking the expectation entrywise. 05:33 Sorry. 05:34 05:42 X1, Xd. 05:45 OK, so I should say it out loud. 05:49 For this, the purpose of this class, 05:51 I will denote by subscripts the indices that 05:55 corresponds to observations. 05:57 And superscripts, the indices that correspond to 06:02 coordinates of a variable. 06:04 And I think that's the same convention that we 06:07 took for the regression case. 06:10 Of course, you could use whatever you want. 06:12 If you want to put commas, et cetera, 06:13 it becomes just a bit more complicated. 06:16 All right, and so now, once I have this, 06:18 so this tells me where my cloud of point is centered, right? 06:21 So if I have a bunch of points-- 06:24 OK, so now I have a distribution on Rd, 06:27 so maybe I should talk about this-- 06:29 I'll talk about this when we talk 06:31 about the empirical version. 06:32 But if you think that you have, say, 06:34 a two-dimensional Gaussian random variable, 06:36 then you have a center in two dimension, which 06:38 is where it peaks, basically. 06:41 And that's what we're talking about here. 06:43 But the other thing we want to know 06:44 is how much does it spread in every direction, right? 06:47 So in every direction of the two dimensional thing, 06:49 I can then try to understand how much spread I'm getting. 06:52 And the way you measure this is by using covariance, right? 06:54 So the covariance matrix, sigma-- 07:02 that's a matrix which is d by d. 07:05 And it records-- in the j, k-th entry, 07:08 it records the covariance between the j-th coordinate 07:10 of X and the k-th coordinate of X, OK? 07:13 So with entries-- 07:14 07:21 OK, so I have sigma, which is sigma 1,1, sigma dd, sigma 1d, 07:30 sigma d1. 07:31 07:34 OK, and here I have sigma jk And sigma jk 07:39 is just the covariance between Xj, the j-th coordinate 07:48 and the k-th coordinate. 07:52 OK? 07:52 So in particular, it's symmetric because the covariance 07:55 between Xj and Xk is the same as the covariance between Xk 07:57 and Xj. 07:58 I should not put those parentheses here. 08:01 I do not use them in this, OK? 08:05 Just the covariance matrix. 08:06 So that's just something that records everything. 08:09 And so what's nice about the covariance matrix 08:10 is that if I actually give you X as a vector, 08:13 you actually can build the matrix just 08:15 by looking at vectors times vectors transpose, 08:18 rather than actually thinking about building 08:20 it coordinate by coordinate. 
08:21 So for example, if you're used to using MATLAB, 08:23 that's the way you want to build a covariance matrix 08:26 because MATLAB is good at manipulating vectors 08:29 and matrices rather than just entering it entry by entry. 08:33 OK, so, right? 08:34 So, what is the covariance between Xj and Xk? 08:42 Well by definition, it's the expectation of Xj and Xk 08:51 minus the expectation of Xj times the expectation of Xk, 09:01 right? 09:01 That's the definition of the covariance. 09:03 I hope everybody's seeing that. 09:05 And so, in particular, I can actually 09:08 see that this thing can be written as-- 09:10 sigma can now be written as the expectation 09:14 of XX transpose minus the expectation of X 09:21 times the expectation of X transpose. 09:25 Why? 09:26 Well, let's look at the jk-th coefficient of this guy, right? 09:29 So here, if I look at the jk-th coefficient, I see what? 09:35 Well, I see that it's the expectation 09:38 of XX transpose jk, which is equal to the expectation of XX 09:50 transpose jk. 09:53 And what are the entries of XX transpose? 09:56 Well, they're of the form, Xj times Xk exactly. 10:00 So this is actually equal to the expectation of Xj times Xk. 10:02 10:09 And this is actually not the way I want to write it. 10:11 I want to write it-- 10:12 10:15 OK? 10:16 Is that clear? 10:17 That when I have a rank 1 matrix of this form, XX transpose, 10:20 the entries are of this form, right? 10:21 Because if I take-- 10:23 for example, think about x, y, z, and then 10:28 I multiply by x, y, z. 10:32 What I'm getting here is x-- 10:36 maybe I should actually use indices here. 10:40 x1, x2, x3. 10:42 x1, x2, x3. 10:44 The entries are x1x1, x1x2, x1x3; x2x1, x2x2, x2x3; x3x1, 10:57 x3x2, x3x3, OK? 11:04 So indeed, this is exactly of the form if you look at jk, 11:08 you get exactly Xj times Xk, OK? 11:12 So that's the beauty of those matrices. 11:15 So now, once I have this, I can do exactly the same thing, 11:19 except that here, if I take the jk-th entry, 11:23 I will get exactly the same thing, 11:25 except that it's not going to be the expectation of the product, 11:27 but the product of the expectation, right? 11:29 So I get that the jk-th entry of E of X, E of X transpose, 11:36 is just the j-th entry of E of X times the k-th entry of E of X. 11:48 So if I put those two together, it's actually telling me 11:52 that if I look at the j, k-th entry of sigma, 11:56 which I called little sigma jk, then 11:59 this is actually equal to what? 12:01 It's equal to the first term minus the second term. 12:04 The first term is the expectation of Xj, Xk 12:11 minus the expectation of Xj, expectation of Xk, which-- 12:18 oh, by the way, I forgot to say this is actually 12:20 equal to the expectation of Xj times the expectation of Xk 12:26 because that's just the definition of the expectation 12:28 of random vectors. 12:28 So my j and my k are now inside. 12:31 And that's by definition the covariance between Xj and Xk, 12:37 OK? 12:39 So just if you've seen those manipulations between vectors, 12:43 hopefully you're bored out of your mind. 12:45 And if you have not, then that's something 12:47 you just need to get comfortable with, right? 12:51 So one thing that's going to be useful 12:52 is to know very quickly what's called 12:55 the outer product of a vector with itself, which 12:57 is the vector of times the vector transpose, what 12:59 the entries of these things are. 13:01 And that's what we've been using on this second set of boards. 
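To see this outer-product bookkeeping numerically, here is a small sketch (Python/NumPy rather than the MATLAB mentioned above, and the particular Sigma and sample sizes are made up). It checks that x x-transpose has entries x_j x_k, and approximates Sigma = E[XX^T] - E[X]E[X]^T by replacing expectations with averages over simulated draws, anticipating the empirical version discussed a few minutes later.

```python
import numpy as np

rng = np.random.default_rng(1)

# A single realization x in R^3: the outer product x x^T is the rank-1
# matrix whose (j, k) entry is x_j * x_k.
x = rng.normal(size=3)
assert np.allclose(np.outer(x, x), x[:, None] * x[None, :])

# Approximate Sigma = E[X X^T] - E[X] E[X]^T by Monte Carlo averages.
d = 3
A = rng.normal(size=(d, d))
Sigma_true = A @ A.T                      # a valid covariance matrix (made up)
Xs = rng.multivariate_normal(mean=np.zeros(d), cov=Sigma_true, size=100_000)

EXX = (Xs[:, :, None] * Xs[:, None, :]).mean(axis=0)   # average of X X^T over draws
EX = Xs.mean(axis=0)
Sigma_hat = EXX - np.outer(EX, EX)

print(np.round(Sigma_true, 2))
print(np.round(Sigma_hat, 2))             # close to Sigma_true
```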
13:06 OK, so everybody agrees now that we've 13:08 sort of showed that the covariance matrix can 13:11 be written in this vector form. 13:14 So expectation of XX transpose minus expectation 13:17 of X, expectation of X transpose. 13:19 13:22 OK, just like the covariance can be written in two ways, 13:28 right we know that the covariance can also 13:30 be written as the expectation of Xj minus expectation of Xj 13:39 times Xk minus expectation of Xk, right? 13:45 That's the-- sometimes, this is the original definition 13:50 of covariance. 13:50 This is the second definition of covariance. 13:52 Just like you have the variance which 13:54 is the expectation of the square of X minus c of X, 13:57 or the expectation X squared minus the expectation of X 14:00 squared. 14:01 It's the same thing for covariance. 14:03 And you can actually see this in terms of vectors, right? 14:11 So this actually implies that you can also rewrite sigma 14:14 as the expectation of X minus expectation of X 14:21 times the same thing transpose. 14:23 14:32 Right? 14:32 And the reason is because if you just distribute those guys, 14:35 this is just the expectation of XX transpose 14:43 minus X, expectation of X transpose minus expectation 14:54 of XX transpose. 14:59 And then I have plus expectation of X, 15:03 expectation of X transpose. 15:05 15:09 Now, things could go wrong because the main difference 15:13 between matrices slash vectors and numbers is 15:18 that multiplication does not commute, right? 15:21 So in particular, those two things are not the same thing. 15:25 And so that's the main difference that we have before, 15:27 but it actually does not matter for our problem. 15:30 It's because what's happening is that if when 15:32 I take the expectation of this guy, then 15:34 it's actually the same as the expectation of this guy, OK? 15:38 And so just because the expectation is linear-- 15:43 15:48 so what we have is that sigma now 15:50 becomes equal to the expectation of XX transpose 15:55 minus the expectation of X, expectation 15:59 of X transpose minus expectation of X, 16:03 expectation of X transpose. 16:07 And then I have-- 16:10 well, really, what I have is this guy. 16:14 And then I have plus the expectation 16:15 of X, expectation of X transpose. 16:19 16:23 And now, those three things are actually equal to each other 16:28 just because the expectation of X transpose 16:30 is the same as the expectation of X transpose. 16:34 And so what I'm left with is just 16:35 the expectation of XX transpose minus the expectation of X, 16:44 expectation of X transpose, OK? 16:49 So same thing that's happening when 16:51 you want to prove that you can write 16:53 the covariance either this way or that way. 16:57 The same thing happens for matrices, or for vectors, 17:00 right, or a covariance matrix. 17:02 They go together. 17:04 Is there any questions so far? 17:05 And if you have some, please tell me, because I want to-- 17:09 I don't know to which extent you guys are comfortable with this 17:12 at all or not. 17:13 17:16 OK, so let's move on. 17:19 All right, so of course, this is what 17:23 I'm describing in terms of the distribution right here. 17:26 I took expectations. 17:28 Covariances are also expectations. 17:30 So those depend on some distribution of X, right? 17:32 If I wanted to compute that, I would basically 17:34 need to know what the distribution of X is. 
17:36 Now, we're doing statistics, so I 17:37 need to [INAUDIBLE] my question is going to be to say, well, 17:41 how well can I estimate the covariance matrix itself, 17:44 or some properties of this covariance matrix 17:47 based on data? 17:48 All right, so if I want to understand 17:50 what my covariance matrix looks like based on data, 17:52 I'm going to have to basically form 17:54 its empirical counterparts, which 17:57 I can do by doing the age-old statistical trick, which 18:02 is replace your expectation by an average, all right? 18:04 So let's just-- everything that's on the board, 18:06 you see expectation, just replace it by an average. 18:09 OK, so, now I'm going to be given X1, Xn. 18:14 So, I'm going to define the empirical mean. 18:16 18:19 OK so, really, the idea is take your expectation 18:22 and replace it by 1 over n sum, right? 18:24 And so the empirical mean is just 1 over n. 18:28 Some of the Xi's-- 18:31 I'm guessing everybody knows how to average vectors. 18:34 It's just the average of the coordinates. 18:36 So I will write this as X bar. 18:39 And the empirical covariance matrix, often called 18:51 sample covariance matrix, hence the notation, S. 18:57 Well, this is my covariance matrix, right? 18:59 Let's just replace the expectations by averages. 19:02 1 over n, sum from i equal 1 to n, of Xi, Xi transpose, minus-- 19:12 this is the expectation of X. I will replace it 19:14 by the average, which I just called X bar, X bar transpose, 19:21 OK? 19:22 And that's when I want to use the-- 19:25 that's when I want to use the notation-- 19:28 the second definition, but I could actually 19:30 do exactly the same thing using this definition here. 19:35 Sorry, using this definition right here. 19:38 So this is actually 1 over n, sum from i 19:42 equal 1 to n, of Xi minus X bar, Xi minus X bar transpose. 19:55 And those are actually-- 19:56 I mean, in a way, it looks like I 19:58 could define two different estimators, 19:59 but you can actually check. 20:01 And I do encourage you to do this. 20:03 If you're not comfortable making those manipulations, 20:05 you can actually check that those two things are actually 20:08 exactly the same, OK? 20:15 20:20 So now, I'm going to want to talk about matrices, OK? 20:25 And remember, we defined this big matrix, X, 20:27 with the double bar. 20:28 And the question is, can I express 20:31 both X bar and the sample covariance matrix 20:35 in terms of this big matrix, X? 20:37 Because right now, it's still expressed 20:39 in terms of the vectors. 20:40 I'm summing those vectors, vectors transpose. 20:43 The question is, can I just do that in a very compact way, 20:46 in a way that I can actually remove this sum term, 20:50 all right? 20:50 That's going to be the goal. 20:52 I mean, that's not a notational goal. 20:54 That's really something that we want-- 20:58 that's going to be convenient for us 20:59 just like it was convenient to talk about matrices when 21:02 we did linear regression. 21:04 21:23 OK, X bar. 21:26 We just said it's 1 over n, sum from I equal 1 to n 21:30 of Xi, right? 21:32 Now remember, what does this matrix look like? 21:35 We said that X bar-- 21:39 X is this guy. 21:40 So if I look at X transpose, the columns of this guy 21:45 becomes X1, my first observation, X2, 21:51 my second observation, all the way to Xn, my last observation, 21:54 right? 21:56 Agreed? 21:56 That's what X transpose is. 21:58 So if I want to sum those guys, I 22:00 can multiply by the all-ones vector. 
22:02 22:06 All right, so that's what the definition of the all-ones 1 22:08 vector is. 22:09 22:11 Well, it's just a bunch of 1's in Rn, in this case. 22:19 And so when I do X transpose 1, what I get is just the sum from 22:23 i equal 1 to n of the Xi's. 22:27 So if I divide by n, I get my average, OK? 22:36 So here, I definitely removed the sum term. 22:43 Let's see if with the covariance matrix, we can do the same. 22:47 Well, and that's actually a little more difficult to see, 22:53 I guess. 22:55 But let's use this definition for S, OK? 23:05 And one thing that's actually going to be-- 23:07 so, let's see for one second, what-- 23:10 so it's going to be something that involves X, 23:12 multiplying X with itself, OK? 23:14 And the question is, is it going to be 23:15 multiplying X with X transpose, or X tranpose with X? 23:19 To answer this question, you can go 23:20 the easy route, which says, well, my covariance matrix is 23:23 of size, what? 23:24 What is the size of S? 23:27 AUDIENCE: d by d. 23:28 PHILIPPE RIGOLLET: d by d, OK? 23:30 X is of size n by d. 23:34 So if I do X times X transpose, I'm 23:35 going to have something which is of size n by n. 23:37 If I do X transpose X, I'm going to have 23:39 something which is d by d. 23:40 That's the easy route. 23:41 And there's basically one of the two guys. 23:44 You can actually open the box a little bit 23:46 and see what's going on in there. 23:48 If you do X transpose X, which we know gives you a d by d, 23:52 you'll see that X is going to have vectors that 23:54 are of the form, Xi, and X transpose 23:57 is going to have vectors that are of the form, Xi transpose, 24:02 right? 24:03 And so, this is actually probably the right way to go. 24:06 So let's look at what's X transpose X is giving us. 24:11 So I claim that it's actually going to give us what we want, 24:16 but rather than actually going there, let's-- 24:19 to actually-- I mean, we could check it entry by entry, 24:22 but there's actually a nice thing we can do. 24:25 Before we go there, let's write X transpose 24:28 as the following sum of variables, X1 and then 24:33 just a bunch of 0's everywhere else. 24:36 So it's still d by n. 24:39 So n minus 1 of the columns are equal to 0 here. 24:42 Then I'm going to put a 0 and then put X2. 24:45 And then just a bunch of 0's, right? 24:48 So that's just 0, 0 plus 0, 0, all the way to Xn, OK? 24:59 Everybody agrees with it? 25:01 See what I'm doing here? 25:03 I'm just splitting it into a sum of matrices that 25:06 only have one nonzero columns. 25:08 But clearly, that's true. 25:11 Now let's look at the product of this guy with itself. 25:15 So, let's call these matrices M1, M2, Mn. 25:23 25:26 So when I do X transpose X, what I 25:30 do is the sum of the Mi's for i equal 1 to n, 25:37 times the sum of the Mi transpose, right? 25:48 Now, the sum of the Mi's transpose 25:50 is just the sum of each of the Mi's transpose, OK? 25:55 25:58 So now I just have this product of two sums, 26:00 so I'm just going to re-index the second one by j. 26:03 So this is sum for i equal 1 to n, j equal 1 to n of Mi 26:12 Mj transpose. 26:15 OK? 26:16 26:19 And now what we want to notice is 26:20 that if i is different from j, what's happening? 26:26 Well if i is different from j, let's look at say, M1 times XM2 26:34 transpose. 26:35 26:54 So what is the product between those two matrices? 26:56 27:04 AUDIENCE: It's a new entry and [INAUDIBLE] 27:09 PHILIPPE RIGOLLET: There's an entry? 27:11 AUDIENCE: Well, it's an entry. 
27:12 It's like a dot product in that form next to [? transpose. ?] 27:17 PHILIPPE RIGOLLET: You mean a dot product is just getting 27:19 [INAUDIBLE] number, right? 27:20 So I want-- this is going to be a matrix. 27:22 It's the product of two matrices, right? 27:24 This is a matrix times a matrix. 27:27 So this should be a matrix, right, of size d by d. 27:31 27:35 Yeah, I should see a lot of hands 27:37 that look like this, right? 27:39 Because look at this. 27:40 So let's multiply the first-- 27:42 let's look at what's going on in the first column here. 27:45 I'm multiplying this column with each of those rows. 27:48 The only nonzero coefficient is here, 27:50 and it only hits this column of 0's. 27:54 So every time, this is going to give you 0, 0, 0, 0. 27:57 And it's going to be the same for every single one of them. 28:00 So this matrix is just full of 0's, right? 28:04 They never hit each other when I do 28:06 the matrix-matrix multiplication. 28:08 There's no-- every non-zero hits a 0. 28:11 So what it means is-- and this, of course, 28:13 you can check for every i different from j. 28:16 So this means that Mi times Mj transpose is actually 28:22 equal to 0 when i is different from j, Right? 28:27 Everybody is OK with this? 28:29 So what that means is that when I do this double sum, really, 28:32 it's a simple sum. 28:33 There's only just the sum from i equal 1 28:37 to n of Mi Mi transpose. 28:41 Because this is the only terms in this double sum 28:44 that are not going to be 0 when [INAUDIBLE] [? M1 ?] with M1 28:48 itself. 28:50 Now, let's see what's going on when 28:51 I do M1 times M1 transpose. 28:53 Well, now, if I do Mi times and Mi transpose, 28:57 now this guy becomes [? X1 ?] [INAUDIBLE] it's here. 29:00 And so now, I really have X1 times X1 transpose. 29:03 So this is really just the sum from i 29:06 equal 1 to n of Xi Xi transpose, just because Mi Mi transpose 29:20 is Xi Xi transpose. 29:21 There's nothing else there. 29:22 29:26 So that's the good news, right? 29:28 This term here is really just X transpose X divided by n. 29:37 29:43 OK, I can use that guy again, I guess. 29:45 Well, no. 29:46 Let's just-- OK, so let me rewrite S. 30:08 All right, that's the definition we have. 30:10 And we know that this guy already is equal to 1 over n X 30:14 transpose X. x bar x bar transpose-- 30:20 we know that x bar-- we just proved that x bar-- 30:25 sorry, little x bar was equal to 1 30:31 over n X bar transpose times the all-ones vector. 30:36 So I'm just going to do that. 30:37 So that's just going to be minus. 30:39 I'm going to pull my two 1 over n's-- 30:40 one from this guy, one from this guy. 30:42 So I'm going to get 1 over n squared. 30:44 And then I'm going to get X bar-- 30:47 sorry, there's no X bar here. 30:48 It's just X. Yeah. 30:50 X transpose all ones times X transpose all ones transpose, 30:59 right? 31:00 31:04 And X transpose all ones transpose-- 31:07 31:11 right, the rule-- if I have A times B transpose, 31:14 it's B transpose times A transpose, right? 31:16 31:23 That's just the rule of transposition. 31:25 So this is 1 transpose X transpose. 31:31 And so when I put all these guys together, 31:34 this is actually equal to 1 over n X transpose X minus one 31:38 over n squared X transpose 1, 1 transpose X. Because X 31:47 transpose transposes X, OK? 31:50 31:53 So now, I can actually-- 31:55 I have something which is of the form, X transpose X-- 31:59 [INAUDIBLE] to the left, X transpose; to the right, X. 
32:01 Here, I have X transpose to the left, X to the right. 32:04 So it can factor out whatever's in there. 32:07 So I can write S as 1 over n-- 32:11 sorry, X transpose times 1 over n times the identity of Rd. 32:17 32:21 And then I have minus 1 over n, 1, 1 transpose X. 32:33 OK, because if you-- 32:34 I mean, you can distribute it back, right? 32:36 So here, I'm going to get what? 32:38 X transpose identity times X, the whole thing divided by n. 32:41 That's this term. 32:42 And then the second one is going to be-- sorry, 1 over n 32:45 squared. 32:46 And then I'm going to get 1 over n squared times X transpose 1, 32:50 1 transpose which is this guy, times X, 32:53 and that's the [? right ?] [? thing, ?] OK? 32:58 So, the way it's written, I factored out one of the 1 over 33:01 n's. 33:02 So I'm just going to do the same thing as on this slide. 33:05 So I'm just factoring out this 1 over n here. 33:08 So it's 1 over n times X transpose identity 33:16 of our d divided by n divided by 1 this time, 33:21 minus 1 over n 1, 1 transpose times X, OK? 33:26 So that's just what's on the slides. 33:28 33:31 What does the matrix, 1, 1 transpose, look like? 33:35 AUDIENCE: All 1's. 33:36 PHILIPPE RIGOLLET: It's just all 1's, right? 33:38 Because the entries are the products of the all-ones-- 33:41 of the coordinates of the all-ones vectors with 33:42 the coordinates of the all-ones vectors, so I only get 1's. 33:45 So it's a d by d matrix with only 1's. 33:49 So this matrix, I can actually write exactly, right? 33:52 H, this matrix that I called H which 33:55 is what's sandwiched in-between this X transpose and X. 33:59 By definition, I said this is the definition of H. Then 34:02 this thing, I can write its coordinates exactly. 34:06 34:18 We know it's identity divided by n minus-- 34:23 sorry, I don't know why I keep [INAUDIBLE].. 34:25 Minus 1 over n 1, 1 transpose-- 34:29 so it's this matrix with the only 1's 34:30 on the diagonals and 0's and elsewhere-- minus a matrix that 34:34 only has 1 over n everywhere. 34:36 34:41 OK, so the whole thing is 1 minus 1 over n on the diagonals 34:49 and then minus 1 over n here, OK? 34:57 And now I claim that this matrix is an orthogonal projector. 35:01 Now, I'm writing this, but it's completely useless. 35:05 This is just a way for you to see that it's actually very 35:08 convenient now to think about this problem 35:11 as being a matrix problem, because things 35:14 are much nicer when you think about the actual form 35:17 of your matrices, right? 35:18 They could tell you, here is the matrix. 35:21 I mean, imagine you're sitting at a midterm, 35:23 and I say, here's the matrix that has 1 minus 1 35:25 over n on the diagonals and minus 1 over n 35:28 on the [INAUDIBLE] diagonal. 35:30 Prove to me that it's a projector matrix. 35:32 You're going to have to basically 35:34 take this guy times itself. 35:35 It's going to be really complicated, right? 35:37 So we know it's symmetric. 35:38 That's for sure. 35:39 But the fact that it has this particular way 35:42 of writing it is going to make my life 35:44 super easy to check this. 35:45 That's the definition of a projector. 35:47 It has to be symmetric and it has 35:48 to square to itself because we just 35:51 said in the chapter on linear regression 35:54 that once you project, if you apply the projection again, 35:57 you're not moving because you're already there. 35:59 OK, so why is H squared equal to H? 36:04 Well let's just write H square. 
36:05 It's the identity minus 1 over n 1, 1 36:09 transpose times the identity minus 1 over n 1, 1 36:16 transpose, right? 36:19 Let's just expand this now. 36:22 This is equal to the identity minus-- 36:25 well, the identity times 1, 1 transpose is just the identity. 36:29 So it's 1, 1 transpose, sorry. 36:31 So 1 over n 1, 1 transpose minus 1 over n 1, 1 transpose. 36:38 And then there's going to be what 36:40 makes the deal is that I get this 1 over n 36:42 squared this time. 36:44 And then I get the product of 1 over n trans-- 36:46 oh, let's write it completely. 36:48 I get 1, 1 transpose times 1, 1 transpose, OK? 36:58 But this thing here-- 37:01 what is this? 37:03 n, right, is the end product of the all-ones vector 37:06 with the all-ones vector. 37:07 So I'm just summing n times 1 squared, which is n. 37:10 So this is equal to n. 37:11 So I pull it out, cancel one of the ends, 37:13 and I'm back to what I had before. 37:15 So I had identity minus 2 over n 1, 1 transpose plus 1 37:21 over n 1, 1 transpose which is equal to H. 37:27 Because one of the 1 over n's cancel, OK? 37:30 37:36 So it's a projection matrix. 37:37 It's projecting onto some linear space, right? 37:41 It's taking a matrix. 37:42 Sorry, it's taking a vector and it's 37:44 projecting onto a certain space of vectors. 37:46 37:49 What is this space? 37:50 37:53 Right, so, how do you-- so I'm only 37:54 asking the answer to this question in words, right? 37:57 So how would you describe the vectors 37:59 onto which this matrix is projecting? 38:02 Well, if you want to answer this question, 38:05 the way you would tackle it is first by saying, OK, 38:07 what does a vector which is of the form, H times something, 38:13 look like, right? 38:14 What can I say about this vector that's 38:16 going to be definitely giving me something 38:19 about the space on which it projects? 38:21 I need to know a little more to know that it projects exactly 38:24 onto this. 38:25 But one way we can do this is just 38:29 see how it acts on a vector. 38:30 What does it do to a vector to apply H, right? 38:32 So I take v. And let's see what taking v and applying H to it 38:44 looks like. 38:46 Well, it's the identity minus something. 38:48 So it takes v and it removes something 38:50 from v. What does it remove? 38:54 Well, it's 1 over n times v transpose 1 times 39:00 the all-ones vector, right? 39:03 Agreed? 39:04 I just wrote v transpose 1 instead of 1 transpose v, 39:13 which are the same thing. 39:16 What is this thing? 39:17 39:25 What should I call it in mathematical notation? 39:27 39:30 v bar, right? 39:31 I should all it v bar because this is exactly the average 39:35 of the entries of v, agreed? 39:38 This is summing the entries of v's, and this is dividing 39:41 by the number of those v's. 39:43 Sorry, now v is in our-- 39:44 39:49 sorry, why do I divide by-- 39:51 39:53 I'm just-- OK, I need to check what my dimensions are now. 39:59 No, it's in Rd, right? 40:00 So why do I divide by n? 40:02 40:05 So it's not really v bar. 40:07 It's the sum of the v's divided by-- 40:13 right, so it's v bar. 40:14 40:24 AUDIENCE: [INAUDIBLE] 40:25 [INTERPOSING VOICES] 40:25 AUDIENCE: Yeah, v has to be [INAUDIBLE] 40:27 PHILIPPE RIGOLLET: Oh, yeah. 40:29 OK, thank you. 40:31 So everywhere I wrote Hd, that was actually Hn. 40:34 Oh, man. 40:35 I wish I had a computer now. 40:37 All right. 40:37 So-- yeah, because the-- 40:43 yeah, right? 
40:43 So why it's not-- 40:45 well, why I thought it was this is because I was thinking 40:48 about the outer dimension of X, really 40:49 of X transpose, which is really the inner dimension, 40:51 didn't matter to me, right? 40:52 So the thing that I can sandwich between X transpose 40:55 and X has to be n by n. 40:56 So this was actually n by n. 40:58 And so that's actually n by n. 41:00 Everything is n by n. 41:03 Sorry about that. 41:04 41:08 So this is n. 41:09 This is n. 41:10 This is-- well, I didn't really tell you 41:12 what the all-ones vector was, but it's also in our n. 41:16 Yeah, OK. 41:18 41:22 Thank you. 41:23 And n-- actually, I used the fact that this was of size n 41:27 here already. 41:28 41:31 OK, and so that's indeed v bar. 41:33 41:38 So what is this projection doing to a vector? 41:40 41:47 It's removing its average on each coordinate, right? 41:51 And the effect of this is that v is a vector. 41:54 What is the average of Hv? 41:58 AUDIENCE: 0. 41:59 PHILIPPE RIGOLLET: Right, so it's 0. 42:00 It's the average of v, which is v bar, minus the average 42:04 of something that only has v bar's entry, which is v bar. 42:07 So this thing is actually 0. 42:08 42:11 So let me repeat my question. 42:12 Onto what subspace does H project? 42:15 42:22 Onto the subspace of vectors that have mean 0. 42:26 A vector that has mean 0 is a vector. 42:30 So if you want to talk more linear algebra, v bar-- 42:34 for a vector you have mean 0, it means 42:36 that v is orthogonal to the span of the all-ones vector. 42:43 That's it. 42:44 It projects to this space. 42:46 So in words, it projects onto the space 42:47 of vectors that have 0 mean. 42:49 In linear algebra, it says it projects 42:52 onto the hyperplane which is orthogonal 42:55 to the all-ones vector, OK? 42:58 So that's all. 43:01 Can you guys still see the screen? 43:04 Are you good over there? 43:05 OK. 43:07 All right, so now, what it means is that, well, I'm 43:12 doing this weird thing, right? 43:13 I'm taking the inner product-- 43:15 so S is taking X. And then it's removing its mean of each 43:20 of the columns of X, right? 43:21 When I take H times X, I'm basically applying this 43:24 projection which consists in removing the mean of all 43:26 the X's. 43:28 And then I multiply by H transpose. 43:31 But what's actually nice is that, remember, 43:33 H is a projector. 43:35 Sorry, I don't want to keep that. 43:38 Which means that when I look at X transpose HX, 43:47 it's the same as looking at X transpose H squared X. 43:52 But since H is equal to its transpose, 43:54 this is actually the same as looking at X transpose H 43:58 transpose HX, which is the same as looking at HX transpose 44:07 HX, OK? 44:11 So what it's doing, it's first applying this projection 44:14 matrix, H, which removes the mean of each of your columns, 44:18 and then looks at the inner products between those guys, 44:23 right? 44:23 Each entry of this guy is just the covariance 44:25 between those centered things. 44:27 That's all it's doing. 44:28 All right, so those are actually going to be the key statements. 44:35 So everything we've done so far is really 44:37 mainly linear algebra, right? 44:38 I mean, looking at expectations and covariances was just-- 44:41 we just used the fact that the expectation was linear. 44:44 We didn't do much. 44:45 But now there's a nice thing that's happening. 
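Since several matrix identities have just been assembled (X bar = (1/n) X^T 1, the two forms of S, and the matrix H sandwiched between X^T and X), here is a small numerical sanity check, a sketch assuming Python/NumPy with made-up data rather than anything from the course materials.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4
X = rng.normal(size=(n, d))                 # hypothetical n-by-d data matrix

H = np.eye(n) - np.ones((n, n)) / n         # H = I_n - (1/n) 1 1^T

# H is an orthogonal projector: symmetric, and H @ H equals H.
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)

# H v removes the average of v from every coordinate, so H v has mean 0.
v = rng.normal(size=n)
assert np.allclose(H @ v, v - v.mean())
assert np.isclose((H @ v).mean(), 0.0)

# Sample covariance: (1/n) X^T H X = (1/n) (HX)^T (HX), which matches the
# "center each column, then average the outer products" form.
S = X.T @ H @ X / n
Xc = X - X.mean(axis=0)                     # same effect as H @ X
assert np.allclose(S, Xc.T @ Xc / n)
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))  # NumPy's 1/n version
print(np.round(S, 3))
```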
44:47 And that's why we're going to switch 44:50 from the language of linear algebra 44:51 to more statistical, because what's happening 44:53 is that if I look at this quadratic form, right? 44:57 So I take sigma. 44:59 So I take a vector, u. 45:00 45:03 And I'm going to look at u-- so let's say, in Rd. 45:09 And I'm going to look at u transpose sigma u. 45:14 OK? 45:15 45:18 What is this doing? 45:19 Well, we know that u transpose sigma u is equal to what? 45:24 Well, sigma is the expectation of XX transpose 45:31 minus the expectation of X expectation of X transpose, 45:35 right? 45:36 45:39 So I just substitute in there. 45:40 45:46 Now, u is deterministic. 45:49 So in particular, I can push it inside the expectation 45:52 here, agreed? 45:55 And I can do the same from the right. 45:57 So here, when I push u transpose here, and u here, 46:00 what I'm left with is the expectation of u transpose X 46:06 times X transpose u. 46:09 OK? 46:11 And now, I can do the same thing for this guy. 46:14 And this tells me that this is the expectation of u transpose 46:17 X times the expectation of X transpose u. 46:21 46:24 Of course, u transpose X is equal to X transpose u. 46:29 And u-- yeah. 46:31 So what it means is that this is actually 46:33 equal to the expectation of u transpose X squared 46:43 minus the expectation of u transpose X, 46:48 the whole thing squared. 46:49 46:56 But this is something that should look familiar. 46:58 This is really just the variance of this particular random 47:01 variable which is of the form, u transpose X, 47:03 right? u transpose X is a number. 47:06 It involves a random vector, so it's a random variable. 47:10 And so it has a variance. 47:11 And this variance is exactly given by this formula. 47:15 So this is just the variance of u transpose X. 47:19 So what we've proved is that if I look at this guy, 47:21 this is really just the variance of u transpose X, OK? 47:29 47:37 I can do the same thing for the sample variance. 47:40 So let's do this. 47:41 47:48 And as you can see, spoiler alert, 47:52 this is going to be the sample variance. 47:56 47:59 OK, so remember, S is 1 over n, sum of Xi Xi transpose minus X 48:09 bar X bar transpose. 48:12 So when I do u transpose, Su, what 48:16 it gives me is 1 over n sum from i equal 1 48:19 to n of u transpose Xi times Xi transpose u, all right? 48:25 So those are two numbers that multiply each other 48:27 and that happen to be equal to each other, 48:30 minus u transpose X bar X bar transpose u, 48:36 which is also the product of two numbers that happen 48:38 to be equal to each other. 48:39 So I can rewrite this with squares. 48:41 48:55 So we're almost there. 48:57 All I need to know to check is that this thing is actually 49:00 the average of those guys, right? 49:02 So u transpose X bar. 49:04 What is it? 49:05 It's 1 over n sum from i equal 1 to n of u transpose Xi. 49:10 So it's really something that I can write as u transpose X bar, 49:17 right? 49:17 That's the average of those random variables 49:19 of the form, u transpose Xi. 49:21 49:23 So what it means is that u transpose Su, I can write as 1 49:29 over n sum from i equal 1 to n of u transpose Xi squared 49:38 minus u transpose X bar squared, which 49:46 is the empirical variance that we need noted by small 49:51 s squared, right? 49:54 So that's the empirical variance of u transpose X1 all the way 50:06 to u transpose Xn. 50:08 50:12 OK, and here, same thing. 50:13 I use exactly the same thing. 
50:15 I just use the fact that here, the only thing I use is really 50:17 the linearity of this guy, of 1 over n sum 50:20 or the linearity of expectation, that I can push things 50:24 in there, OK? 50:26 50:30 AUDIENCE: So what you have written 50:31 at the end of that sum for uT Su? 50:33 PHILIPPE RIGOLLET: This one? 50:35 AUDIENCE: Yeah. 50:35 PHILIPPE RIGOLLET: Yeah, I said it's equal to small s, 50:37 and I want to make a difference between the big S 50:39 that I'm using here. 50:40 So this is equal to small-- 50:42 I don't know, I'm trying to make it look 50:45 like a calligraphic s squared. 50:47 50:56 OK, so this is nice, right? 51:00 This covariance matrix-- so let's look at capital sigma 51:04 itself right now. 51:05 This covariance matrix, we know that if we 51:07 read its entries, what we get is the covariance 51:11 between the coordinates of the X's, right, 51:15 of the random vector, X. And the coordinates, well, 51:19 by definition, are attached to a coordinate system. 51:22 So I only know what the covariance 51:25 of X in of those two things are, or the covariance of those two 51:30 things are. 51:31 But what if I want to find coordinates between linear 51:33 combination of the X's? 51:35 Sorry, if I want to find covariances between linear 51:37 combination of those X's. 51:38 And that's exactly what this allows me to do. 51:40 It says, well, if I pre- and post-multiply by u, 51:44 this is actually telling me what the variance 51:47 of X along direction u is, OK? 51:51 So there's a lot of information in there, 51:53 and it's just really exploiting the fact 51:55 that there is some linearity going on in the covariance. 52:00 So, why variance? 52:02 Why is variance interesting for us, right? 52:03 Why? 52:04 I started by saying, here, we're going 52:05 to be interested in having something 52:07 to do dimension reduction. 52:08 We have-- think of your points as [? being in a ?] dimension 52:10 larger than 4, and we're going to try to reduce the dimension. 52:13 So let's just think for one second, 52:15 what do we want about a dimension reduction procedure? 52:19 If I have all my points that live in, say, three dimensions, 52:23 and I have one point here and one point here 52:25 and one point here and one point here and one point here, 52:28 and I decide to project them onto some plane-- 52:30 that I take a plane that's just like this, what's 52:32 going to happen is that those points are all going to project 52:34 to the same point, right? 52:36 I'm just going to not see anything. 52:38 However, if I take a plane which is like this, 52:40 they're all going to project into some nice line. 52:42 Maybe I can even project them onto a line 52:44 and they will still be far apart from each other. 52:47 So that's what you want. 52:48 You want to be able to say, when I take my points 52:51 and I say I project them onto lower dimensions, 52:54 I do not want them to collapse into one single point. 52:57 I want them to be spread as possible in the direction 53:00 on which I project. 53:02 And this is what we're going to try to do. 53:04 And of course, measuring spread between points 53:06 can be done in many ways, right? 53:08 I mean, you could look at, I don't know, 53:09 sum of pairwise distances between those guys. 53:12 You could look at some sort of energy. 53:14 You can look at many ways to measure 53:16 of spread in a direction. 53:18 But variance is a good way to measure 53:19 of spread between points. 
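To tie the quadratic form back to the spread of the projected points, here is one more small check, again a sketch with NumPy and made-up data: u^T S u coincides with the empirical variance of the n numbers u^T X_1, ..., u^T X_n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # hypothetical correlated data

# Sample covariance with the 1/n convention used in the lecture.
Xbar = X.mean(axis=0)
S = (X - Xbar).T @ (X - Xbar) / n

# A unit-norm direction u.
u = rng.normal(size=d)
u /= np.linalg.norm(u)

# u^T S u equals the empirical variance of the projections u^T X_i.
proj = X @ u                                    # the n numbers u^T X_1, ..., u^T X_n
emp_var = np.mean(proj ** 2) - np.mean(proj) ** 2
assert np.isclose(u @ S @ u, emp_var)
print(u @ S @ u, emp_var)
```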
53:21 If you have a lot of variance between your points, 53:23 then chances are they're going to be spread. 53:25 Now, this is not always the case, right? 53:27 If I have a direction in which all my points are clumped 53:30 onto one big point and one other big point, 53:33 it's going to choose this because that's 53:34 the direction that has a lot of variance. 53:37 But hopefully, the variance is going 53:39 to spread things out nicely. 53:41 So the idea of principal component analysis 53:47 is going to try to identify those variances-- 53:51 those directions along which we have a lot of variance. 53:55 Reciprocally, we're going to try to eliminate 53:57 the directions along which we do not have a lot of variance, OK? 54:01 And let's see why. 54:02 Well, if-- so here's the first claim. 54:08 If you transpose Su is equal to 0, what's happening? 54:14 Well, I know that an empirical variance is equal to 0. 54:17 What does it mean for an empirical variance 54:18 to be equal to 0? 54:22 So I give you a bunch of points, right? 54:23 So those points are those points-- u transpose 54:26 X1, u transpose-- those are a bunch of numbers. 54:29 What does it mean to have the empirical variance 54:31 of those points being equal to 0? 54:33 AUDIENCE: They're all the same. 54:34 PHILIPPE RIGOLLET: They're all the same. 54:36 So what it means is that when I have my points, right? 54:43 So, can you find a direction for those points in which they 54:46 project to all the same point? 54:48 54:51 No, right? 54:52 There's no such thing. 54:53 For this to happen, you have to have your points which 54:55 are perfectly aligned. 54:57 And then when you're going to project 54:59 onto the orthogonal of this guy, they're 55:01 going to all project to the same point 55:03 here, which means that the empirical variance is 55:06 going to be 0. 55:08 Now, this is an extreme case. 55:10 This will never happen in practice, 55:11 because if that happens, well, I mean, 55:13 you can basically figure that out very quickly. 55:16 So in the same way, it's very unlikely 55:21 that you're going to have u transpose sigma u, which 55:23 is equal to 0, which means that, essentially, all 55:26 your points are [INAUDIBLE] or let's say all of them 55:28 are orthogonal to u, right? 55:30 So it's exactly the same thing. 55:31 It just says that in the population case, 55:33 there's no probability that your points deviate from this guy 55:36 here. 55:37 This happens with zero probability, OK? 55:41 And that's just because if you look 55:42 at the variance of this guy, it's going to be 0. 55:46 And then that means that there's no deviation. 55:48 By the way, I'm using the name projection 55:51 when I talk about u transpose X, right? 55:55 So let's just be clear about this. 55:59 If you-- so let's say I have a bunch of points, 56:04 and u is a vector in this direction. 56:06 And let's say that u has the-- 56:08 so this is 0. 56:10 This is u. 56:10 And let's say that u has norm, 1, OK? 56:17 When I look, what is the coordinate of the projection? 56:21 So what is the length of this guy here? 56:23 Let's call this guy X1. 56:25 What is the length of this guy? 56:26 56:31 In terms of inner products? 56:32 56:35 This is exactly u transpose X1. 56:39 This length here, if this is X2, this 56:42 is exactly u transpose X2, OK? 56:46 So those-- u transpose X measure exactly the distance 56:52 to the origin of those-- 56:55 I mean, it's really-- 56:58 think of it as being just an x-axis thing. 57:00 You just have a bunch of points. 
57:02 You have an origin. 57:02 And it's really just telling you what 57:04 the coordinate on this axis is going to be, right? 57:07 So in particular, if the empirical variance is 0, 57:10 it means that all these points project 57:12 to the same point, which means that they have 57:14 to be orthogonal to this guy. 57:16 And you can think of it as being also maybe an entire plane 57:19 that's orthogonal to this line, OK? 57:23 So that's why I talk about projection, 57:26 because the inner products, u transpose X, 57:29 is really measuring the coordinates of X 57:36 when u becomes the x-axis. 57:39 Now, if u does not have norm 1, then you just 57:42 have a change of scale here. 57:44 You just have a change of unit, right? 57:46 So this is really u times X1. 57:51 The coordinates should really be divided by the norm of u. 57:54 57:59 OK, so now, just in the same way-- so 58:04 we're never going to have exactly 0. 58:07 But if we [INAUDIBLE] the other end, 58:08 if u transpose Su is large, what does it mean? 58:12 58:14 It means that when I look at my points 58:17 as projected onto the axis generated by u, 58:22 they're going to have a lot of variance. 58:23 They're going to be far away from each other in average, 58:25 right? 58:26 That's what large variance means, or at least 58:28 large empirical variance means. 58:31 And same thing for u. 58:34 So what we're going to try to find 58:36 is a u that maximizes this. 58:39 If I can find a u that maximizes this 58:42 so I can look in every direction, 58:44 and suddenly I find a direction in which the spread is massive, 58:48 then that's a point on which I'm basically 58:50 the less likely to have my points 58:52 project onto each other and collide, right? 58:54 At least I know they're going to project 58:56 at least onto two points. 58:59 So the idea now is to say, OK, let's try 59:02 to maximize this spread, right? 59:04 So we're going to try to find the maximum over all u's 59:09 of u transpose Su. 59:12 And that's going to be the direction that maximizes 59:15 the empirical variance. 59:15 Now of course, if I read it like that for all u's in Rd, 59:22 what is the value of this maximum? 59:23 59:28 It's infinity, right? 59:29 Because I can always multiply u by 10, 59:32 and this entire thing is going to multiplied by 100. 59:34 So I'm just going to take u as large as I want, 59:36 and this thing is going to be as large as I want, 59:38 and so I need to constrain u. 59:40 And as I said, I need to have u of size 1 59:42 to talk about coordinates in the system generated 59:45 by u like this. 59:47 So I'm just going to constrain u to have 59:50 Euclidean norm equal to 1, OK? 59:55 So that's going to be my goal-- trying 59:57 to find the largest possible u transpose Su, 60:01 or in other words, empirical variance of the points 60:03 projected onto the direction u when u is of norm 1, 60:07 which justifies to use the word, "direction," 60:11 and because there's no magnitude to this u. 60:12 60:17 OK, so how am I going to do this? 60:22 I could just fold and say, let's just optimize 60:25 this thing, right? 60:26 Let's just take this problem. 60:28 It says maximize a function onto some constraints. 60:32 Immediately, the constraint is sort of nasty. 60:34 I'm on a sphere, and I'm trying to move points on the sphere. 60:37 And I'm maximizing this thing which 60:38 actually happens to be convex. 60:40 And we know we know how to minimize convex functions, 60:42 but maximize them is a different question. 60:45 And so this problem might be super hard. 
60:47 So I can just say, OK, here's what 60:49 I want to do, and let me give that to an optimizer 60:52 and just hope that the optimizer can solve this problem for me. 60:56 That's one thing we can do. 60:57 Now as you can imagine, PCA is so well spread, right? 61:00 Principal component analysis is something 61:01 that people do constantly. 61:03 And so that means that we know how to do this fast. 61:06 So that's one thing. 61:07 The other thing that you should probably question about why-- 61:10 if this thing is actually difficult, why in the world 61:13 would you even choose the variance as a measure of spread 61:16 if there's so many measures of spread, right? 61:19 The variance is one measure of spread. 61:21 It's not guaranteed that everything 61:22 is going to project nicely far apart from each other. 61:26 So we could choose the variance, but we 61:27 could choose something else. 61:28 If the variance does not help, why choose it? 61:30 Turns out the variance helps. 61:32 So this is indeed a non-convex problem. 61:35 I'm maximizing, so it's actually the same. 61:38 I can make this constraint convex 61:41 because I'm maximizing a convex function, 61:43 so it's clear that the maximum is going 61:45 to be attained at the boundary. 61:47 So I can actually just fill this ball into some convex ball. 61:51 However, I'm still maximizing, so this 61:53 is a non-convex problem. 61:55 And this turns out to be the fanciest non-convex problem 61:57 we know how to solve. 61:59 And the reason why we know how to solve it 62:00 is not because of optimization or using gradient-type things 62:04 or anything of the algorithms that I mentioned 62:06 during the maximum likelihood. 62:09 It's because of linear algebra. 62:11 Linear algebra guarantees that we know how to solve this. 62:13 And to understand this, we need to go a little deeper 62:17 in linear algebra, and we need to understand the concept 62:22 of diagonalization of a matrix. 62:24 So who has ever seen the concept of an eigenvalue? 62:29 Oh, that's beautiful. 62:30 And if you're not raising your hand, 62:31 you're just playing "Candy Crush," right? 62:33 All right, so, OK. 62:35 62:44 This is great. 62:46 Everybody's seen it. 62:48 For my live audience of millions, maybe you have not, 62:51 so I will still go through it. 62:53 All right, so one of the basic facts-- 62:58 and I remember when I learned this in-- 63:02 I mean, when I was an undergrad, I 63:04 learned about the spectral decomposition 63:05 and this diagonalization of matrices. 63:07 And for me, it was just a structural property 63:09 of matrices, but it turns out that it's extremely useful, 63:11 and it's useful for algorithmic purposes. 63:13 And so what this theorem tells you 63:14 is that if you take a symmetric matrix-- 63:16 63:22 well, with real entries, but that 63:24 really does not matter so much. 63:28 And here, I'm going to actually-- 63:30 so I take a symmetric matrix, and actually S and sigma 63:33 are two such symmetric matrices, right? 63:36 Then there exists P and D, which are both-- 63:44 so let's say d by d. 63:47 Which are both d by d such that P is orthogonal. 63:55 63:58 That means that P transpose P is equal to PP transpose 64:02 is equal to the identity. 64:06 And D is diagonal. 64:07 64:11 And sigma, let's say, is equal to PDP transpose, OK? 64:20 So it's a diagonalization because it's 64:22 finding a nice transformation. 64:23 P has some nice properties. 64:25 It's really just the change of coordinates in which 64:28 your matrix is diagonal, right? 
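For concreteness, here is what the decomposition in this theorem looks like numerically, a sketch assuming Python/NumPy and a made-up symmetric matrix; np.linalg.eigh is one standard routine that produces the eigenvalues and an orthogonal P for symmetric inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
A = rng.normal(size=(d, d))
Sigma = A @ A.T                          # a symmetric, positive semidefinite matrix

# Spectral theorem: Sigma = P D P^T with P orthogonal and D diagonal.
# eigh returns eigenvalues in ascending order and eigenvectors as columns of P.
lam, P = np.linalg.eigh(Sigma)
D = np.diag(lam)

assert np.allclose(P @ P.T, np.eye(d))   # P orthogonal: P P^T = I
assert np.allclose(P.T @ P, np.eye(d))   # and P^T P = I
assert np.allclose(P @ D @ P.T, Sigma)   # the decomposition reconstructs Sigma
assert np.all(lam >= -1e-10)             # eigenvalues nonnegative: Sigma is PSD
print(np.round(lam, 3))
```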
64:31 And the way you want to see this-- 64:32 and I think it sort of helps to think about this problem 64:35 as being-- 64:36 sigma being a covariance matrix. 64:38 What does a covariance matrix tell you? 64:39 Think of a multivariate Gaussian. 64:41 Can everybody visualize a three-dimensional Gaussian 64:43 density? 64:45 Right, so it's going to be some sort of a bell-shaped curve, 64:48 but it might be more elongated in one direction than another. 64:51 And then going to chop it like that, all right? 64:54 So I'm going to chop it off. 64:56 And I'm going to look at how it bleeds, all right? 65:00 So I'm just going to look at where the blood is. 65:02 And what it's going to look at-- 65:03 it's going to look like some sort of ellipsoid, right? 65:08 In high dimension, it's just going to be an olive. 65:11 And that is just going to be bigger and bigger. 65:13 And then I chop it off a little lower, 65:16 and I get something a little bigger like this. 65:20 And so it turns out that sigma is capturing exactly this, 65:23 right? 65:23 The matrix sigma-- so the center of your covariance matrix 65:27 of your Gaussian is going to be this thing. 65:29 And sigma is going to tell you which direction it's elongated. 65:33 And so in particular, if you look, if you knew an ellipse, 65:36 you know there's something called principal axis, right? 65:38 So you could actually define something 65:39 that looks like this, which is this axis, the one along which 65:43 it's the most elongated. 65:44 Then the axis along which is orthogonal to it, 65:47 along which it's slightly less elongated, 65:49 and you go again and again along the orthogonal ones. 65:52 It turns out that those things here 65:56 is the new coordinate system in which this transformation, P 65:59 and P transpose, is putting you into. 66:03 And D has entries on the diagonal 66:06 which are exactly this length and this length, right? 66:09 So that's just what it's doing. 66:11 It's just telling you, well, if you 66:12 think of having this Gaussian or this high-dimensional 66:16 ellipsoid, it's elongated along certain directions. 66:19 And these directions are actually maybe not well aligned 66:23 with your original coordinate system, which might just 66:25 be the usual one, right-- 66:27 north, south, and east, west. 66:29 Maybe I need to turn it. 66:30 And that's exactly what this orthogonal transformation is 66:33 doing for you, all right? 66:36 So, in a way, this is actually telling you even more. 66:39 It's telling you that any matrix that's symmetric, 66:41 you can actually turn it somewhere. 66:45 And that'll start to dilate things in the directions 66:47 that you have, and then turn it back 66:49 to what you originally had. 66:50 And that's actually exactly the effect 66:53 of applying a symmetric matrix through a vector, right? 66:57 And it's pretty impressive. 66:58 It says if I take sigma times v. Any sigma that's 67:04 of this form, what I'm doing is-- that's symmetric. 67:07 What I'm really doing to v is I'm 67:09 changing its coordinate system, so I'm rotating it. 67:12 Then I'm changing-- I'm multiplying its coordinates, 67:14 and then I'm rotating it back. 67:16 That's all it's doing, and that's 67:18 what all symmetric matrices do, which 67:21 means that this is doing a lot. 67:24 All right, so OK. 67:27 So, what do I know? 67:29 So I'm not going to prove that this is 67:30 the so-called spectral theorem. 67:32 67:39 And the diagonal entries of D is of the form, lambda 1, 67:45 lambda 2, lambda d, 0, 0. 
67:49 And the lambda j's are called eigenvalues of D. 68:01 Now in general, those numbers can be positive, negative, 68:05 or equal to 0. 68:06 But here, I know that sigma and S are-- 68:12 well, they're symmetric for sure, 68:15 but they are positive semidefinite. 68:17 68:23 What does it mean? 68:25 It means that when I take u transpose sigma u for example, 68:30 this number is always non-negative. 68:33 68:35 Why is this true? 68:36 68:42 What is this number? 68:43 68:47 It's the variance of-- and actually, I don't even 68:49 need to finish this sentence. 68:51 As soon as I say that this is a variance, well, 68:53 it has to be non-negative. 68:55 We know that a variance is not negative. 68:57 And so, that's also a nice way you can use that. 69:00 So it's just to say, well, OK, this thing 69:02 is positive semidefinite because it's a covariance matrix. 69:04 So I know it's a variance, OK? 69:06 So I get this. 69:08 Now, if I had some negative numbers-- 69:10 so the effect of that is that when I draw this picture, 69:15 those axes are always positive, which is kind of a weird thing 69:19 to say. 69:19 But what it means is that when I take a vector, v, I rotate it, 69:23 and then I stretch it in the directions of the coordinate, 69:28 I cannot flip it. 69:30 I can only stretch or shrink, but I cannot flip its sign, 69:34 all right? 69:34 But in general, for any symmetric matrices, 69:37 I could do this. 69:38 But when it's positive symmetric definite, 69:40 actually what turns out is that all the lambda 69:43 j's are non-negative. 69:48 I cannot flip it, OK? 69:51 So all the eigenvalues are non-negative. 69:53 69:56 That's a property of positive semidef. 69:58 So when it's symmetric, you have the eigenvalues. 70:00 They can be any number. 70:01 And when it's positive semidefinite, in particular 70:03 that's the case of the covariance matrix 70:05 and the empirical covariance matrix, right? 70:07 Because the empirical covariance matrix 70:08 is an empirical variance, which itself is non-negative. 70:12 And so I get that the eigenvalues are non-negative. 70:17 All right, so principal component analysis is saying, 70:23 OK, I want to find the direction, u, 70:32 that maximizes u transpose Su, all right? 70:38 I've just introduced in one slide 70:40 something about eigenvalues. 70:41 So hopefully, they should help. 70:44 So what is it that I'm going to be getting? 70:47 Well, let's just see what happens. 70:51 Oh, I forgot to mention that-- and I will use this. 70:53 So the lambda j's are called eigenvectors. 70:56 And then the matrix, P, has columns v1 to vd, OK? 71:08 The fact that it's orthogonal-- that P transpose P is equal 71:13 to the identity-- 71:15 means that those guys satisfied that vi transpose 71:20 vj is equal to 0 if i is different from j. 71:27 And vi transpose vi is actually equal to 1, 71:31 right, because the entries of PP transpose 71:33 are exactly going to be of the form, vi transpose vj, OK? 71:38 So those v's are called eigenvectors. 71:40 71:46 And v1 is attached to lambda 1, and v2 is attached to lambda 2, 71:52 OK? 71:53 So let's see what's happening with those things. 71:56 What happens if I take sigma-- 71:58 so if you know eigenvalues, you know exactly what's 72:00 going to happen. 72:01 If I look at, say, sigma times v1, well, what is sigma? 72:06 We know that sigma is PDP transpose v1. 72:15 What is P transpose times v1? 72:17 Well, P transpose has rows v1 transpose, 72:21 v2 transpose, all the way to vd transpose. 
72:26 So when I multiply this by v1, what 72:30 I'm left with is the first coordinate 72:32 is going to be equal to 1 and the second coordinate is 72:38 going to be equal to 0, right? 72:40 Because they're orthogonal to each other-- 72:42 0 all the way to the end. 72:45 So that's when I do P transpose v1. 72:48 Now I multiply by D. Well, I'm just 72:55 multiplying this guy by lambda 1, this guy by lambda 2, 72:58 and this guy by lambda d, so this is really just lambda 1. 73:02 73:04 And now I need to post-multiply by P. 73:12 So what is P times this guy? 73:14 Well, P is v1 all the way to vd. 73:19 And now I multiply by a vector that 73:21 only has 0's except lambda 1 on the first guy. 73:24 So this is just lambda 1 times v1. 73:26 73:29 So what we've proved is that sigma times v1 is lambda 1 v1, 73:34 and that's probably the notion of eigenvalue you're 73:37 most comfortable with, right? 73:39 So just when I multiply by v1, I get 73:41 v1 back multiplied by something, which is the eigenvalue. 73:45 So in particular, if I look at v1, transpose sigma v1, 73:54 what do I get? 73:55 Well, I get lambda 1 v1 transpose v1, 73:58 which is 1, right? 74:00 So this is actually lambda 1 v1 transpose v1, 74:04 which is lambda 1, OK? 74:08 And if I do the same with v2, clearly I'm 74:10 going to get v2 transpose sigma. 74:13 v2 is equal to lambda 2. 74:16 So for each of the vj's, I know that if I 74:19 look at the variance along the vj, 74:21 it's actually exactly given by those eigenvalues, all right? 74:27 Which proves this, because the variance along the eigenvectors 74:38 is actually equal to the eigenvalues. 74:40 So since they're variances, they have to be non-negative. 74:43 So now, I'm looking for the one direction that 74:47 has the most variance, right? 74:50 But that's not only among the eigenvectors. 74:53 That's also among the other directions 74:55 that are in-between the eigenvectors. 74:57 If I were to look only at the eigenvectors, 74:59 it would just tell me, well, just pick the eigenvector, vj, 75:02 that's associated to the largest of the lambda j's. 75:05 But it turns out that that's also true for any vector-- 75:09 that the maximum direction is actually one direction which 75:11 is among the eigenvectors. 75:13 And among the eigenvectors, we know that the one that's 75:16 the largest-- 75:17 that carries the largest variance is 75:18 the one that's associated to the largest eigenvalue, all right? 75:23 And so this is what PCA is going to try to do for me. 75:26 So in practice, that's what I mentioned already, right? 75:29 We're trying to project the point cloud 75:31 onto a low-dimensional space, D prime, 75:34 by keeping as much information as possible. 75:36 And by "as much information," I mean we do not 75:39 want points to collide. 75:41 And so what PCA is going to do is just 75:45 going to try to project [? on two ?] directions. 75:48 So there's going to be a u, and then 75:49 there's going to be something orthogonal to u, and then 75:52 the third one, et cetera, so that once we project on those, 75:55 we're keeping as much of the covariance as possible, OK? 75:59 And in particular, those directions 76:02 that we're going to pick are actually 76:04 a subset of the vj's that are associated to the largest 76:06 eigenvalues. 76:08 So I'm going to stop here for today. 76:11 We'll finish this on Tuesday. 76:15 But basically, the idea is it's just the following. 76:18 You're just going to-- well, let me skip one more. 76:22 Yeah, this is the idea. 
76:24 You're first going to pick the eigenvector associated 76:27 to the largest eigenvalue. 76:30 Then you're going to pick the direction that orthogonal 76:33 to the vector that you've picked, 76:37 and that's carrying the most variance. 76:38 And that's actually the second largest-- 76:40 the eigenvector associated to the second largest eigenvalue. 76:44 And you're going to go all the way to the number of them 76:46 that you actually want to pick, which is in this case, d, OK? 76:50 And wherever you choose to chop this process, 76:53 not going all the way to d, is going to actually give you 76:56 a lower-dimensional representation 76:57 in the coordinate system that's given by v1, v2, v3, et 77:01 cetera, OK? 77:02 So we'll see that in more details on Tuesday. 77:04 But I don't want to get into it now. 77:06 We don't have enough time. 77:07 Are there any questions? 77:10
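Since the lecture stops just before assembling the full procedure, here is a compact sketch of the idea as stated so far (my own Python/NumPy illustration with made-up data, not the course's reference implementation): center the points, form S, diagonalize it, keep the eigenvectors attached to the largest eigenvalues, and read off the low-dimensional coordinates.

```python
import numpy as np

def pca_directions(X, k):
    """Sketch of the procedure described above: return the variances along,
    and the coordinates on, the top-k eigenvectors of the sample covariance."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)              # center: same effect as applying H
    S = Xc.T @ Xc / n                    # sample covariance, d-by-d
    lam, V = np.linalg.eigh(S)           # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(lam)[::-1]        # sort directions by decreasing variance
    lam, V = lam[order], V[:, order]
    Vk = V[:, :k]                        # v_1, ..., v_k: the principal directions
    return lam[:k], Vk, Xc @ Vk          # variances, directions, k-dim coordinates

# Made-up example: 3-dimensional points that mostly vary along one direction.
rng = np.random.default_rng(5)
n = 300
t = rng.normal(size=(n, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(n, 3))

var_top, V2, coords = pca_directions(X, k=2)
print(np.round(var_top, 3))              # the first eigenvalue dominates
print(np.round(V2[:, 0], 3))             # up to sign, close to (2, 1, 0.5) normalized
print(coords.shape)                      # (300, 2): the lower-dimensional representation
```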