https://www.youtube.com/watch?v=JBIz7UadY5M&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=13 Subtitle transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high-quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:19 PHILIPPE RIGOLLET: [INAUDIBLE] minus xi transpose t. 00:26 I just pick whatever notation I want for a variable. 00:30 And let's say it's t. 00:33 So that's the least squares estimator. 00:35 And it turns out that, as I said last time, 00:38 it's going to be convenient to think 00:39 of those things as matrices. 00:42 So here, I already have vectors. 00:44 I've already gone from one dimension, just real valued 00:47 random variables through random vectors when 00:49 I think of each xi, but if I start stacking them together, 00:52 I'm going to have vectors and matrices that show up. 00:56 So the first vector I'm getting is 00:57 y, which is just a vector where I have y1 to yn. 01:04 Then I have-- so that's a boldface vector. 01:07 Then I have x, which is a matrix where I have-- 01:12 well, the first coordinate is always 1. 01:16 So I have 1, and then x1 through xp minus 1, and that's-- sorry, 01:24 x1 through xp minus 1, and that's for observation 1. 01:29 And then I have the same thing all the way down 01:31 for observation n. 01:32 01:40 OK, everybody understands what this is? 01:42 So I'm just basically stacking up all the xi's. 01:47 So this i-th row is xi transpose. 01:55 I am just stacking them up. 01:57 And so if I want to write all these things to be 02:00 true for each of them, all I need to do 02:03 is to write a vector epsilon, which 02:05 is epsilon 1 to epsilon n. 02:08 And what I'm going to have is that y, the boldface vector, 02:11 now is equal to the matrix x times the vector 02:14 beta plus the vector epsilon. 02:18 And it's really just exactly saying 02:20 what's there, because for 2-- so this is a vector, right? 02:23 This is a vector. 02:25 And what is the dimension of this vector? 02:27 02:32 n, so this is n observations. 02:37 And for all these-- for two vectors to be equal, 02:39 I need to have all the coordinates to be equal, 02:41 and that's exactly the same thing as saying that this 02:44 holds for i equal 1 to n. 02:46 02:48 But now, when I have this, I can actually 02:51 rewrite the sum for t equals-- 02:55 sorry, for i equals 1 to n of yi minus xi transpose 03:03 beta squared, this turns out to be 03:05 equal to the Euclidean norm of the vector y minus the matrix x 03:12 times beta squared. 03:14 And I'm going to put a 2 here so we 03:16 know we're talking about the Euclidean norm. 03:19 This just means this is the Euclidean norm. 03:20 03:27 That's the one we've seen before when 03:28 we talked about chi squared-- 03:30 that's the square norm is the sum 03:31 of the square of the coefficients, 03:32 and then I take a square root, but here I 03:34 have an extra square. 03:35 So it's really just the sum of the square of the coefficients, 03:38 which is this. 03:38 And here are the coefficients. 03:40 03:43 So then, now that I write this thing like that, then minimizing-- 03:49 so my goal here, now, is going to be to solve minimum over t 03:54 in R^p of the norm of y minus x times t, squared. 04:05 And just like we did for one dimension, 04:07 we can actually write optimality conditions for this. 04:12 I mean, this is a function. 
04:14 So this is a function from rp to r. 04:23 And if I want to minimize it, all I have to do 04:24 is to take its gradient and set it equal to 0. 04:28 So minimum, set gradient to 0. 04:42 So that's where it becomes a little complicated. 04:45 Now I'm going to have to take the gradient of this norm. 04:49 It might be a little annoying to do. 04:51 But actually, what's nice about those things-- 04:53 I mean, I remember that it was a bit annoying to learn. 04:56 I mean, it's just basically rules of calculus 04:59 that you don't use that much. 05:01 But essentially, you can actually expand this norm. 05:05 And you will see that the rules are basically 05:07 the same as in one dimension, you just 05:09 have to be careful about the fact that matrices do not 05:11 commute. 05:13 So let's expand this thing. 05:15 y minus xt squared-- 05:19 well, this is equal to the norm of y 05:21 squared plus the norm of xt squared plus 2 05:30 times y transpose xt. 05:33 05:36 That's just expanding the square in more dimensions. 05:41 And this, I'm actually going to write as y squared plus-- 05:47 so here, the norm squared of this guy, 05:50 I always have that the norm of x squared 05:53 is equal to x transpose x. 05:56 So I'm going to write this as x transpose 05:58 x, so it's t transpose x transpose xt 06:04 plus 2 times y transpose xt. 06:10 So now, if I'm going to take the gradient with respect to t, 06:13 I have basically three terms, and each of them 06:16 has some sort of a different nature. 06:18 This term is linear in t, and it's 06:21 going to differentiate the same way 06:23 that I differentiate a times x. 06:25 I'm just going to keep the a. 06:28 This guy is quadratic. 06:29 t appears twice. 06:32 And this guy, I'm going to pick up a 2, 06:34 and it's going to differentiate just like when I differentiate 06:37 a times x squared. 06:38 It's 2 times ax. 06:41 And this guy is a constant with respect to t, 06:43 so it's going to differentiate to 0. 06:47 So when I compute the gradient-- 06:49 06:53 now, of course, all of these rules that I give you, you 06:55 can check by looking at the partial derivative with respect 06:58 to each coordinate. 06:59 But arguably, it's much faster to know 07:02 the rules of differentiation. 07:04 It's like if I gave you the function exponential x 07:06 and I said, what is the derivative, 07:08 and you started writing, well, I'm 07:09 going to write exponential x plus h minus exponential x 07:13 divided by h and let h go to 0. 07:15 That's a bit painful. 07:17 AUDIENCE: Why did you transpose your-- 07:19 why does x have to be [INAUDIBLE]? 07:23 PHILIPPE RIGOLLET: I'm sorry? 07:25 AUDIENCE: I was wondering why you times t 07:26 times the [INAUDIBLE]? 07:29 PHILIPPE RIGOLLET: The transpose of ab 07:33 is b transpose a transpose. 07:35 07:38 If you're not sure about this, just 07:40 make a and b have different sizes, 07:42 and then you will see that there's some incompatibility. 07:46 I mean, there's basically only one way to not screw 07:48 that one up, so that's easy to remember. 07:51 So if I take the gradient, then it's going to be equal to what? 07:54 It's going to be 0 plus-- we said here, 07:58 this is going to differentiate like-- so 07:59 think a times x squared. 08:05 So I'm going to have 2ax. 08:06 So here, basically, this guy is going to go to x transpose xt. 
08:13 Now, I could have made this one go away, 08:17 but that's the same thing as saying that my gradient is-- 08:20 I can think of my gradient as being 08:21 either a horizontal vector or a vertical vector. 08:24 So if I remove this guy, I'm thinking of my gradient 08:26 as being horizontal. 08:27 If I remove that guy, I'm thinking of my gradient 08:30 as being vertical. 08:31 And that's what I want to think of, typically-- 08:33 vertical vectors, column vectors. 08:36 And then this guy, well, it's like these guys just 08:39 think a times x. 08:42 So the derivative is just a, so I'm going 08:44 to keep only that part here. 08:47 Sorry, I forgot a minus somewhere-- yeah, here. 08:49 08:55 Minus 2y transpose x. 08:59 And what I want is this thing to be equal to 0. 09:02 09:06 So t, the optimal t, is called beta hat and satisfies-- 09:20 09:24 well, I can cancel the 2's and put the minus 09:27 on the other side, and what I get 09:29 is that x transpose xt is equal to y transpose x. 09:36 09:44 Yeah, that's not working for me. 09:48 Yeah, that's because when I took the derivative, 09:50 I still need to make sure-- 09:51 so it's the same question of whether I want 09:53 things to be columns or rows. 09:55 So this is not a column. 09:57 If I remove that guy, y transpose x is a row. 10:01 So I'm just going to take the transpose of this guy 10:03 to make things work, and this is just going to be x transpose y. 10:07 10:11 And this guy is x transpose y so that I have columns. 10:14 10:19 So this is just a linear equation in t. 10:23 And I have to solve it, so it's of the form some matrix times 10:26 t is equal to another vector. 10:29 And so that's basically a linear system. 10:31 And the way to solve it, at least formally, 10:33 is to just take the inverse of the matrix on the left. 10:36 So if x transpose x is invertible, then-- sorry, 10:45 beta hat is the t I want. 10:48 I get that beta hat is equal to x transpose 10:51 x inverse x transpose y. 10:54 10:57 And that's the least squares estimator. 10:59 11:12 So here, I use this condition. 11:16 I want it to be invertible so I can actually write its inverse. 11:21 Here, I wrote, rank of x is equal to p. 11:25 What is the difference? 11:26 11:36 Well, there's basically no difference. 11:40 Basically, here, I have to assume-- 11:45 what is the size of the matrix x transpose x? 11:47 11:52 [INTERPOSING VOICES] 11:53 PHILIPPE RIGOLLET: Yeah, so what is the size? 11:55 AUDIENCE: p by p. 11:56 PHILIPPE RIGOLLET: p by p. 11:58 So this matrix is invertible if it has rank p, 12:00 if you know what rank means. 12:02 If you don't, rank p just means that it's invertible. 12:05 So it's full rank and it's invertible. 12:07 And the rank of x transpose x is actually 12:10 just the rank of x because this is the same matrix that you 12:13 apply twice. 12:14 And that's all it's saying. 12:15 So if you're not comfortable with the notion of rank 12:18 that you see here, just think of this condition 12:21 just being the condition that x transpose x is invertible. 12:25 And that's all it says. 12:26 What it means for it to be invertible-- this was true. 12:29 We made no assumption up to this point. 12:32 If x transpose x is not invertible, it means that there 12:35 might be multiple solutions to this equation. 12:38 In particular, for a matrix to not be invertible, 12:42 it means that there's some vector v. 
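Before the discussion of what goes wrong when x transpose x is not invertible, here is a minimal numerical sketch of the normal equations and the closed-form estimator just derived. The data, sizes, and seed below are made up for illustration; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: n = 20 observations, p = 3 (intercept + 2 covariates).
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.3 * rng.normal(size=n)

# Closed form from the board: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent (and numerically preferable): solve the normal equations
# X^T X t = X^T y directly, or let lstsq minimize ||y - X t||_2^2 for us.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat, beta_solve, beta_lstsq)  # all three agree up to rounding
```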
12:45 So if x transpose x is not invertible, 12:55 then this is equivalent to there exists a vector 13:00 v, which is not 0, and such that x transpose xv is equal to 0. 13:07 That's what it means to not be invertible. 13:10 So in particular, if beta hat is a solution-- 13:13 13:22 so this equation is sometimes called score equations, 13:26 because the gradient is called the score, 13:28 and so you're just checking if the gradient is equal to 0. 13:31 So if beta hat satisfies star, then so 13:33 does beta hat plus lambda v for all lambda in the real line. 13:46 13:51 And the reason is because, well, if I start looking at-- 13:54 what is x transpose x times beta hat plus lambda v? 14:02 Well, by linearity, this is just x transpose x 14:08 beta hat plus lambda x transpose x times v. 14:16 But this guy is what? 14:17 14:22 It's 0, just because that's what we assumed. 14:27 We assumed that x transpose xv was equal to 0, 14:31 so we're left only with this part, which, by star, is just 14:34 x transpose y. 14:35 14:40 So that means that x transpose x beta hat plus lambda v 14:44 is actually equal to x transpose y, which means that there's 14:48 another solution, which is not just beta hat, 14:50 but any move of beta hat along this direction v by any size. 14:56 So that's going to be an issue, because you're 14:58 looking for one estimator. 15:00 And there's not just one, in this case, there's many. 15:03 And so this is not going to be well-defined 15:05 and you're going to have some issues. 15:07 So if you want to talk about the least squares estimator, 15:09 you have to make this assumption. 15:13 What does it imply in terms of, can I 15:15 think of p being 2n, for example, in this case? 15:18 What happens if p is equal to 2n? 15:20 15:27 AUDIENCE: Well, then the rank of your matrix is only p/2. 15:31 PHILIPPE RIGOLLET: So the rank of your matrix is only p/2, 15:33 so that means that this is actually not going to happen. 15:36 I mean, it's not only p/2, it's at most p/2. 15:39 It's at most the smaller of the two dimensions of your matrix. 15:42 So if your matrix is n times 2n, it's 15:44 at most n, which means that it's not going to be full rank, 15:47 so it's not going to be invertible. 15:49 So every time the dimension p is larger than the sample size, 15:53 your matrix is not invertible, and you cannot talk about 15:56 the least squares estimator. 15:57 So that's something to keep in mind. 15:59 And it's actually a very simple thing. 16:01 It's essentially saying, well, if p is larger than n, 16:05 it means that you have more parameters 16:07 to estimate than you have equations to estimate them. 16:11 So you have this linear system. 16:12 There's one equation per observation. 16:17 Each row, which was each observation, 16:19 was giving me one equation. 16:21 But then the number of unknowns in this linear system is p, 16:24 and so I cannot solve linear systems that have more unknowns 16:28 than they have equations. 16:30 And so that's basically what's happening. 16:32 Now, in practice, if you think about what 16:34 data sets look like these days, for example, 16:36 people are trying to express some phenotype. 16:38 So phenotype is something you can measure on people-- maybe 16:41 the color of your eyes, or your height, 16:43 or whether you have diabetes or not, things like this, 16:47 so things that are macroscopic. 16:51 And then they want to use the genotype to do that. 
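A small sketch of the failure mode just described: with p larger than n, the matrix X^T X is singular, and any solution of the normal equations stays a solution after shifting it along a null-space direction v. Made-up sizes and numbers; numpy assumed.

```python
import numpy as np

rng = np.random.default_rng(1)

# More unknowns than observations: n = 5, p = 10.
n, p = 5, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

G = X.T @ X                        # p x p, but rank at most n < p
print(np.linalg.matrix_rank(G))    # 5, not 10: G is not invertible

# One particular solution of the normal equations (the minimum-norm one).
beta_hat = np.linalg.pinv(X) @ y

# A direction v with X^T X v = 0: the last right singular vector of X
# corresponds to a zero singular value, so it lies in the null space of X.
_, _, Vt = np.linalg.svd(X)
v = Vt[-1]

# beta_hat + lambda * v solves the normal equations for every lambda.
for lam in (0.0, 3.0, -7.5):
    print(np.allclose(G @ (beta_hat + lam * v), X.T @ y))   # True each time
```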
16:53 They want to measure your-- they want to sequence your genome 16:55 and try to use this to predict whether you're going 16:58 to be responsive to a drug or whether your eyes are 17:01 going to be blue, or something like this. 17:03 Now, the data sets that you can have-- 17:05 people, maybe, for a given study about some sort of disease. 17:09 Maybe you will sequence the genome of maybe 100 people. 17:15 n is equal to 100. 17:17 p is basically the number of genes they're sequencing. 17:21 This is of the order of 100,000. 17:23 So you can imagine that this is a case where n is much, 17:26 much smaller than p, and you cannot talk about the least 17:28 squares estimator. 17:29 There's plenty of them. 17:31 There's not just one line like that, 17:33 lambda times v that you can move away. 17:36 There's basically an entire space in which you can move, 17:40 and so it's not well-defined. 17:42 So at the end of this class, I will give you 17:43 a short introduction on how you do this. 17:46 This actually represents more and more-- 17:49 it becomes a more and more preponderant part of the data 17:51 sets you have to deal with, because people just 17:53 collect data. 17:54 When I do the sequencing, the machine 17:57 allows me to sequence 100,000 genes. 17:59 I'm not going to stop at 100 because doctors are never 18:03 going to have cohorts of more than 100 patients. 18:06 So you just collect everything you can collect. 18:08 And this is true for everything. 18:11 Cars have sensors all over the place, 18:13 much more than they actually gather data. 18:15 There's data, there's-- we're creating, 18:16 we're recording everything we can. 18:18 And so we need some new techniques for that, 18:20 and that's what high-dimensional statistics is trying to answer. 18:23 So this is way beyond the scope of this class, 18:25 but towards the end, I will give you 18:27 some hints about what can be done in this framework 18:29 because, well, this is the new reality we have to deal with. 18:34 So here, we're in a case where p's less than n 18:37 and typically much smaller than n. 18:38 So the kind of orders of magnitude you want to have 18:40 is maybe p's of order 10 and n's of order 100, something 18:46 like this. 18:46 So you can scale that, but maybe 10 times larger. 18:50 So maybe you cannot solve this guy for beta hat, but actually, 18:57 you can talk about x times beta hat, 19:00 even if p is larger than n. 19:02 And the reason is that x times beta hat 19:06 is actually something that's very well-defined. 19:09 So what is x times beta hat? 19:11 Remember, I started with the model. 19:16 So if I look at this definition, essentially, what I 19:20 had as the original thing was that the vector 19:24 y was equal to x times beta plus the vector epsilon. 19:29 That was my model. 19:32 So beta is actually giving me something. 19:36 Beta is actually some parameter, some coefficients 19:39 that are interesting. 19:40 But a good estimator for-- so here, it 19:43 means that the observations that I have 19:45 are of the form x times beta plus some noise. 19:48 So if I want to adjust the noise, remove the noise, 19:51 a good candidate to do that is x times beta hat. 19:57 x times beta hat is something that should actually 19:59 be useful to me, which should be close to x times beta. 
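A quick numerical illustration of the point being made, with made-up data and numpy assumed: the fitted vector X beta_hat is never farther from the true X beta than the noisy observation y is, so it behaves like a denoised version of y, and it stays well-defined even when beta_hat itself is not unique.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# X beta_hat is the projection of y onto the column span of X, so it is at
# least as close to the (unknown) X beta_true as the noisy y itself is.
print(np.linalg.norm(fitted - X @ beta_true)
      <= np.linalg.norm(y - X @ beta_true))            # True

# Even when beta_hat is not unique (p > n), the fitted values are: two
# different solutions of the normal equations give the same X beta_hat.
X_wide = rng.normal(size=(5, 10))
y_wide = rng.normal(size=5)
b1 = np.linalg.pinv(X_wide) @ y_wide
_, _, Vt = np.linalg.svd(X_wide)
b2 = b1 + 4.2 * Vt[-1]              # shift along a null-space direction
print(np.allclose(X_wide @ b1, X_wide @ b2))            # True: same fitted vector
```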
20:10 So in the one-dimensional case, what it means is that if I 20:13 have-- let's say this is the true line, 20:16 and these are my x's, so I have-- 20:19 these are the true points on the real line, 20:22 and then I have my little epsilon 20:24 that just give me my observations that 20:26 move around this line. 20:28 So this is one of epsilons, say epsilon i. 20:34 Then I can actually either talk-- 20:37 to say that I recovered the line, 20:39 I can actually talk about recovering the right intercept 20:42 or recovering the right slope for this line. 20:44 Those are the two parameters that I need to recover. 20:46 But I can also say that I've actually 20:48 found a set of points that's closer 20:50 to being on the line that are closer to this set of points 20:56 right here than the original crosses that I observed. 21:00 So if we go back to the picture here, 21:03 for example, what I could do is say, well, for this point 21:08 here-- 21:09 there was an x here-- 21:11 rather than looking at this dot, which was my observation, 21:14 I can say, well, now that I've estimated the red line, 21:17 I can actually just say, well, this point 21:19 should really be here. 21:21 And actually, I can move all these dots 21:23 so that they're actually on the red line. 21:26 And this should be a better value, something 21:28 that has less noise than the original y value 21:30 that I should see. 21:32 It should be close to the true value 21:33 that I should be seeing without the extra noise. 21:37 So that's definitely something that could be of interest. 21:40 21:43 For example, in imaging, you're not 21:45 trying to understand-- so when you do imaging, 21:48 y is basically an image. 21:50 So think of a pixel image, and you just 21:53 stack it into one long vector. 21:55 And what you see is something that 21:57 should look like some linear combination of some feature 21:59 vectors, maybe. 22:01 So there's people created a bunch of features. 22:05 They're called, for example, Gabor frames or wavelet 22:09 transforms-- so just well-known libraries of variables x such 22:14 that when you take linear combinations of those guys, 22:17 this should looks like a bunch of images. 22:19 And what you want for your image-- 22:22 you don't care what the coefficients of the image 22:24 are in these bases that you came up with. 22:26 What you care about is the noise in the image. 22:28 And so you really want to get x beta. 22:31 So if you want to estimate x beta, 22:34 well, you can use x beta hat. 22:36 What is x beta hat? 22:37 Well, since beta hat is x transpose x inverse x transpose 22:42 y, this is x transpose. 22:44 22:48 That's my estimator for x beta. 22:50 22:54 Now, this thing, actually, I can define 22:59 even if I'm not low rank. 23:01 So why is this thing interesting? 23:03 Well, there's a formula for this estimator, 23:08 but actually, I can visualize what this thing is. 23:10 23:18 So let's assume, for the sake of illustration, 23:22 that n is equal to 3. 23:26 23:29 So that means that y lives in a three-dimensional space. 23:33 And so let's say it's here. 23:36 And so I have my, let's say, y's here. 23:43 And I also have a plane that's given 23:48 by the vectors x1 transpose x2 transpose, which 23:55 is, by the way, 1-- 23:58 sorry, that's not what I want to do. 24:01 24:04 I'm going to say that n is equal to 3 and that p is equal to 2. 24:10 So I basically have two vectors, 1, 1 and another one, 24:18 let's assume that it's, for example, abc. 24:25 So those are my two vectors. 
24:27 This is x1, and this is x2. 24:33 24:36 And those are my three observations for this guy. 24:39 So what I want when I minimize this, 24:48 I'm looking at the point which can 24:50 be formed as the linear combination of the columns 24:52 of x, and I'm trying to find the guy that's the closest to y. 24:57 So what does it look like? 24:58 Well, the two points, 1, 1, 1 is going to be, say, here. 25:01 That's the point 1, 1, 1. 25:04 And let's say that abc is this point. 25:06 25:14 So now I have a line that goes through those two guys. 25:17 25:20 That's not really-- let's say it's 25:22 going through those two guys. 25:24 And this is the line which can be formed by looking only 25:27 at linear combination. 25:28 So this is the line of x times t for t in r2. 25:36 That's this entire line that you can get. 25:39 Why is it-- yeah, sorry, it's not just a line, 25:42 I also have to have t, all the 0's thing. 25:45 So that actually creates an entire plane, 25:48 which is going to be really hard for me to represent. 25:54 I don't know. 25:55 I mean, maybe I shouldn't do it in these dimensions. 26:00 26:05 So I'm going to do it like that. 26:08 26:11 So this plane here is the set of xt for t and r2. 26:14 26:17 So that's a two-dimensional plane, definitely goes to 0, 26:22 and those are all these things. 26:23 So think of a sheet of paper in three dimensions. 26:25 Those are the things I can get. 26:27 So now, what I'm going to have as y 26:29 is not necessarily in this plane. 26:32 y is actually something in this plane, x beta 26:39 plus some epsilon. 26:40 26:44 y is x beta plus epsilon. 26:50 So I start from this plane, and then 26:51 I have this epsilon that pushes me, 26:53 maybe, outside of this plane. 26:54 And what least squares is doing is saying, 26:56 well, I know that epsilon should be fairly small, 26:59 so the only thing I'm going to be doing that actually makes 27:01 sense is to take y and find the point that's 27:04 on this plane that's the closest to it. 27:06 And that corresponds to doing an orthogonal projection of y 27:10 onto this thing, and that's actually exactly x beta hat. 27:13 27:18 So in one dimension, just because this is actually 27:21 a little hard-- 27:22 in one dimension, so that's if p is equal to 1. 27:34 So let's say this is my point. 27:36 And then I have y, which is in two dimensions, 27:38 so this is all on the plane. 27:40 27:42 What it does, this is my-- 27:44 the point that's right here is actually x beta hat. 27:48 That's how you find x beta hat. 27:49 You take your point y and you project it 27:51 on the linear span of the columns of x. 27:54 And that's x beta hat. 27:56 This does not tell you exactly what beta should be. 27:59 And if you know a little bit of linear algebra, 28:00 it's pretty clear, because if you want to find beta hat, 28:04 that means that you should be able to find 28:06 the coordinates of a point in the system of columns of x. 28:12 And if those guys are redundant, there's 28:13 not going to be unique coordinates for these guys, 28:17 so that's why it's actually not easy to find. 28:19 But x beta hat is uniquely defined. 28:21 It's a projection. 28:21 Yeah? 28:22 AUDIENCE: And epsilon is the distance 28:24 between the y and the-- 28:25 PHILIPPE RIGOLLET: No, epsilon is the vector that goes from-- 28:29 so there's a true x beta. 28:33 That's the true one. 28:36 It's not clear. 28:36 I mean, x beta hat is unlikely to be exactly equal to x beta. 28:41 And then the epsilon is the one that starts from this line. 
28:44 It's the vector that pushes you away. 28:46 So really, this is this vector. 28:48 That's epsilon. 28:50 So it's not a length. 28:51 The length of epsilon is the distance, 28:54 but epsilon is just the actual vector that takes you 28:57 from one to the other. 28:58 29:01 So this is all in two dimensions, 29:03 and it's probably much clearer than what's here. 29:05 29:09 And so here, I claim that this x beta hat-- 29:12 so from this picture, I implicitly 29:15 claim that forming this operator that takes y 29:22 and maps it into this vector x times x transpose x inverse x transpose y, 29:26 this should actually be equal to the projection of y 29:33 onto the linear span of the columns of x. 29:44 That's what I just drew for you. 29:46 And what it means is that this matrix 29:48 must be the projection matrix. 29:49 29:54 So of course, anybody-- 29:56 who knows linear algebra here? 29:59 OK, wow. 30:01 So what are the conditions that a projection matrix 30:04 should be satisfying? 30:06 AUDIENCE: Squares to itself. 30:08 PHILIPPE RIGOLLET: Squares to itself, right. 30:09 If I project twice, I'm not moving. 30:11 If I keep on iterating projection, 30:13 once I'm in the space I'm projecting onto, 30:15 I'm not moving. 30:16 What else? 30:17 30:24 Do they have to be symmetric, maybe? 30:28 AUDIENCE: If it's an orthogonal projection. 30:29 PHILIPPE RIGOLLET: Yeah, so this is an orthogonal projection. 30:32 It has to be symmetric. 30:34 And that's pretty much it. 30:36 So from those things, you can actually 30:38 get quite a bit. 30:39 But what's interesting is that if you actually 30:41 look at the eigenvalues of this matrix, 30:44 they should be either 0 or 1, essentially. 30:47 And they are 1 if the eigenvector associated 30:52 is within this space, and 0 otherwise. 30:55 And so that's basically what you can check. 30:56 This is not an exercise in linear algebra, 30:58 so I'm not going to go too much into those details. 31:00 But this is essentially what you want to keep in mind. 31:03 What's associated to orthogonal projections 31:05 is Pythagoras theorem. 31:07 And that's something that's going to be useful for us. 31:10 What it's essentially telling us is that if I 31:12 look at this norm squared, it's equal to this norm squared-- 31:16 sorry, this norm squared plus this norm squared 31:18 is equal to this norm squared. 31:20 And that's something-- the norm of y squared. 31:22 So Pythagoras tells me that the norm of y squared 31:32 is equal to the norm of x beta hat squared plus the norm of y 31:40 minus x beta hat squared. 31:41 31:46 Agreed? 31:47 It's just because I have a right angle here. 31:51 So that's this plus this is equal to this. 31:54 31:58 So now, to define this, I made no assumption. 32:02 Epsilon could be as wild as it wants. 32:04 I was just crossing my fingers that epsilon was actually 32:07 small enough that it would make sense 32:09 to project onto the linear span, because I implicitly 32:13 assumed that epsilon did not take me all the way there, 32:16 so that actually, it makes sense to project back. 32:19 And so for that, I need to somehow make assumptions 32:22 that epsilon is well-behaved and that it's not 32:24 completely wild, that it's moving uniformly 32:31 in all directions of the space. 32:33 There's no privileged direction where 32:34 it's always going, otherwise, I'm 32:36 going to make a systematic error. 32:37 And I need that those epsilons are going to average somehow. 
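The picture just drawn can be checked numerically. Below is a small sketch, on made-up data with numpy assumed, verifying that H = X (X^T X)^{-1} X^T is a symmetric, idempotent matrix with eigenvalues 0 and 1, and that the Pythagoras identity ||y||^2 = ||X beta_hat||^2 + ||y - X beta_hat||^2 holds.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

# Hat / projection matrix onto the column span of X.
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))         # symmetric
print(np.allclose(H @ H, H))       # idempotent: projecting twice = once
eig = np.sort(np.linalg.eigvalsh(H))
print(np.allclose(eig[-p:], 1.0), np.allclose(eig[:-p], 0.0))  # eigenvalues 0 or 1
print(np.isclose(np.trace(H), p))  # trace = dimension of the span = p

# Pythagoras: y splits into its projection and an orthogonal residual.
fitted = H @ y
resid = y - fitted
print(np.isclose(np.linalg.norm(y) ** 2,
                 np.linalg.norm(fitted) ** 2 + np.linalg.norm(resid) ** 2))
```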
32:42 So here are the assumptions we're 32:44 going to be making so that we can actually 32:46 do some statistical inference. 32:48 The first one is that the design matrix is deterministic. 32:53 So I started by saying the x-- 32:55 I have xi, yi, and maybe they're independent. 32:58 Here, they are, but the xi's, I want to think of as deterministic. 33:03 If they're not deterministic, I can condition on them, 33:06 but otherwise, it's very difficult 33:08 to think about this thing if I think of those entries 33:11 as being random, because then I have 33:14 the inverse of a random matrix, and things become very, very 33:17 complicated. 33:18 So we're going to think of those guys as being deterministic. 33:21 We're going to think of the model as being homoscedastic. 33:27 And actually, let me come back to this in a second. 33:29 Homoscedastic-- well, I mean, if you're 33:31 trying to find the etymology of this word, 33:34 "homo" means the same, "scedastic" means scaling. 33:38 So what I want to say is that the epsilons 33:40 have the same scaling. 33:41 And since my third assumption is that epsilon is Gaussian, then 33:46 essentially, what I'm going to want is that they all share 33:49 the same sigma squared. 33:52 They're independent, so this is indeed sigma squared times the identity 33:55 covariance matrix. 33:56 And I want them to be centered, as well. 33:58 That means that there's no direction 34:00 that I'm always privileging when I'm moving away from my plane 34:04 there. 34:05 So these are important conditions. 34:09 It depends on how much inference you want to do. 34:13 If you want to write t-tests, you need all these assumptions. 34:16 But if you only want to write, for example, the fact 34:19 that your least squares estimator is consistent, 34:23 you really just need the fact that epsilon 34:25 has variance sigma squared. 34:27 The fact that it's Gaussian won't matter, just 34:29 like Gaussianity doesn't matter for the law of large numbers. 34:33 Yeah? 34:34 AUDIENCE: So the first assumption 34:35 that x has to be deterministic, but I just 34:38 made up this x1, x2-- 34:40 PHILIPPE RIGOLLET: x is the matrix. 34:41 AUDIENCE: Yeah. 34:42 So those are random variables, right? 34:45 PHILIPPE RIGOLLET: No, that's the assumption. 34:47 AUDIENCE: OK. 34:49 So I mean, once we collect the data and put it in the matrix, 34:52 it becomes deterministic. 34:54 So maybe I'm missing something. 34:55 PHILIPPE RIGOLLET: Yeah. 34:56 So this is for the purpose of the analysis. 35:00 I can actually assume that-- 35:01 I look at my data, and I think of this. 35:04 So what is the difference between thinking 35:06 of data as deterministic or thinking of it as random? 35:08 When I talked about random data, the only assumptions 35:11 that I made were about the distribution. 35:12 I said, well, if my x is a random variable, 35:14 I want it to have this variance and I want it to have, 35:16 maybe, this distribution, things like this. 35:19 Here, I'm actually making an assumption on the values 35:25 that I see. 35:25 I'm saying that the value that you give me is-- 35:30 the matrix is actually invertible. 35:32 x transpose x will be invertible. 35:33 So I've never done that before, assuming 35:36 that some random variable-- 35:38 assuming that some Gaussian random variable was positive, 35:41 for example. 35:43 We don't do that, because there's always some probability 35:45 that things don't happen if you make things at random. 35:49 And so here, I'm just going to say, OK, forget about-- 35:52 here, it's basically a little stronger. 
35:54 I start my assumption by saying, the data that's given to me 35:58 will actually satisfy those assumptions. 36:00 And that means that I don't actually 36:02 need to make some modeling assumption on this thing, 36:05 because I'm actually putting directly 36:06 the assumption I want to see. 36:08 36:12 So here, either I know sigma squared 36:14 or I don't know sigma squared. 36:16 So is that clear? 36:16 So essentially, I'm assuming that I have this model, where 36:21 this guy, now, is deterministic, and this 36:26 is some multivariate Gaussian with mean 0 36:30 and covariance matrix identity of rn. 36:33 That's the model I'm assuming. 36:36 And I'm observing this, and I'm given this matrix x. 36:40 Where does this make sense? 36:42 You could say, well, if I think of my rows as being people 36:44 and I'm collecting genes, it's a little intense 36:48 to assume that I actually know, ahead of time, 36:50 what I'm going to be seeing, and that those things are 36:51 deterministic. 36:52 That's true, but it still does not prevent the analysis 36:55 to go through, for one. 36:56 And second, a better example might be this imaging example 37:00 that I described, where those x's are actually libraries. 37:04 Those are libraries of patterns that people 37:07 have created, maybe from deep learning nets, 37:09 or something like this. 37:10 But they've created patterns, and they 37:12 say that all images should be representable as a linear 37:14 combination of those patterns. 37:16 And those patterns are somewhere in books, 37:18 so they're certainly deterministic. 37:19 Everything that's actually written down in a book 37:21 is as deterministic as it gets. 37:22 37:29 Any questions about those assumptions? 37:30 Those are the things we're going to be working with. 37:32 There's only three of them. 37:33 One is about x. 37:35 Actually, there's really two of them. 37:37 I mean, this guy already appears here. 37:41 So there's two-- one on the noise, one on the x's. 37:44 That's it. 37:45 37:48 Those things allow us to do quite a bit. 37:51 They will allow us to-- 37:52 37:55 well, that's actually-- they allow 38:02 me to write the distribution of beta hat, which is great, 38:09 because when I know the distribution of my estimator, 38:12 I know it's fluctuations. 38:14 If it's centered around the true parameter, 38:16 I know that it's going to be fluctuating 38:19 around the true parameter. 38:20 And it should tell me what kind of distribution 38:22 the fluctuations are. 38:23 I actually know how to build confidence intervals. 38:26 I know how to build tests. 38:27 I know how to build everything. 38:29 It's just like when I told you that asymptotically, 38:31 the empirical variance was Gaussian 38:33 with mean theta and standard deviation that depended on n, 38:39 et cetera, that's basically the only thing I needed. 38:42 And this is what I'm actually getting here. 38:44 So let me start with this statement. 38:49 So remember, beta hat satisfied this, 38:52 so I'm going to rewrite it here. 38:53 38:57 So beta hat was equal to x transpose 39:02 x inverse x transpose y. 39:07 That was the definition that we found. 39:09 And now, I also know that y was equal to x beta plus epsilon. 39:17 So let me just replace y by x beta plus epsilon here. 39:20 Yeah? 39:21 AUDIENCE: Isn't it x transpose x inverse x transpose y? 39:25 PHILIPPE RIGOLLET: Yes, x transpose. 39:26 Thank you. 39:27 39:31 So I'm going to replace y by x beta plus epsilon. 39:36 So that's-- and here comes the magic. 
39:58 I have an inverse of a matrix, and then 40:00 I have the true matrix, I have the original matrix. 40:03 So this is actually the identity times beta. 40:08 And now this guy, well, this is a Gaussian, 40:11 because this is a Gaussian random vector, 40:13 and I just multiply it by a deterministic matrix. 40:18 So we're going to use the rule that if I have, say, epsilon, 40:22 which is n0 sigma, then b times epsilon is n0-- 40:29 can somebody tell me what the covariance matrix of b epsilon 40:32 is? 40:32 40:35 AUDIENCE: What is capital B in this case? 40:37 PHILIPPE RIGOLLET: It's just a matrix. 40:38 40:42 And for any matrix, I mean any matrix that I can premultiply-- 40:46 that I can postmultiply with epsilon. 40:48 Yeah? 40:49 AUDIENCE: b transpose b. 40:50 PHILIPPE RIGOLLET: b transpose? 40:52 AUDIENCE: Times b. 40:53 PHILIPPE RIGOLLET: And sigma is gone. 40:55 AUDIENCE: Oh, times sigma, sorry. 40:57 PHILIPPE RIGOLLET: That's the matrix, right? 40:59 AUDIENCE: b transpose sigma b. 41:00 PHILIPPE RIGOLLET: Almost. 41:01 41:04 Anybody wants to take a guess at the last one? 41:07 I think we've removed all other possibilities. 41:12 It's b sigma b transpose. 41:15 41:20 So if you ever answered to the question, 41:24 do you know Gaussian random vectors, 41:26 but you did not know that, there's a gap in your knowledge 41:29 that you need to fill, because that's probably 41:31 the most important property of Gaussian vectors. 41:33 When you multiply them by matrices, 41:38 you have a simple rule on how to update the covariance matrix. 41:43 So here, sigma is the identity. 41:49 And here, this is the matrix b that I had here. 41:53 So what this is is, basically, n, some multivariate n, 41:58 of course. 41:59 Then I'm going to have 0. 42:00 And so what I need to do is b times the identity times b 42:04 transpose, which is just b b transpose. 42:07 And what is it going to tell me? 42:08 It's x transpose x-- 42:12 sorry, that's inverse-- inverse x transpose, and then 42:17 the transpose of this guy, which is x x 42:21 transpose x inverse transpose. 42:25 But this matrix is symmetric, so I'm actually 42:27 not going to make the transpose of this guy. 42:30 And again, magic shows up. 42:34 Inverse times the matrix of those two guys 42:36 cancel, and so this is actually equal to beta 42:38 plus some n0 x transpose x inverse. 42:43 42:46 Yeah? 42:47 AUDIENCE: I'm a little lost on the [INAUDIBLE].. 42:49 So you define that as the b matrix, and what happens? 42:52 PHILIPPE RIGOLLET: So I just apply this rule, right? 42:54 AUDIENCE: Yeah. 42:55 PHILIPPE RIGOLLET: So if I multiply a matrix 42:57 by a Gaussian, then let's say this Gaussian had 43:01 mean 0, which is the case of epsilon here, 43:05 then the covariance matrix that I get 43:07 is b times the original covariance matrix times b 43:10 transpose. 43:11 So all I did is write this matrix times the identity 43:15 times this matrix transpose. 43:18 And the identity, of course, doesn't play any role, 43:20 so I can remove it. 43:21 It's just this matrix, then the matrix transpose. 43:23 And what happened? 43:25 So what is the transpose of this matrix? 43:27 So I used the fact that if I look at x transpose x inverse x 43:32 transpose, and now I look at the whole transpose of this thing, 43:39 that's actually equal 2. 43:40 And I use the rule that ab transpose is b transpose 43:43 a transpose-- let me finish-- 43:46 and it's x x transpose x inverse. 43:51 43:55 Yes? 43:55 AUDIENCE: I thought the-- 43:58 for epsilon, it was sigma squared. 
44:00 PHILIPPE RIGOLLET: Oh, thank you. 44:01 There's a sigma squared somewhere. 44:03 So this was sigma squared times the identity, so I can just 44:08 pick up a sigma squared anywhere. 44:10 44:14 So here, in our case, so for epsilon, this is sigma. 44:28 Sigma squared times the identity, 44:30 that's my covariance matrix. 44:31 44:33 You seem perplexed. 44:35 AUDIENCE: It's just a new idea for me 44:37 to think of a maximum likelihood estimator as a random variable. 44:41 PHILIPPE RIGOLLET: Oh, it should not be. 44:43 Any estimator is a random variable. 44:45 AUDIENCE: Oh, yeah, that's a good point. 44:48 PHILIPPE RIGOLLET: [LAUGHS] And I have not 44:52 told you that this was the maximum likelihood 44:54 estimator just yet. 44:55 The estimator is a random variable. 44:58 There's a word-- some people use estimate just 45:00 to differentiate the estimator while you're 45:03 doing the analysis with random variables and the values 45:05 when you plug in the numbers in there. 45:09 But then, of course, people use estimate because it's shorter, 45:12 so then it's confusing. 45:14 So any questions about this computation? 45:17 Did I forget any other Greek letter along the way? 45:20 All right, I think we're good. 45:22 So one thing that it says-- and actually, 45:26 thank you for pointing this out-- 45:27 I said there's actually a little hidden statement there. 45:30 By the way, this answers this question. 45:33 Beta hat is of the form beta plus something that's centered, 45:35 so it's indeed of the form Gaussian with mean beta 45:39 and covariance matrix sigma squared x transpose x inverse. 45:41 45:45 So that's very nice. 45:47 As long as x transpose x is not huge, 45:50 I'm going to have something that is close to what I want. 45:55 Oh, sorry, x transpose x inverse is not huge. 45:58 46:01 So there's a hidden claim in there, 46:05 which is that least squares estimator 46:08 is equal to the maximum likelihood estimator. 46:11 46:15 Why does the maximum likelihood estimator just 46:17 enter the picture now? 46:19 We've been talking about regression for the past 18 46:23 slides. 46:24 And we've been talking about estimators. 46:26 And I just dumped on you the least squares estimator, 46:29 but I never really came back to this thing that we know-- 46:31 maybe the method of moments, or maybe the maximum likelihood 46:35 estimator. 46:35 It turns out that those two things are the same. 46:37 But if I want to talk about a maximum likelihood estimator, 46:41 I need to have a likelihood. 46:43 In particular, I need to have a density. 46:46 And so if I want a density, I have 46:47 to make those assumptions, such as the epsilons have 46:53 this Gaussian distribution. 46:55 So why is this the maximum likelihood estimator? 46:58 Well, remember, y is x transpose beta plus epsilon. 47:04 So I actually have a bunch of data. 47:07 So what is my model here? 47:14 Well, its the family of Gaussians 47:18 on n observations with mean x beta, variance sigma 47:22 squared identity, and beta lives in rp. 47:31 Here's my family of distributions. 47:34 That's the possible distributions for y. 47:38 And so in particular, I can write the density of y. 47:41 47:47 Well, what is it? 47:48 It's something that looks like p of x-- 47:52 well, p of y, let's say, is equal to 1 over-- 47:58 so now its going to be a little more complicated, 48:00 but its sigma squared times 2 pi to the p/2 exponential minus 48:17 norm of y minus x beta squared divided by 2 sigma squared. 48:26 So that's just the multivariate Gaussian density. 
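Before moving on with the likelihood, the distributional claim just derived, beta_hat ~ N(beta, sigma^2 (X^T X)^{-1}), is easy to check by simulation. A rough Monte Carlo sketch, with made-up X, beta, and sigma and numpy assumed:

```python
import numpy as np

rng = np.random.default_rng(4)

n, p, sigma = 50, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # kept fixed: deterministic design
beta = np.array([1.0, -2.0, 0.5])

reps = 20000
estimates = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)          # fresh Gaussian noise each time
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate

theory_cov = sigma ** 2 * np.linalg.inv(X.T @ X)

print(estimates.mean(axis=0))           # close to beta: the estimator is unbiased
print(np.cov(estimates, rowvar=False))  # close to sigma^2 (X^T X)^{-1}
print(theory_cov)
```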
48:29 I just wrote it. 48:30 That's the density of a multivariate Gaussian 48:33 with mean x beta and covariance matrix sigma squared 48:36 times the identity. 48:37 That's what it is. 48:40 So you don't have to learn this by heart, 48:42 but if you are familiar with the case where p is equal to 1, 48:47 you can check that you recover what you're familiar with, 48:49 and this makes sense as an extension. 48:54 48:59 So now, I can actually write my log likelihood. 49:08 How many observations do I have of this vector y? 49:14 49:23 Do I have n observations of y? 49:25 49:30 I have just one, right? 49:33 Oh, sorry, I shouldn't have said p, this is n. 49:36 Everything is in dimension n. 49:38 So I can think of either having n independent observations 49:42 of each coordinate, or I can think 49:44 of having just one observation of the vector y. 49:47 So when I write my log likelihood, 49:50 it's just the log of the density at y. 49:54 50:09 And that's the vector y, which I can 50:13 write as minus n/2 log sigma squared 50:18 2 pi minus 1 over 2 sigma squared norm of y minus x beta squared. 50:28 And that's, again, my boldface y. 50:30 50:36 And what is my maximum likelihood estimator? 50:39 50:44 Well, this guy does not depend on beta. 50:47 And this is just a constant factor in front of this guy. 50:50 So it's the same thing as just minimizing, 50:54 because I have a minus sign, over all beta in R^p. 50:57 51:03 y minus x beta squared, and that's my least squares 51:05 estimator. 51:06 51:15 Is there anything that's unclear on this board? 51:17 Any question? 51:17 51:20 So all I used was-- so I wrote my log likelihood, which 51:23 is just the log of this expression 51:25 where y is my observation. 51:28 And that's indeed the observation that I have here. 51:32 And that was just some constant minus some constant times 51:35 this quantity that depends on beta. 51:37 So maximizing this whole thing is the same thing 51:40 as minimizing only this thing. 51:42 The minimizers are the same. 51:44 And so that tells me that I actually just 51:47 have to minimize the squared norm 51:49 to get my maximum likelihood estimator. 51:51 But this used, heavily, the fact that I could actually 51:55 write exactly what my density was, 52:03 and that when I took the log of this thing, 52:06 I had exactly the square norm that showed up. 52:09 If I had a different density, if, for example, 52:12 I assumed that my coordinates of epsilons were, say, iid 52:17 double exponential random variables. 52:18 So it's just half of an exponential on the plus side, 52:21 and half of an exponential on the negatives. 52:24 So if I said that, then this would not 52:27 have the square norm that shows up. 52:28 This is really idiosyncratic to Gaussians. 52:31 If I had something else, I would have, 52:32 maybe, a different norm here, or something different 52:35 that measures the difference between y and x beta. 52:39 And that's how you come up with other maximum likelihood 52:41 estimators that lead to other estimators that 52:44 are not the least squares-- 52:45 maybe the least absolute deviation, 52:47 for example, or this fourth moment, 52:50 for example, that you suggested last time. 52:52 So I can come up with a bunch of different things, 52:55 and they might be tied-- 52:56 maybe I can come to them from the same perspective 52:59 that I came to the least squares estimator. 53:01 I said, let's just do something smart 53:03 and check, then, that it's indeed the maximum likelihood 53:06 estimator. 
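As a sanity check on the equivalence just argued, that the least squares estimator is also the Gaussian maximum likelihood estimator, one can minimize the negative log-likelihood numerically and compare with the closed form. A sketch on made-up data; numpy and scipy.optimize are assumed.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

n, p, sigma = 40, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -1.0]) + sigma * rng.normal(size=n)

def neg_log_lik(beta):
    # Negative of the log-likelihood from the board (sigma treated as known):
    # n/2 log(2 pi sigma^2) + ||y - X beta||^2 / (2 sigma^2)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + resid @ resid / (2 * sigma ** 2)

mle = minimize(neg_log_lik, x0=np.zeros(p)).x          # numerical maximum likelihood
lse = np.linalg.solve(X.T @ X, X.T @ y)                # closed-form least squares

print(np.allclose(mle, lse, atol=1e-4))  # True: same minimizer
```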
53:08 Or I could just start with the modeling on-- 53:11 and check, then, what happens-- 53:13 what was the implicit assumption that I put on my noise. 53:15 Or I could start with the assumption of the noise, 53:18 compute the maximum likelihood estimator 53:19 and see what it turns into. 53:21 53:24 So that was the first thing. 53:26 I've just proved to you the first line. 53:29 And from there, you can get what you want. 53:31 So all the other lines are going to follow. 53:34 So what is beta hat-- so for example, let's look 53:39 at the second line, the quadratic risk. 53:41 53:46 Beta hat minus beta, from this formula, 53:49 has a distribution, which is n n0, 53:53 and then I have x transpose x inverse. 53:58 AUDIENCE: Wouldn't the dimension be p on the board? 54:03 54:07 PHILIPPE RIGOLLET: Sorry, the dimension of what? 54:10 AUDIENCE: Oh beta hat minus beta. 54:11 Isn't beta only a p dimensional? 54:13 PHILIPPE RIGOLLET: Oh, yeah, you're right, you're right. 54:15 That was all p dimensional there. 54:17 54:22 Yeah. 54:23 So if b here, the matrix that I'm actually applying, 54:28 has dimension p times n-- 54:30 so even if epsilon was an n dimensional Gaussian vector, 54:34 then b times epsilon is a p dimensional Gaussian vector 54:39 now. 54:39 So that's how I switch from p to n-- 54:42 from n to p. 54:43 Thank you. 54:45 So you're right, this is beta hat minus beta is this guy. 54:50 And so in particular, if I look at the expectation 54:54 of the norm of beta hat minus beta squared, what is it? 55:01 It's the expectation of the norm of some Gaussian vector. 55:08 55:12 And so it turns out-- so maybe we don't have-- 55:15 well, that's just also a property of a Gaussian vector. 55:18 So if epsilon is n0 sigma, then the expectation 55:26 of the norm of epsilon squared is just the trace of sigma. 55:34 55:37 Actually, we can probably check this 55:41 by saying that this is the sum from j equal 1 55:44 to p of the expectation of beta hat j minus beta j squared. 55:51 55:54 Since beta j squared is the expectation-- beta j 55:57 is the expectation of beta hat. 55:59 This is actually equal to the sum from j equal 1 56:01 to p of the variance of beta hat j, 56:08 just because this is the expectation of beta hat. 56:11 And how do I read the variances in a covariance matrix? 56:15 There are just the diagonal elements. 56:17 So that's really just sigma jj. 56:25 And so that's really equal to-- 56:27 so that's the sum of the diagonal elements 56:29 of this matrix. 56:30 Let's call it sigma. 56:33 So that's equal to the trace of x transpose x inverse. 56:40 56:42 The trace is the sum of the diagonal elements of a matrix. 56:45 56:48 And I still had something else. 56:49 I'm sorry, this was sigma squared. 56:52 I forget it all the time. 56:54 So the sigma squared comes out. 56:56 It's there. 56:58 And so the sigma squared comes out 57:01 because the trace is a linear operator. 57:02 If I multiply all the entries of my matrix by the same number, 57:06 then all the diagonal elements are multiplied 57:08 by the same number, so when I sum them, 57:09 the sum is multiplied by the same number. 57:13 So that's for the quadratic risk of beta hat. 57:18 And now I need to tell you about x beta hat. 57:21 x beta hat was something that was actually telling me 57:27 that that was the point that I reported on the red line 57:30 that I estimated. 57:31 That was my x beta hat. 57:32 That was my y minus the noise. 57:40 Now, this thing here-- 57:42 so remember, we had this line, and I had my observation. 
57:47 And here, I'm really trying to measure this distance squared. 57:51 This distance is actually quite important for me 57:53 because it actually shows up in the Pythagoras theorem. 57:58 And so you could actually try to estimate this thing. 58:02 So what is the prediction error? 58:03 58:12 So we said we have y minus x beta hat, so that's 58:18 the norm of this thing we're trying to compute. 58:21 But let's write this for what it is for one second. 58:25 So we said that beta hat was x transpose 58:27 x inverse x transpose y, and we know that y is 58:31 x beta plus epsilon. 58:35 So let's write this-- 58:37 58:40 x beta plus epsilon plus x. 58:43 58:57 And actually, maybe I should not write it. 59:00 Let me keep the y for what it is now. 59:02 59:07 So that means that I have, essentially, 59:08 the identity of rn times y minus this matrix times y. 59:13 So I can factor y out, and that's 59:15 the identity of rn minus x x transpose 59:20 x inverse x transpose, the whole thing times y. 59:27 59:32 We call this matrix p because this was the projection matrix 59:37 onto the linear span of the x's. 59:41 So that means that if I take a point x and I apply p times x, 59:46 I'm projecting onto the linear span of the columns of x. 59:50 What happens if I do i minus p times x? 59:57 It's x minus px. 59:59 60:01 So if I look at the point on which-- 60:04 so this is the point on which I project. 60:07 This is x. 60:08 I project orthogonally to get p times x. 60:13 And so what it means is that this operator i 60:15 minus p is actually giving me this guy, this vector here-- 60:21 x minus p times x. 60:23 60:30 Let's say this is 0. 60:33 This means that this vector, I can put it here. 60:36 It's this vector here. 60:38 And that's actually the orthogonal projection 60:40 of x onto the orthogonal complement of the span 60:43 of the columns of x. 60:45 So if I project x, or if I look at x minus its projection, 60:51 I'm basically projecting onto two orthogonal spaces. 60:55 What I'm trying to say here is that this here 60:59 is another projection matrix p prime. 61:01 61:04 That is just the projection matrix onto the orthogonal-- 61:10 projection onto orthogonal of column span of x. 61:29 Orthogonal means the set of vectors 61:31 that's orthogonal to everyone in this linear space. 61:34 61:37 So now, when I'm doing this, this is exactly what-- 61:40 I mean, in a way, this is illustrating this Pythagoras 61:42 theorem. 61:43 And so when I want to compute the norm of this guy, the norm 61:47 squared of this guy, I'm really computing-- 61:49 if this is my y now, this is px of y, 61:52 I'm really controlling the norm squared of this thing. 61:55 62:06 So if I want to compute the norm squared-- 62:08 62:42 so I'm almost there. 62:48 So what am I projecting here onto the orthogonal projector? 62:52 So here, y, now, I know that y is 62:55 equal to x beta plus epsilon. 63:00 So when I look at this matrix p prime times y, 63:06 it's actually p prime times x beta plus p prime times 63:11 epsilon. 63:11 63:14 What's happening to p prime times x beta? 63:18 Let's look at this picture. 63:19 63:23 So we know that p prime takes any point here and projects it 63:26 orthogonally on this guy. 63:29 But x beta is actually a point that lives here. 63:33 It's something that's on the linear span. 63:36 So where do all the points that are on this line 63:39 get projected to? 63:43 AUDIENCE: The origin. 63:43 PHILIPPE RIGOLLET: The origin, to 0. 63:45 They all get projected to 0. 
63:47 And that's because I'm basically projecting 63:50 something that's on the column span of x onto its orthogonal. 63:54 So that's always 0 that I'm getting here. 63:56 64:02 So when I apply p prime to y, I'm 64:04 really just applying p prime to epsilon. 64:08 So I know that now, this, actually, 64:10 is equal to the norm of some multivariate Gaussian. 64:18 What is the size of this Gaussian? 64:20 64:22 What is the size of this matrix? 64:24 Well, I actually had it there. 64:25 It's i n, so it's n dimensional. 64:28 So it's some n dimensional Gaussian with mean 0. 64:31 And what is the covariance matrix 64:32 of p prime times epsilon? 64:34 64:39 AUDIENCE: p p transpose. 64:40 PHILIPPE RIGOLLET: Yeah, p prime p prime transpose, 64:43 which we just said p prime transpose is p prime, 64:48 so that's p prime squared. 64:49 And we see that when we project twice, 64:51 it's as if we projected only once. 64:54 So here, this is n0 p prime p prime transpose. 65:00 That's the formula for the covariance matrix. 65:05 But this guy is actually equal to p prime times p prime, 65:09 which is equal to p prime. 65:13 So now, what I'm looking for is the norm squared, and its expectation is the trace. 65:18 So that means that this whole thing here 65:20 is actually equal to the trace. 65:22 Oh, did I forget again a sigma squared? 65:24 Yeah, I forgot it only here, which is good news. 65:28 So I should assume that sigma squared is equal to 1. 65:32 So sigma squared's here. 65:34 And then what I'm left with is sigma squared 65:36 times the trace of p prime. 65:39 65:45 At some point, I mentioned that the eigenvalues of a projection 65:51 matrix were actually 0 or 1. 65:54 The trace is the sum of the eigenvalues. 65:56 So that means that the trace is going 65:58 to be an integer number, the number of non-0 eigenvalues. 66:03 And the non-0 eigenvalues are just 66:05 the dimension of the space onto which I'm projecting. 66:07 66:10 Now, I'm projecting from something of dimension n 66:15 onto the orthogonal of a space of dimension p. 66:19 What is the dimension of the orthogonal 66:21 of a space of dimension p when thought 66:23 of inside a space of dimension n? 66:26 AUDIENCE: [? 1. ?] 66:27 PHILIPPE RIGOLLET: N minus p-- 66:28 that's the so-called rank theorem, I guess, as a name. 66:32 And so that's how I get this n minus p here. 66:35 This is really just equal to n minus p. 66:40 Yeah? 66:40 AUDIENCE: Here, we're taking the expectation of the whole thing. 66:43 PHILIPPE RIGOLLET: Yes, you're right. 66:44 So that's actually the expectation 66:48 of this thing that's equal to that. 66:50 Absolutely. 66:53 But I actually have something much better. 66:55 I know, even, that the norm that I'm looking at, 66:57 I know it's going to be this thing. 66:58 What is going to be the distribution of this guy? 67:00 67:03 Norm squared of a Gaussian, chi squared. 67:06 So there's going to be some chi squared that shows up. 67:09 And the number of degrees of freedom 67:10 is actually going to be also n minus p. 67:12 And maybe it's actually somewhere-- 67:16 yeah, right here-- n minus p times sigma hat 67:20 squared over sigma squared. 67:22 This is my sigma hat squared. 67:24 If I multiply by n minus p, I'm left only with this thing, 67:28 and so that means that I get sigma squared times-- 67:31 because I always forget my sigma squared-- 67:33 I get sigma squared times this thing. 67:34 And it turns out that the square norm of this guy 67:37 is actually exactly chi squared with n minus p 67:39 degrees of freedom. 
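A simulation sketch of the claim just stated: with Gaussian noise, ||y - X beta_hat||^2 / sigma^2 behaves like a chi squared with n - p degrees of freedom, so dividing the residual sum of squares by n - p gives an unbiased estimator of sigma^2. Made-up data; numpy assumed.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, sigma = 30, 4, 1.3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
H = X @ np.linalg.inv(X.T @ X) @ X.T     # projection onto the column span of X

reps = 20000
rss = np.empty(reps)
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    resid = y - H @ y                    # residuals = (I - H) y
    rss[r] = resid @ resid

scaled = rss / sigma ** 2
print(scaled.mean(), n - p)              # mean of a chi^2_{n-p} is n - p
print(scaled.var(), 2 * (n - p))         # its variance is 2(n - p)

sigma2_hat = rss / (n - p)               # the unbiased estimator from the slide
print(sigma2_hat.mean(), sigma ** 2)     # close to the true sigma^2
```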
67:40 67:43 So in particular, so we know that the expectation 67:47 of this thing is equal to sigma squared times n minus p. 67:50 So if I divide both sides by n minus p, 67:53 I'm going to have something whose expectation is 67:55 sigma squared. 67:57 And this something, I can actually compute. 67:59 It depends on y, and x that I know, 68:02 and beta hat that I've just estimated. 68:04 I know what n is. 68:05 And p and n are the dimensions of my matrix x. 68:07 So I'm actually given an estimator whose expectation 68:11 is sigma squared. 68:13 And so now, I actually have an unbiased estimator 68:15 of sigma squared. 68:17 That's this guy right here. 68:19 And it's actually super useful. 68:20 68:23 So those are called the-- 68:25 this is the normalized sum of square residuals. 68:27 These are called the residuals. 68:29 Those are whatever is residual when 68:32 I project my points onto the line that I've estimated. 68:36 And so in a way, those guys-- if you go back to this picture, 68:40 this was yi and this was xi transpose beta hat. 68:47 So if beta hat is close to beta, the difference 68:49 between yi and xi transpose beta hat should 68:52 be close to my epsilon i. 68:55 It's some sort of epsilon i hat. 68:57 69:00 Agreed? 69:02 And so that means that if I think 69:04 of those as being epsilon i hat, they 69:07 should be close to epsilon i, and so their norm 69:09 should be giving me something that looks like sigma squared. 69:14 And so that's why it actually makes sense. 69:16 It's just magical that everything works out together, 69:18 because I'm not projecting on the right line, 69:21 I'm actually projecting on the wrong line. 69:23 But in the end, things actually work out pretty well. 69:27 There's one thing-- so here, the theorem 69:28 is that this thing not only has the right expectation, 69:31 but also has a chi squared distribution. 69:33 That's what we just discussed. 69:34 So here, I'm just telling you this. 69:36 But it's not too hard to believe, 69:37 because it's actually the norm of some vector. 69:40 You could make this obvious, but again, I 69:42 didn't want to bring in too much linear algebra. 69:44 So to prove this, you actually have 69:46 to diagonalize the matrix p. 69:48 So you have to invoke the eigenvalue decomposition 69:53 and the fact that the norm is invariant by rotation. 69:56 So for those who are familiar with this, what I can do 69:59 is just look at the decomposition of p 70:01 prime into u d u transpose, where this is an orthogonal matrix, 70:08 and this is a diagonal matrix of eigenvalues. 70:10 And when I look at the norm squared of this thing, 70:13 I mean, I have, basically, the norm 70:14 squared of p prime times some epsilon. 70:20 It's the norm of u d u transpose epsilon squared. 70:26 The norm of a rotation of a vector 70:28 is the same as the norm of the vector, so this guy goes away. 70:32 This is not actually-- 70:34 I mean, you don't have to care about this if you 70:36 don't understand what I'm saying, so don't freak out. 70:37 This is really for those who follow. 70:39 What is the distribution of u transpose epsilon? 70:42 70:45 I take a Gaussian vector that has covariance matrix sigma 70:50 squared times the [? identity, ?] and I basically 70:52 rotate it. 70:54 What is its distribution? 70:57 Yeah? 70:58 AUDIENCE: The same. 70:59 PHILIPPE RIGOLLET: It's the same. 71:00 It's completely invariant, because the Gaussian 71:02 thinks of all directions as being the same. 71:04 So it doesn't really matter if I take a Gaussian or a rotated 71:07 Gaussian. 
71:08 So this is also a Gaussian, so I'm 71:10 going to call it epsilon prime. 71:11 And I am left with just the norm of epsilon prime. 71:15 So this is the sum of the dj's squared times epsilon 71:23 prime j squared. 71:24 71:27 And we just said that the eigenvalues of p 71:30 are either 0 or 1, because it's a projector. 71:33 And so here, I'm going to get only 0's and 1's. 71:36 So I'm really just summing a certain number 71:39 of epsilon prime i squared. 71:42 So squares of standard Gaussians-- 71:45 sorry, with a sigma squared somewhere. 71:48 And basically, how many am I summing? 71:50 Well, n minus p, the number of non-0 eigenvalues 71:55 of p prime. 71:57 So that's how it shows up. 72:00 When you see this, what theorem am I using here? 72:05 Cochran's theorem. 72:06 This is this magic book. 72:07 I'm actually going to dump everything that I'm not going 72:09 to prove to you and say, oh, this is actually Cochran's. 72:11 No, Cochran's theorem is really just 72:12 telling me something about orthogonality of things, 72:15 and therefore, independence of things. 72:17 And Cochran's theorem was something 72:19 that I used when I wanted to use what? 72:23 That's something I used just one slide before. 72:27 Student t-test, right? 72:28 I used Cochran's theorem to see that the numerator 72:30 and the denominator of the student statistic 72:33 were independent of each other. 72:35 And this is exactly what I'm going to do here. 72:37 72:40 I'm going to actually write a test to test, maybe, 72:42 if the beta j's are equal to 0. 72:44 I'm going to form a numerator, which is beta hat minus beta. 72:49 This is normal. 72:50 And we know that beta hat has a Gaussian distribution. 72:53 I'm going to standardize by something 72:54 that makes sense to me. 72:55 And I'm not going to go into details, 72:56 because we're out of time. 72:58 But there's the sigma hat that shows up. 72:59 And then there's a gamma j, which takes into account 73:03 the fact that my x's-- 73:06 if I look at the distribution of beta hat, which is gone, I think-- 73:12 yeah, beta is gone. 73:14 Oh, yeah, that's where it is. 73:16 The covariance matrix depends on this matrix x transpose x. 73:20 So this will show up in the variance. 73:22 In particular, diagonal elements are going to play a role here. 73:25 And so that's what my gammas are. 73:26 The gamma j is the j-th diagonal element of this matrix. 73:30 So we'll resume that on Tuesday, so 73:35 don't worry too much if this is going too fast. 73:38 I'm not supposed to cover it, but just so you 73:40 get a hint of why Cochran's theorem actually was useful. 73:45 So I don't know if we actually ended up recording. 73:51 I have your homework. 73:53 And as usual, I will give it to you outside. 73:56
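As a hedged preview of the test sketched at the very end (to be developed in the next lecture): under the Gaussian assumptions above, (beta_hat_j - beta_j) / (sigma_hat sqrt(gamma_j)), with gamma_j the j-th diagonal entry of (X^T X)^{-1}, follows a Student t distribution with n - p degrees of freedom, and that is what a test of beta_j = 0 would use. A sketch on made-up data; numpy and scipy.stats are assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 0.0, 1.5])         # the second covariate has no effect
y = X @ beta + 0.8 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)     # unbiased estimate of sigma^2

# t statistic for H0: beta_j = 0, one coordinate at a time.
for j in range(p):
    gamma_j = XtX_inv[j, j]              # j-th diagonal element of (X^T X)^{-1}
    t_j = beta_hat[j] / np.sqrt(sigma2_hat * gamma_j)
    p_val = 2 * stats.t.sf(abs(t_j), df=n - p)
    print(j, round(t_j, 2), round(p_val, 3))   # coordinate 1 should look insignificant
```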