https://www.youtube.com/watch?v=V4xOdtqic3o&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=14

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So yes, before we start, this chapter will not be part of the midterm. Everything else will be, all the way up to goodness of fit tests. There will be some practice exams posted in the recitation section of the course that you will be working on, and the recitation tomorrow will be a review session for the midterm. I'll send an announcement by email.

So going back to our estimator: we studied the least squares estimator in the case where we had Gaussian observations. We had something that looked like this: y was equal to some matrix X times beta plus some epsilon. This was an equation in R^n, for n observations. And then we wrote the least squares estimator, beta hat. From here on, you see that you have this normal distribution, this p-variate Gaussian distribution. That means that, at some point, we made the assumption that epsilon was N_n(0, sigma^2 I_n), with the sigma squared being the factor I kept forgetting last time. I will try not to do that this time.

From this we derived a bunch of properties of the least squares estimator, beta hat. In particular, the key fact that everything was built on was that we could write beta hat as the true unknown beta plus some multivariate Gaussian that was centered but had a particular covariance structure: it was p-dimensional, with covariance sigma squared times (X^T X)^{-1}. And the way we derived that was by having at least one cancellation between X^T X and (X^T X)^{-1}. So this is the basis for inference in linear regression.

In a way, that's correct, because what happened is that we used the fact that, once we have this beta hat, X beta hat is really just the projection of y onto the linear span of the columns of X, the column span of X. And in particular, the quantities y minus X beta hat are called residuals. So that's the vector of residuals. What's the dimension of this vector?

AUDIENCE: n by 1.

PHILIPPE RIGOLLET: n by 1. So those things we can write as epsilon hat. It's an estimate for epsilon, because we just put a hat on beta. And from this one we could actually build an unbiased estimator of sigma squared, sigma hat squared, and that was this guy. And we showed that the right normalization for this was indeed n minus p, because the squared norm of y minus X beta hat, up to the scaling by sigma squared, is actually a chi squared with n minus p degrees of freedom. So that's what we came up with.
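To make the objects above concrete, here is a minimal numpy sketch of the setup: a simulated design (the data and dimensions are illustrative assumptions, not anything from the lecture), the least squares estimator, and the unbiased variance estimator with the n - p normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
sigma = 2.0

# Simulated deterministic design (first column all ones) and simulated response
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.0, 0.5])
y = X @ beta_true + sigma * rng.normal(size=n)

# Least squares estimator: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residuals and the unbiased variance estimator sigma_hat^2 = ||y - X beta_hat||^2 / (n - p)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)

print(beta_hat, sigma2_hat)
```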
PHILIPPE RIGOLLET: And something I told you, which follows from Cochran's theorem--we did not go into the details about this. But essentially, one of them corresponds to the projection onto the linear span of the columns of X, and the other one corresponds to the projection onto the orthogonal complement of this span, and we're in a Gaussian case: things that are orthogonal are actually independent in the Gaussian case. So from a geometric point of view, you can sort of understand everything. You think of the subspace spanned by the columns of X: sometimes you project onto this subspace, sometimes you project onto its orthogonal complement. Beta hat corresponds to the projection onto the linear span; the epsilon hats correspond to the projection onto the orthogonal complement. And those things are independent, and that's why beta hat is independent of sigma hat squared. So it's really just a statement about two linear spaces being orthogonal to each other.

So we left off on this slide last time. That's good for beta hat. But since we don't know what sigma squared is--if we knew sigma squared, that would totally be enough for us--we also need this extra thing: that (n minus p) times sigma hat squared over sigma squared follows a chi squared with n minus p degrees of freedom, and that sigma hat squared is independent of beta hat. So that's going to be something we need. It's useful if sigma squared is unknown. And again, sometimes it might be known, if you're using some sort of measurement device for which it's written on the side of the box.

So from these two things we're going to be able to do inference. And we said there are three pillars to inference. The first one is estimation, and we've been doing that so far: we've constructed this least squares estimator, which happens to be the maximum likelihood estimator in the Gaussian case. The two other things we do in inference are confidence intervals--and we can do confidence intervals, but we're not going to do much with them because we're going to talk about their cousin, which is tests. And that's really where the statistical inference comes in. Here, we're going to be interested in a very specific kind of test for linear regression. Those are tests of the form: the j-th coefficient of beta is equal to 0--and that's going to be our null hypothesis--versus H1, where beta j is, say, not equal to 0. For the purpose of regression, unless you have lots of domain-specific knowledge, it won't be beta j positive or beta j negative. It's really non-zero that's interesting to you.

So why would I want to do this test? Well, if I expand this equation y = X beta + epsilon and look, for example, at the i-th coordinate, I get that y_i is equal to beta_0 plus beta_1 x_{i,1} plus ... plus beta_{p-1} x_{i,p-1} plus epsilon_i.
PHILIPPE RIGOLLET: And that's true for all i's. The first term is the intercept times 1--that was our first coordinate. So that's just expanding this, going back to the scalar form rather than the matrix-vector form. When I write y = X beta + epsilon, I assume that each of my y's can be represented as a linear combination of the x's, the first one being 1, plus some epsilon_i. Everybody agrees with this? What does it mean for beta_j to be equal to 0?

AUDIENCE: That x_j is not important.

PHILIPPE RIGOLLET: Yeah, that x_j doesn't even show up in this thing. So if beta_j is equal to 0, that means that, essentially, we can remove the j-th coordinate, x_j, from all observations.

So for example, I'm a banker, and I'm trying to predict some score--let's call it y--without the noise. I'm trying to predict what your score is going to be, something that should tell me how likely you are to reimburse your loan on time, or whether you have late payments. Or actually, maybe these days bankers are looking at how much in late fees they will be collecting from you; maybe that's what they're after rather than making sure you reimburse everything. So they're trying to maximize this number of late fees. And they collect a bunch of things about you--definitely your credit score, but maybe your zip code, profession, years of education, family status, a bunch of things. One might be your shoe size. And they want to know--maybe shoe size is actually a good explanation for how much in fees they're going to collect from you. But as you can imagine, this would be a controversial thing to bring in, and people might want to test whether including shoe size is a good idea. So they would just look at the j corresponding to shoe size and test whether shoe size should appear or not in this formula. That's essentially the kind of thing people are going to do.

Now, if I do genomics and I'm trying to predict the size, the girth, of a pumpkin for a competition based on some available genomic data, then I can test whether gene j, which is called--I don't know--pea snap 24--they always have these crazy names--appears or not in this formula. Is the gene pea snap 24 going to be important or not for the size of the final pumpkin? So those are definitely the important questions. And we definitely want to put beta_j not equal to 0 as the alternative, because that's where scientific discovery shows up.

And to do that, well, we're in a Gaussian setup, so we know that even if we don't know what sigma is, we can actually call for a t-test. So how did we build the t-test before? Well, what we had was something that looked like: theta hat was equal to theta plus some N(0, something that depended on n)--something like sigma squared over n. That's what it looked like. Now what we have is that beta hat is equal to beta plus some N, but this time it's p-variate: N_p(0, sigma squared (X^T X)^{-1}).
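A quick Monte Carlo sanity check of this last statement, under a simulated deterministic design (all numbers here are illustrative assumptions): the empirical covariance of beta hat across repeated draws of the noise should be close to sigma^2 (X^T X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 100, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

# Draw many data sets y = X beta + eps with the SAME fixed design X, and record beta_hat.
draws = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    draws.append(XtX_inv @ X.T @ y)
draws = np.asarray(draws)

# Empirical covariance of beta_hat versus the theoretical sigma^2 (X^T X)^{-1}
print(np.cov(draws.T))
print(sigma ** 2 * XtX_inv)
```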
PHILIPPE RIGOLLET: So the new statement is actually very similar to the old one, except that the matrix (X^T X)^{-1} is now replacing the number 1/n, but it's playing the same role. In particular, this implies something for every j from 1 to p: what is the distribution of beta hat j? Well, this is a system of p equations, and all I have to do is read off the j-th row. It tells me: here I'm going to read beta hat j, here I'm going to read beta j, and here I need to read off the distribution of the j-th coordinate of this Gaussian vector.

So how do I do this? Well, the observation that's useful here--maybe I shouldn't use the word observation in a stats class, so let's call it a claim. The claim is that if I have a vector v, then v_j is equal to v transpose e_j, where e_j is the vector with 0's everywhere and a 1 in the j-th coordinate. That's the j-th vector of the canonical basis of R^p. So now that I have this form, I can see that beta hat j is just e_j transpose times this N_p(0, sigma squared (X^T X)^{-1}).

And I know what the distribution of the inner product between a Gaussian vector and a deterministic vector is. What is it? It's a Gaussian. So all I have to check is what e_j transpose N_p(0, sigma squared (X^T X)^{-1}) is equal to in distribution. This is going to be one-dimensional--an inner product is just a real number--so it's going to be some Gaussian. The mean is going to be 0: the inner product of e_j with 0 is 0. What is the variance of this guy? We actually used this rule before, except that there e_j was not a vector but a matrix. The rule is that v transpose N(mu, Sigma) is some N(v transpose mu, v transpose Sigma v). That's the rule for Gaussian vectors; it's just a property of Gaussian vectors.

So what do we have here? Well, e_j plays the role of v, and sigma squared (X^T X)^{-1} plays the role of Sigma. So, pulling out the sigma squared, I'm left with e_j transpose (X^T X)^{-1} e_j. What happens if I take a matrix, premultiply it by this vector e_j transposed, and postmultiply it by e_j? I'm claiming this corresponds to only one single element of the matrix. Which one is it?

AUDIENCE: The j-th.

PHILIPPE RIGOLLET: The j-th diagonal element. So this thing here is nothing but the j-th diagonal element of (X^T X)^{-1}, which I'll write as [(X^T X)^{-1}]_{jj}. Now, I cannot go any further. (X^T X)^{-1} can be a complicated matrix, and I do not know how to express its j-th diagonal element much better than this--it involves basically all the coefficients.
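A one-line numerical illustration of the claim, reusing the simulated X from the first sketch (an assumption carried over, not lecture data): pre- and postmultiplying by e_j picks out the j-th diagonal entry of (X^T X)^{-1}, which is the gamma_j used below.

```python
import numpy as np

# X is assumed to be in scope from the first sketch.
M = np.linalg.inv(X.T @ X)

j = 2                        # pick a coordinate (0-indexed here)
e_j = np.zeros(M.shape[0])
e_j[j] = 1.0

gamma_j = e_j @ M @ e_j              # e_j^T (X^T X)^{-1} e_j
assert np.isclose(gamma_j, M[j, j])  # it is exactly the j-th diagonal element
```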
Yeah?

AUDIENCE: [INAUDIBLE] where does the second e_j come from? I get why e_j transpose [INAUDIBLE]. Where did the--

PHILIPPE RIGOLLET: From this rule?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So you always pre- and postmultiply when you compute the covariance, because if you did not, it would be a vector and not a scalar, for one. But in general, think of v as possibly a matrix; the rule is still true when v is a matrix that is compatible for multiplication with the Gaussian. Any other question? Yeah?

AUDIENCE: When you say "claim, a vector v," what is the vector v?

PHILIPPE RIGOLLET: It's for any vector v.

AUDIENCE: OK.

PHILIPPE RIGOLLET: Any other question? So now we've identified that the j-th coefficient of this Gaussian, which I can represent from the claim as e_j transpose this guy, is also a Gaussian, that it's centered, and that its variance is sigma squared times the j-th diagonal element of (X^T X)^{-1}. So the conclusion is that beta hat j is equal to beta j plus some N--and I emphasize that it's now one-dimensional--with mean 0 and variance sigma squared [(X^T X)^{-1}]_{jj}.

Now, if you look at the last line of the second board and the first line of the first board, those are basically the same thing. Beta hat j is my theta hat, beta j is my theta, and the variance sigma squared over n is now sigma squared times this jj-th element. The inverse suggests that it behaves like the inverse of n, so we're going to want to think of this quantity as being some sort of 1/n kind of statement.

So the fact that those two things have the same form leads us to believe that we are now equipped to perform the test we're trying to do, because under the null hypothesis, beta j is known--it's equal to 0--so I can remove it. And then I have to deal with the sigma squared. If sigma squared is known, I can just perform a regular Gaussian test using Gaussian quantiles. And if sigma squared is unknown, I'm going to divide by sigma and multiply by sigma hat, and then I'm going to basically get my t-test.

Actually, for the purpose of your exam, I really suggest that you understand every single word I'm going to be saying now, because this is exactly the same thing you're expected to know from other courses; I'm just going to apply exactly the same technique that we used for single-parameter estimation. So what we have now is that under H0, beta j is equal to 0. Therefore, beta hat j follows N(0, sigma squared gamma_j), where, just like in the slide, gamma_j is this j-th diagonal element of (X^T X)^{-1}. That implies that beta hat j over sigma square root of gamma_j follows N(0, 1). So I can form my test, which is to reject if the absolute value of beta hat j divided by sigma square root of gamma_j is larger than what? Can somebody tell me what I want this to be larger than to reject?

AUDIENCE: q alpha.

PHILIPPE RIGOLLET: q alpha. Everybody agrees? Of what? Of this guy, the standard Gaussian, where the standard notation is that this is the quantile. Everybody agrees?
AUDIENCE: It's alpha over 2, I think.

PHILIPPE RIGOLLET: Alpha over 2. So not everybody should be agreeing. Thank you--you're the first one to disagree with yourself, which is probably good. It's alpha over 2 because of the absolute value. I want to be away from this guy on either side, and the sanity check should be that H1 is beta j not equal to 0.

So that works if sigma is known, because I need to know sigma to be able to build my test. If sigma is unknown, I can tell you, use this test, but you're going to say, OK, when I have to plug in some numbers, I'm going to be stuck. But if sigma is unknown, we have sigma hat squared as an estimator. So in particular, beta hat j divided by sigma hat times square root of gamma_j is something I can compute. Agreed? Now I need to work out the distribution of this thing.

So I know the distribution of beta hat j over sigma square root of gamma_j: that's some Gaussian N(0, 1). I don't know directly what the distribution with sigma hat in it is, but what I know--it was actually written, maybe, here--is that (n minus p) sigma hat squared over sigma squared follows a chi squared with n minus p degrees of freedom, and that it's independent of beta hat. It's independent of beta hat, so it's independent of each of its coordinates. That was part of your homework--some of you were confused by the fact that if you're independent of some big thing, you're independent of all the smaller components of this big thing. That's basically what you need to know.

And so now I can rewrite this: I want to make that chi squared appear, so I write it in terms of beta hat j, sigma hat squared over sigma squared times (n minus p), and the square root of gamma_j. Yeah?

AUDIENCE: Why do you have the hat in the denominator? Shouldn't it be sigma?

PHILIPPE RIGOLLET: Yeah, so I write this. I decide to write this. I could have put a Mickey Mouse here; it just wouldn't make sense. I just decided to take this thing.

AUDIENCE: OK.

PHILIPPE RIGOLLET: OK. So now I take this guy, and I'm going to rewrite it as something I want, because if you don't know what sigma is--sorry, you mean the square?

AUDIENCE: Yeah.

PHILIPPE RIGOLLET: Oh, thank you. Yes, that's correct. [LAUGHS] OK, so if you don't know what sigma is, you replace it by sigma hat. That's the most natural thing to do. You just now want to find out what the distribution of this guy is. So this is not exactly what I had. To be able to get this, I need to divide by--

AUDIENCE: Square root.

PHILIPPE RIGOLLET: I'm sorry?

AUDIENCE: Do we need a square root of the sigma hat [INAUDIBLE].
PHILIPPE RIGOLLET: That's correct now. And now I have--sorry, I should not write it like that; that's not what I want. What I want is this. And to be able to get this guy, what I need is sigma over sigma hat, under a square root. And then I need to make this chi squared show up, so I need to have this n minus p appear in the denominator. To get it, I multiply the entire thing by square root of n minus p over square root of n minus p. So this is just a tautology; I just squeezed in what I wanted. But now this whole thing is of the form: beta hat j divided by sigma square root of gamma_j, all divided by the square root of (sigma hat squared over sigma squared) times (n minus p), divided by (n minus p).

And what is the distribution of this thing? Well, this numerator--what is its distribution?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, N(0, 1). It's actually still written over there. What is the distribution of this other guy? This is still written on the board.

AUDIENCE: Chi squared.

PHILIPPE RIGOLLET: It's the chi squared that I have right here. So that's a chi squared with n minus p degrees of freedom, divided by n minus p. The only thing I need to check is that those two guys are independent, which is also what I have from here. And so that implies that beta hat j divided by sigma hat square root of gamma_j--what is the distribution of this guy?

[INTERPOSING VOICES]

PHILIPPE RIGOLLET: t with n minus p degrees of freedom. Was that crystal clear for everyone? Was it so simple that it was boring to everyone? OK, good. That's the point you should be at. So now that I have this, I can read off the quantiles of this distribution. So my rejection region: I reject if the absolute value of this new statistic exceeds the quantile of order alpha over 2--but this time of a t_{n-p}. And now you can actually see that the only difference between this test and that test, apart from replacing sigma by sigma hat, is that I've moved from the quantiles of a Gaussian to the quantiles of a t_{n-p}.

What's actually interesting, from this perspective, is that the t_{n-p}, we know, has heavier tails than the Gaussian, but once the number of degrees of freedom reaches, maybe, 30 or 40, they're virtually the same. And here the number of degrees of freedom is not given by n alone; it's n minus p. So if I have more and more parameters to estimate, this results in heavier and heavier tails, and that's just to account for the fact that it's harder and harder to estimate the variance when I have a lot of parameters. That's basically where it's coming from.

So now let's move on to--well, I don't know what, because this is not working anymore. So this is the simplest test.
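Here is a sketch of this single-coefficient t-test, reusing X, y, beta_hat and sigma2_hat from the first simulated sketch (again, assumptions for illustration); the quantile and p-value calls come from scipy.

```python
import numpy as np
from scipy import stats

# Assumed in scope from the first sketch: X, y, beta_hat, sigma2_hat.
n, p = X.shape
alpha = 0.05
j = 2                                    # coefficient under test (0-indexed)

gamma_j = np.linalg.inv(X.T @ X)[j, j]

# If sigma^2 were known: reject when |beta_hat_j / (sigma * sqrt(gamma_j))| > q_{alpha/2} of N(0,1).
# With sigma^2 unknown, plug in sigma_hat and use t_{n-p} quantiles instead:
t_stat = beta_hat[j] / np.sqrt(sigma2_hat * gamma_j)
q = stats.t.ppf(1 - alpha / 2, df=n - p)     # quantile of order alpha/2 of t_{n-p}
reject = abs(t_stat) > q

# Two-sided p-value, the number a software package would print next to beta_hat_j
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)
print(t_stat, q, reject, p_value)
```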
PHILIPPE RIGOLLET: And actually, if you run any statistical software for least squares, the output will look like this. You will have a sequence of rows, and you're going to have an estimate for beta 0, an estimate for beta 1, et cetera. On each row you're going to have the value here--that's what's estimated by least squares. Then the next entry is going to be the value of this statistic--let's call it t--and then there's going to be the p-value corresponding to this t. This is something that just routinely comes out. Oh, and then there's, of course, the last column, for people who cannot read numbers, that's really just giving you little stars. They're not stickers, but that's close to it. And that's just saying: if I have three stars, I'm very significantly different from 0; if I have two stars, I'm moderately significantly different from 0; and if I have one star, it means, well, just give me another $1,000 and I will sign that it's actually different from 0. So that's basically the kind of output. Everybody sees what I mean by that?

What I'm trying to emphasize here is that those things are completely routine when you run linear regression. People stuff in a lot of variables--even if you have 200 observations, you're going to stuff in maybe 20 variables, p equals 20. That's still a big number when you try to interpret what's going on, and it's nice if you can actually trim some fat out. And the problem is that when you start doing this test, and then this one, and then this one, and then this one, the probability that you make a mistake in each test--the probability that you erroneously reject the null here--is 5%. Here it's 5%. Here it's 5%. Here it's 5%. And at some point, if things happen with a 5% chance and you keep doing them over and over again, they're going to start to happen. So you can see that if you start repeating those tests, you might not be at a 5% error overall.

And what do you do to prevent that? If you want to test all those beta j's simultaneously, you have to do what's called the Bonferroni correction. The Bonferroni correction follows from what's called a union bound. If you're a computer scientist, you're very familiar with it. If you're a mathematician, that's essentially the third axiom of probability: the probability of a union is less than or equal to the sum of the probabilities. That's the union bound, and of course you can generalize it to more than 2 events. And that's exactly what we're doing here.

So let's see how we would perform the Bonferroni correction to control the probability of error when we test that they're all equal to 0 at the same time. So recall, I want to test H0: beta j is equal to 0 for all j in some subset S--think of S included in {1, ..., p}. You can take it to be all of {1, ..., p} if you want; it really doesn't matter.
PHILIPPE RIGOLLET: S is something that's given to you. Maybe you want to test a subset of them, maybe you want to test all of them. Versus H1: beta j is not equal to 0 for some j in S. That's a test that tests all of these things at once. And if you actually look at this table all at once, implicitly you're performing this test for all of the rows, for S equal to {1, ..., p}. You will do that--whether you like it or not, you will.

So now let's look at what the probability of type I error looks like. So I want the probability of type I error, that is, the probability computed when H0 is true. Let me call psi_j the indicator that the absolute value of beta hat j over sigma hat square root of gamma_j exceeds q_{alpha/2} of t_{n-p}. We know those are the tests I perform; I just add this extra index j to say that I'm testing the j-th coefficient. So what I want is the probability, under the null--so when those coefficients are all equal to 0--that I reject in favor of the alternative for at least one of them. So that's: psi_1 is equal to 1, or psi_2 is equal to 1, all the way to psi_p is equal to 1--let's just say S is the entire set, because the general case is annoying to write; you can check the slide if you want it done more generally. Everybody agrees that this is the probability of type I error? Either I reject this one, or this one, or this one, or this one; and that's exactly the event that I reject at least one of them. So this is the probability of type I error, and what I want is to keep this guy less than alpha.

But what I know how to control is the probability that this guy makes an error, that this guy makes an error, that this guy makes an error--individually, each less than alpha. In particular, if all these events were disjoint, then this probability would really be the sum of all the individual probabilities. So in the worst case--if the event psi_j equals 1 intersected with the event psi_k equals 1 is the empty set, so those are disjoint sets; you've seen this terminology in probability--if those sets are disjoint for all j different from k, then this probability, call it star, is equal to the probability under H0 that psi_1 is equal to 1, plus ..., plus the probability under H0 that psi_p is equal to 1. Now, if I run each test with this alpha, then each of these probabilities is equal to alpha. So the probability of type I error is actually not equal to alpha. It's equal to?

AUDIENCE: p alpha.

PHILIPPE RIGOLLET: p alpha. So what is the solution here? Well, it's to run those tests not with alpha, but with alpha over p. And if I do this, then this guy is equal to alpha over p, this guy is equal to alpha over p, and when I sum those things, I get p times alpha over p, which is just alpha. So all I do is, rather than running each of the tests with probability of error alpha, I run each test at level alpha over p. That's actually very stringent.
PHILIPPE RIGOLLET: If you think about it for one second: even if you have only 5 variables--p equals 5--and you wanted to do your tests at 5%, it forces you to do the test at 1% for each of those variables. If you have 10 variables, that starts to be very stringent. So it's going to be harder and harder for you to conclude in favor of the alternative.

Now, one thing I need to tell you is that here I said: if the events are disjoint, then those probabilities are equal. But if they are not disjoint, the union bound tells me that the probability of the union is less than the sum of the probabilities. And so then I'm not exactly equal to alpha, but I'm bounded by alpha. And that's why people are not super comfortable with the Bonferroni correction: in reality, you never think that those tests are going to give you completely disjoint events. I mean, why would it be that if this guy is equal to 1, then all the other ones are equal to 0? Why would that make any sense? So this is definitely conservative, but the problem is that we don't know how to do much better. There is a formula that gives you the probability of the union as some crazy sum over all the intersections--it's the generalization of P(A or B) = P(A) + P(B) minus the probability of the intersection--but if you start doing this for more than 2 events, it's super complicated. The number of terms grows really fast. And most importantly, even if you go that route, you still need to control the probabilities of the intersections, and those tests are not necessarily independent. If they were independent, that would be easy--the probability of the intersection would be the product of the probabilities--but those things are super correlated, so it doesn't really help.

And so we'll see, when we talk about high-dimensional statistics towards the end, that there's something called the false discovery rate, which is essentially saying: listen, if I really define my probability of type I error as this--I want to make sure I never make this kind of error--I'm doomed; it's just not going to happen. But I can revise what my goals are in terms of the errors I make, and then I will actually be able to do something. What people look at there is the false discovery rate. What we're controlling here is called the family-wise error rate, which is a stronger thing to control.

So this trick, which consists of replacing alpha by alpha over the number of times you're going to perform your test--alpha over the number of terms in your union--is called the Bonferroni correction. And that's something you use when you have what's called--another keyword here--multiple testing, when you're trying to do multiple tests simultaneously. And if S is not all of {1, ..., p}, you just divide by the number of tests you're actually making. So if S is of size k for some k less than p, you divide alpha by k and not by p, of course. I mean, you can always divide by p, but you'd be making your life harder for no reason.
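A minimal sketch of the Bonferroni-corrected version of the same per-coefficient tests, with the simulated quantities from the earlier sketches assumed in scope; the choice of S here (all coefficients) is just for illustration.

```python
import numpy as np
from scipy import stats

# Assumed in scope from the earlier sketches: X, y, beta_hat, sigma2_hat.
n, p = X.shape
alpha = 0.05
S = range(p)                                   # test every coefficient simultaneously

gamma = np.diag(np.linalg.inv(X.T @ X))
t_stats = beta_hat / np.sqrt(sigma2_hat * gamma)

# Each individual test is run at level alpha / |S|, so the family-wise error rate stays <= alpha.
alpha_corrected = alpha / len(S)
q = stats.t.ppf(1 - alpha_corrected / 2, df=n - p)
reject = {j: abs(t_stats[j]) > q for j in S}
print(reject)
```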
PHILIPPE RIGOLLET: Any question about the Bonferroni correction?

So one thing that is maybe not as obvious as the test of beta j equal to 0 versus beta j not equal to 0--and in particular, it's not going to come up as a software output without you requesting it, whereas the previous one is so standard that it just comes out--is that there are other tests you might think of that are more complicated and more tailored to your particular problem. Those tests are of the form: G times beta is equal to some lambda.

So let's see: the test we've just done, beta j equals 0 versus beta j not equal to 0, is actually equivalent to e_j transpose beta equals 0 versus e_j transpose beta not equal to 0. That was our claim. But now I don't have to stop here. I don't have to multiply by a vector and test whether it's equal to 0. I can replace this by some general matrix G, and replace the right-hand side by some general vector lambda. And I'm not telling you what the dimensions are, because they're general. Take your favorite matrix, as long as the right side of the matrix can multiply beta, and take lambda with as many entries as the number of rows of G, and then you can do that. I can always formulate this test.

What will this test encompass? Well, those can be kind of weird tests. You can think of things like: I want to test whether beta 2 plus beta 3 is equal to 0, for example. Maybe I want to test whether beta 5 minus 2 beta 6 is equal to 23. Well, that's weird. But why would you want to test whether beta 2 plus beta 3 is equal to 0? Maybe you already know that the effect of some gene is not 0--you know this gene affects this trait--but you want to know whether the effect of this gene is canceled by the effect of that gene. That's the kind of thing you would be testing with this. Now, the other one is much more artificial, and I don't have a bedtime story to tell you around it. So those things can happen and can be much more complicated.

Now, notice that the matrix G has one row in both of those examples. But if I want to test whether those two things happen at the same time, then I can take a matrix with both rows. Another matrix that can be useful is G equal to the identity of R^p and lambda equal to 0. What am I doing in this case? What is this test testing? Yeah?

AUDIENCE: Whether or not beta is 0.

PHILIPPE RIGOLLET: Yeah, we're testing whether the entire vector beta is equal to 0, because G times beta is equal to beta, and we're asking whether it's equal to 0. The thing is, when you test whether beta is equal to 0, you're testing whether your entire model--everything you're doing in life--is just junk. This is telling you: actually, forget about y is X beta plus epsilon; y is really just epsilon. There's nothing. There's just some big noise with some big variance, and there's nothing else. So it turns out that the statistical software output I wrote here spits out an answer to this question. The last line, usually, is doing this test.
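For concreteness, this is how the constraint pairs (G, lambda) from these examples might be written down; the coefficient indices and the value p = 8 are illustrative assumptions, with coefficients indexed from 0 here.

```python
import numpy as np

p = 8  # illustrative number of coefficients, indexed 0..p-1 here

# "beta_2 + beta_3 = 0" written as one row g with g @ beta = lambda, lambda = 0
g_sum = np.zeros(p)
g_sum[[2, 3]] = 1.0

# "beta_5 - 2 beta_6 = 23" as another single row, with lambda = 23
g_diff = np.zeros(p)
g_diff[5], g_diff[6] = 1.0, -2.0

# Testing both constraints at once: stack the rows into a matrix G, and the right sides into lambda
G_both = np.vstack([g_sum, g_diff])
lam_both = np.array([0.0, 23.0])

# Testing whether the whole model is "junk": G is the identity of R^p and lambda = 0
G_all = np.eye(p)
lam_all = np.zeros(p)
```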
PHILIPPE RIGOLLET: So that last test asks: does your model even make sense? And it's probably there for people to check whether they actually just mixed up their two data sets. Maybe they're trying to predict--I don't know--some credit score from genomic data, and they just want to make sure that's not the wrong thing.

So it turns out that the machinery is exactly the same as the one we've just gone through. We start from here: beta hat was equal to beta plus this Gaussian. And the first thing we did before was to say, well, beta hat j is equal to this thing, because beta j is just e_j transpose beta. So rather than taking e_j here, let me just take G. Now, we said that for any vector--well, that part was trivial. The thing we need to know is: what is this thing? Well, this thing here--what is this guy? It's also normal, and the mean is 0. Again, that's just using properties of Gaussian vectors. And what is the covariance matrix? Let's call that covariance Sigma, so that you can formulate an answer. What is the covariance of G times some Gaussian N(0, Sigma)?

AUDIENCE: G Sigma G transpose.

PHILIPPE RIGOLLET: G Sigma G transpose, right. So here that's G (X^T X)^{-1} G transpose, times sigma squared. Now, I'm not going to be able to go much further. Before, I made this very acute observation that e_j transpose a matrix times e_j is the j-th diagonal element. Now, if I have a general matrix G, the price to pay is that I cannot shrink this thing any further, because I'm trying to be abstract. And so I'm almost there. The only thing that happened last time is that when this was e_j, we knew that e_j transpose beta was equal to 0 under the null. But under the null, what is G beta equal to?

AUDIENCE: Lambda.

PHILIPPE RIGOLLET: Lambda, which I know. I mean, I wrote my hypothesis. In the couple of instances I just showed you, including the one on top, lambda was equal to 0, but in general it can be any lambda. What's key about this lambda is that I actually know it--that's the hypothesis I'm formulating. So now I have to be a little more careful when I build the distribution of G beta hat: I need to subtract this lambda. So we go from this and say: G beta hat minus lambda follows some N(0, sigma squared G (X^T X)^{-1} G transpose).

So that's true. Let's go straight to the case where we don't know what sigma is. So what I'm going to be interested in is G beta hat minus lambda divided by sigma hat, and that's going to follow some Gaussian that has this G (X^T X)^{-1} G transpose in it. So now, what did I do last time? So clearly, the quantiles of this distribution are--well, OK, what is the size of this distribution? I need to tell you what G is--what did I take here?

AUDIENCE: It should be divided by sigma, not sigma hat.

PHILIPPE RIGOLLET: Oh yeah, you're right. So let me write it like this--with sigma hat over sigma. So let's forget about the size of G for now.
PHILIPPE RIGOLLET: Let's just think of any general G. When G was a vector, what was nice is that this covariance was just a scalar, just one number--we called it gamma_j--and if I wanted to get rid of it on the right-hand side, all I had to do was divide by square root of gamma_j, and it would be gone. Now I have a matrix. So I need to get rid of this matrix somehow, because clearly the quantiles of this distribution are not going to be written in the back of a book for any value of G and any value of X. I need to standardize before I can read anything out of a table.

So how do we do it? Well, we form this quantity. So here's the claim--again, another claim about Gaussian vectors. If Z follows some N(0, Sigma), then Z transpose Sigma inverse Z follows a chi squared. And the number of degrees of freedom depends on the dimension here: if this is a k-dimensional Gaussian vector, this is a chi squared with k degrees of freedom. Where have we used that before? Yeah?

AUDIENCE: Wald's test.

PHILIPPE RIGOLLET: Wald's test, that's exactly where we used it. Wald's test had a chi squared that was showing up, and the way we made it show up was by taking the asymptotic variance and taking its inverse--which, in that framework, was called--

AUDIENCE: Fisher.

PHILIPPE RIGOLLET: Fisher information. And then we pre- and postmultiplied by this thing. So this is the key. And now it tells me exactly, when I start from this guy that has this multivariate Gaussian distribution, how to turn it into something that has a distribution which is pivotal. Chi squared with k degrees of freedom is completely pivotal; it does not depend on anything I don't know.

The way I go from here is by saying: now I look at (G beta hat minus lambda) transpose, times the inverse of the matrix over there--so that's [G (X^T X)^{-1} G transpose]^{-1}--times (G beta hat minus lambda), and here I need to divide by sigma squared. This guy follows a chi squared with k degrees of freedom, if G is k times p. What I mean here is that it's the same k: the k that shows up is the number of constraints that I have in my test.

So now, if I go from here to using sigma hat, the key thing to observe is that this guy is no longer a Gaussian, so I'm not going to have a Student's t-distribution showing up. What happens is that if I take the same thing and just go from sigma to sigma hat, then this thing is of the form: this chi squared with k degrees of freedom, divided by the chi squared that shows up in the denominator of the t-distribution, which is a chi squared with n minus p degrees of freedom divided by n minus p. So that's the same denominator that I saw in my t-test. The numerator has changed, though: the numerator is now this chi squared and no longer a Gaussian.
PHILIPPE RIGOLLET: But this distribution is actually pivotal, as long as we can guarantee that there's no hidden parameter in the correlation between the two chi squareds. So again, as with all statements of independence in this class, I will just give it to you for free. Those two things, I claim--OK, let's say admit--are independent.

We're almost there. This could be a distribution that's pivotal, but there's something a little unbalanced about it: this guy is divided by its number of degrees of freedom, but that guy is not divided by its number of degrees of freedom. So we just make the extra step: divide this guy by k, so that both the numerator and the denominator are chi squareds divided by their degrees of freedom. It doesn't change anything--I've just divided by a fixed number--but it just looks more elegant: it is the ratio of two independent chi squareds that are individually divided by their numbers of degrees of freedom.

And this has a name: it's called a Fisher or F-distribution. So unlike William Gosset, who was not allowed to use his own name and used the name Student, Fisher was allowed to use his own name, and that's called the Fisher distribution. The Fisher distribution has 2 parameters, a pair of degrees of freedom--one for the numerator and one for the denominator. So F, for Fisher distribution: F is equal to the ratio of a chi squared p over p and a chi squared q over q, and that's F_{p,q}, where the two chi squareds are independent. Is it clear what I'm defining here? So this is basically what plays the role of the t-distribution when you're testing more than one parameter at a time. You basically replace the normal that was in the numerator by a chi squared, because you're testing whether two vectors are simultaneously close, and the way you do it is by looking at their squared norm. That's how the chi squared shows up.

Quick remark--are those things really very different? How can I relate a chi squared with a t-distribution? Well, if t follows, say, a t with q degrees of freedom, that means that t is some N(0, 1) divided by the square root of a chi squared q over q. That's the distribution of t. So if I look at the distribution of t squared, that's the square of some N(0, 1) divided by a chi squared q over q. Agreed? I just removed the square root and took the square of the Gaussian. But what is the distribution of the square of a standard Gaussian?

AUDIENCE: Chi squared with 1 degree.

PHILIPPE RIGOLLET: Chi squared with 1 degree of freedom. And in particular, it's also a chi squared with 1 degree of freedom divided by 1. So t squared, in the end, has an F-distribution with 1 and q degrees of freedom. So those two things are actually very similar.
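Putting the pieces together, here is a sketch of the resulting test of H0: G beta = lambda, with the simulated X, y, beta_hat, sigma2_hat from the first sketch assumed in scope and a hypothetical pair of constraints; the rejection rule anticipates the one-sided q_alpha threshold discussed just below, and the last line checks numerically that the squared t quantile matches the F(1, q) quantile.

```python
import numpy as np
from scipy import stats

# Assumed in scope from the first sketch (all simulated): X, y, beta_hat, sigma2_hat.
n, p = X.shape
alpha = 0.05

# Hypothetical constraints, just for illustration: beta_1 = 0 and beta_2 + beta_3 = 0.
G = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
lam = np.zeros(2)
k = G.shape[0]                                 # number of constraints

# Quadratic form (G beta_hat - lam)^T [G (X^T X)^{-1} G^T]^{-1} (G beta_hat - lam)
diff = G @ beta_hat - lam
cov = G @ np.linalg.inv(X.T @ X) @ G.T
quad = diff @ np.linalg.solve(cov, diff)

# F-statistic: (chi^2_k / k) over (chi^2_{n-p} / (n-p)); sigma2_hat already carries the 1/(n-p).
F = quad / (k * sigma2_hat)

# Reject for large values only (the statistic is nonnegative), using the alpha quantile of F_{k, n-p}.
reject = F > stats.f.ppf(1 - alpha, dfn=k, dfd=n - p)
print(F, reject)

# Sanity check of the remark above: the squared alpha/2 quantile of t_q equals the alpha quantile of F_{1,q}.
q_df = n - p
print(stats.t.ppf(1 - alpha / 2, df=q_df) ** 2, stats.f.ppf(1 - alpha, dfn=1, dfd=q_df))
```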
PHILIPPE RIGOLLET: The only other thing that changes is that, since we typically look at absolute values of t when we do our tests, it's going to be exactly the same thing: the quantiles of one are essentially the square roots of the quantiles of the other. So if my test is: psi is equal to the indicator that |t| exceeds q_{alpha/2} of t_q, for example, then it's equal to the indicator that t squared exceeds the square of q_{alpha/2}--because I had the absolute value--which is equal to the indicator that t squared is greater than q_alpha of an F_{1,q}. So in a way, those two things belong to the same family. They really are a natural generalization of each other--at least, the F-test is a generalization of the t-test.

And so now I can perform my test just as it's written here: I form this statistic, and then I compare it against the quantile of an F-distribution. Notice there's no absolute value--oh yeah, I forgot, this is actually q_alpha, not q_{alpha/2}, because the F-statistic is already positive. So I'm not going to look at left and right; I'm just going to look at whether it's too large or not. That's by definition. And you can check: if you look at a table for Student and a table for F_{1,q}, you're going to have to move from one column to the other, from alpha over 2 to alpha, but one is going to be the square root of the other, just like the chi squared is the square of the Gaussian. If you look at the chi squared with 1 degree of freedom, you will see the same relation to the Gaussian.

So I'm actually going to start with the last remark on this slide, because you've been asking a few questions about why my design is deterministic. There are many answers; some are philosophical. But one of them is everything you cannot do if you don't condition on X: all of the statements that we made here--for example, just the fact that this is a chi squared--if those x's start to be random variables, then it's clearly not going to be a chi squared. It cannot be a chi squared both when those guys are deterministic and when they are random; things change. So that's maybe a [INAUDIBLE] check statement. But I think the one that really matters is this: remember, when we did the t-test, we had this gamma_j that showed up. Gamma_j was playing the role of the variance. And so far we haven't thought of the variance as a random variable--we'll talk about this in the Bayesian setup--so here, your x's really are the parameters of your data. The diagonal elements of (X^T X)^{-1} actually tell you what the variance is. That's also one reason why you should think of your x's as deterministic numbers. They are, in a way, things that change the geometry of your problem. They just say: let me look at the problem from the perspective of X.
PHILIPPE RIGOLLET: Actually, for that matter, we didn't really spend much time commenting on the effect of X on gamma. So remember, gamma_j was the variance parameter. So we should try to understand which X's lead to big variance and which X's lead to small variance. That would be nice. Well, if this is the identity matrix--let's say the identity over n, which is the natural thing to look at, because we want this thing to scale like 1/n--then this is just 1/n. We're back to the original case. Yes?

AUDIENCE: Shouldn't that be the inverse?

PHILIPPE RIGOLLET: Yeah, thank you--the inverse, yes. So if X transpose X is n times the identity, then the inverse is the identity over n. In that case gamma_j is equal to 1/n, and we're back to the theta hat, theta case--the basic one-dimensional thing.

Now, what does it mean for a matrix--forget about the scaling by n right now; that's just a matter of scaling, and I can always rescale my x's so that this factor shows up--what does it mean when X transpose X is the identity? How do I call such a matrix?

AUDIENCE: Orthonormal?

PHILIPPE RIGOLLET: Orthogonal, yeah--orthonormal or orthogonal. So you call this an orthogonal matrix. And when it's an orthogonal matrix, what it means is--so the entries of the matrix X transpose X are the inner products between the columns of X. That's what's happening; you can write it out, and you will see that the entries of this matrix are these inner products. If it's the identity, that means you get some 1's on the diagonal and a bunch of 0's, which means that the inner product between any two different columns is actually 0. What it means is that the columns of X form an orthonormal basis for your space. So they're basically as far from each other as they can be.

Now, if I start making those columns closer and closer to each other, then I'm starting to have some issues. X transpose X is not going to be the identity; I'm going to start to see some non-zero off-diagonal entries. And even if the columns all remain of norm 1, when I take the inverse of this matrix, the diagonal elements of the inverse are going to start to blow up. And so what that means is that the variance is going to blow up. That's essentially telling you that if you get to choose your x's, you want to take them as orthogonal as you can. But if you don't, then you just have to deal with it, and it will have a significant impact on your estimation performance.
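A small simulation of this effect (all values are illustrative assumptions): two unit-variance columns whose correlation rho increases, and the diagonal of (X^T X)^{-1}, i.e. the gamma_j's, blowing up as the columns stop being orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

def gamma_diag(rho):
    """Diagonal of (X^T X)^{-1} for a design whose two columns have correlation about rho."""
    z = rng.normal(size=(n, 2))
    x1 = z[:, 0]
    x2 = rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1]   # correlated second column
    X = np.column_stack([x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

for rho in [0.0, 0.5, 0.9, 0.99]:
    print(rho, gamma_diag(rho))   # the gamma_j's are near 1/n when rho = 0 and blow up as rho -> 1
```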
PHILIPPE RIGOLLET: And that's also why, routinely, statistical software is going to spit out this value for you--well, actually, the square root of this value. It's going to tell you, essentially, how much randomness, how much variation you have in this particular parameter that you're estimating. So if gamma_j is large, you're going to have wide confidence intervals and your tests are not going to reject very much. And that's all captured by X. That's what's important--all of this is completely captured by X. Then, of course, there was the sigma squared that showed up here; actually, it was here, even in the definition of gamma_j, and I forgot it--what is the sigma squared police doing? So this thing was there as well, and that's just exogenous; it comes from the noise itself. But there was this huge factor that came from the x's themselves.

So let's go back now to reading this list in a linear fashion. You're MIT students; you've probably heard that correlation does not imply causation many times. Maybe you don't know what it means--if you don't, that's OK, you just have to know the sentence. No--what it means is that it's not because I decided that something was going to be the x and something else was going to be the y that whatever coefficient I'm getting means that x implies y. For example, even if I do genetics, or genomics, or whatever, I implicitly assume that my genes are going to have an effect on my outward look. It could be the opposite--I mean, who am I to say? I'm not a biologist; I haven't opened a biology book in 20 years. So maybe, if I start hitting my head with a hammer, I'm going to change my genetic material. Probably not--but causation definitely does not come from statistics. It's not coming from there.

So actually, I remember, once, I put an exam to students, and there was an old data set on crime--from Chicago in the '60s, I think. So the y variable was just the rate of crime, and the x's were a bunch of things, and one of them was police expenditures. And if you ran the regression, you would find that the coefficient in front of police expenditure was a positive number, which means that if you increase police expenditures, that increases the crime--I mean, that's what it means to have a positive coefficient. Everybody agrees with this reading? If beta_j is 10, then it means that if I increase my police expenditure by $1, I [INAUDIBLE] my crime by 10, everything else being kept equal. Well, there were, I think, about 80% of the students who were able to explain to me why, if you give more money to the police, the crime is going to rise. Some people were like, well, the police are making too much money, and they don't think about their work, and they become lazy. I mean, people were really coming up with some crazy things. And what it just meant is that, no, it's not causation. It's just that if you have more crime, you give more money to your police. That's what's happening, and that's all there is. So just be careful when you draw conclusions: causation is a very important thing to keep in mind. And in practice, you need external sources of reasoning for causality--for example, with genetic material and physical traits we agree on what the direction of the arrow of causality is, but there are places where you might not.

Now, finally, the normality of the noise. Everything we did today required a normal, Gaussian distribution on the noise. I mean, it's everywhere: there's some Gaussian, there's some chi squared; everything came out of the Gaussian. And for that, we needed this basic formula for inference, which we derived from the fact that the noise was Gaussian itself. If we did not have that, the only thing we could write is that beta hat is this number, or this vector. We would not be able to say what the fluctuations of beta hat are. We would not be able to do tests. We would not be able to build, say, confidence regions or anything. And so this is an important condition that we need, and that's what statistical software assumes by default. But we now have a recipe for how to test it. We can do it visually, if we really want to conclude that, yes, this is Gaussian, using our normal Q-Q plots. And we can also do it using our favorite tests. What test should I be using to test that? With two names? Yeah?

AUDIENCE: Normal [INAUDIBLE].

PHILIPPE RIGOLLET: Not the two Russians. I want a Russian and a Scandinavian person for this one. What's that?

AUDIENCE: Lillie-something?

PHILIPPE RIGOLLET: Yeah, Lillie-something. So Kolmogorov Lillie-something test. [LAUGHS] It's the Kolmogorov-Lilliefors test. Because I'm testing whether the noise is Gaussian, and I'm actually not making any assumption about the variance--I don't need to know what the variance is. The mean is 0; we saw that at the beginning. It's 0 by construction, so we don't actually need to think about the mean being 0 itself--it just happens to be 0. So we know that it's 0, but the variance we don't know. So we just want to know whether the noise belongs to the family of Gaussians, and so we need Kolmogorov-Lilliefors for that. And that's also one of the things that statistical software spits out by default. When you run a linear regression, it actually spits out both Kolmogorov-Smirnov and Kolmogorov-Lilliefors, probably contributing to the widespread use of Kolmogorov-Smirnov when you really shouldn't use it.

So next time, we will talk about more advanced topics on regression, but I think I'm going to stop here for today. So again, tomorrow, sometime during the day, at least before the recitation, you will have a list of practice exercises that will be posted. And if you go to the optional recitation, you will have someone solving them.
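Finally, a sketch of the two diagnostics just mentioned, applied to the residuals from the first simulated sketch; the Lilliefors routine here is assumed to come from statsmodels (statsmodels.stats.diagnostic.lilliefors), which is an assumption about the environment rather than anything from the lecture.

```python
import numpy as np
from scipy import stats

# residuals is assumed in scope from the first sketch.

# Visual check: the points of a normal Q-Q plot should lie close to a straight line.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# Formal check: Kolmogorov-Lilliefors test for normality with estimated variance.
# Note: a plain Kolmogorov-Smirnov test against a fitted normal is not a substitute,
# since its critical values assume the parameters were not estimated from the data.
from statsmodels.stats.diagnostic import lilliefors
stat, p_value = lilliefors(residuals, dist="norm")
print(stat, p_value)
```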
72:19 And I mean, people were really coming up 72:20 with some crazy things. 72:22 And what it just meant is that, no, it's not causation. 72:26 It's just, if you have more crime, 72:28 you give more money to your police. 72:29 That's what's happening. 72:31 And that's all there is. 72:33 So just be careful when you actually 72:35 draw some conclusions that causation is a very important 72:38 thing to keep in mind. 72:39 And in practice, unless you have external sources of reason 72:43 for causality-- for example, genetic material 72:45 and physical traits, we agree upon what 72:52 the direction of the arrow of causality is here. 72:54 There's places where you might not. 72:57 Now, finally, the normality on the noise-- 72:59 everything we did today required normal Gaussian distribution 73:04 on the noise. 73:05 I mean, it's everywhere. 73:07 There's some Gaussian, there's some chi squared. 73:09 Everything came out of Gaussian. 73:11 And for that, we needed this basic formula 73:13 for inference, which we derived from the fact 73:15 that the noise was Gaussian itself. 73:18 If we did not have that, the only thing we could write 73:20 is, beta hat is this number, or this vector. 73:24 We would not be able to say, the fluctuations of beta hat 73:27 are this guy. 73:28 We would not be able to do tests. 73:30 We would not be able to build, say, 73:31 confidence regions or anything. 73:34 And so this is an important condition that we need, 73:38 and that's what statistical software assumes by default. 73:40 But we now have a recipe on how to do those tests. 73:44 We can do it either visually, if we really 73:47 want to conclude that, yes, this is Gaussian, 73:49 using our normal Q-Q plots. 73:51 And we can also do it using our favorite tests. 73:54 What test should I be using to test that? 73:56 74:01 With two names? 74:03 Yeah? 74:04 AUDIENCE: Normal [INAUDIBLE]. 74:06 PHILIPPE RIGOLLET: Not the 2 Russians. 74:08 So I want a Russian and a Scandinavian person 74:10 for this one. 74:12 What's that? 74:13 AUDIENCE: Lillie-something? 74:14 PHILIPPE RIGOLLET: Yeah, Lillie-something. 74:16 So Kolmogorov Lillie-something test. 74:18 And [LAUGHS] so it's the Kolmogorov Lilliefors test. 74:23 And because I'm testing if there Gaussian, and I'm actually 74:26 not really making any-- 74:28 I don't need to know what the variance is. 74:30 The mean is 0. 74:31 We saw that at the beginning. 74:32 It's 0 by construction, so we actually 74:34 don't need to think about the mean being 0 itself. 74:37 This just happens to be 0. 74:38 So we know that it's 0, but the variance, we don't know. 74:41 So we just want to know if it belongs 74:42 to the family of Gaussians, and so we need to Kolmogorov 74:45 Lilliefors for that. 74:46 And that's also one of the thing that's spit out by statistical 74:49 software by default. When you run a linear regression, 74:52 actually, it spits out both Kolmogorov-Smirnov 74:54 and Kolmogorov Lilliefors, probably contributing 74:59 to the widespread use of Kolmogorov-Smirnov when you 75:01 really shouldn't. 75:03 So next time, we will talk about more advanced topics 75:08 on regression. 75:09 But I think I'm going to stop here for today. 75:11 So again, tomorrow, sometime during the day, 75:14 at least before the recitation, you 75:16 will have a list of practice exercises that will be posted. 75:20 And if you go to the optional recitation, 75:23 you will have someone solving them 75:26