https://www.youtube.com/watch?v=QXkOaifVfW4&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=9 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:21 PHILIPPE RIGOLLET: So again, before we start, 00:23 there is a survey online if you haven't done so. 00:27 I would guess at least one of you has not. 00:30 Some of you have entered their answers and their thoughts, 00:33 and I really appreciate this. 00:35 It's actually very helpful. 00:36 So it seems that the course is going fairly well 00:40 from what I've read so far. 00:42 So if you don't think this is the case, 00:43 please enter your opinion and tell us 00:45 how we can make it better. 00:47 One of the things that was said is 00:48 that I speak too fast, which is absolutely true. 00:53 I just can't help it. 00:54 I get so excited, but I will really do my best. 00:59 I will try to. 01:02 I think I always start OK. 01:04 I just end not so well. 01:07 So last time we talked about this chi squared distribution, 01:10 which is just another distribution that's 01:13 so common that it deserves its own name. 01:16 And this is something that arises 01:17 when we sum the squares of independent standard Gaussian 01:22 random variables. 01:23 And in particular, why is that relevant? 01:25 It's because if I look at the sample variance, 01:27 then, properly rescaled, it has a chi square distribution, 01:29 and the parameter that shows up, also 01:32 known as the degrees of freedom, is the number 01:35 of observations minus one. 01:37 And so as I said, this chi squared 01:39 distribution has an explicit probability density function, 01:43 and I tried to draw it. 01:44 And one of the comments was also about my handwriting, 01:47 so I will actually not rely on it for detailed things. 01:52 So this is what the chi squared with one degree of freedom 01:54 would look like. 01:55 And really, what this is is just the distribution of the square 01:57 of a standard Gaussian. 01:58 I'm summing only one, so that's what it is. 02:01 Then when I go to 2, this is what it is-- 02:03 3, 4, 5, 6, and 10. 02:07 And as I move, you can see this thing 02:08 is becoming flatter and flatter, and it's pushing to the right. 02:11 And that's because I'm summing more and more squares, 02:14 and in expectation we just get one every time. 02:18 So it really means that the mass is moving to infinity. 02:23 In particular, a chi squared distribution 02:26 with n degrees of freedom is going to infinity 02:29 as n goes to infinity. 02:32 Another distribution that I asked 02:35 you to think about-- anybody looked around 02:38 about the student t-distribution, what 02:39 the history of this thing was? 02:42 So I'll tell you a little bit. 02:44 I understand if you didn't have time. 02:46 So the t-distribution is another common distribution 02:50 that is so common that it will be used 02:53 and will have its table of quantiles that are 02:56 drawn at the back of the book. 02:59 Now, remember, when I mentioned the Gaussian, I said, 03:02 well, there are several values for alpha 03:04 that we're interested in. 03:06 And so I wanted to draw a table for the Gaussian. 03:11 We had something that looked like this, 03:13 and I said, well, q alpha over 2 to get alpha over 2 03:21 to the right of this number.
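A minimal sketch of that little Gaussian table, assuming SciPy is available: for each common value of alpha, the quantile q alpha over 2 that leaves mass alpha over 2 to its right under N(0, 1).

```python
# Minimal sketch, assuming SciPy: the Gaussian "table" is just one short list,
# the quantile q_{alpha/2} leaving mass alpha/2 to its right under N(0, 1).
from scipy import stats

for alpha in [0.01, 0.05, 0.10]:
    q = stats.norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f}   q_(alpha/2) = {q:.3f}")
# alpha = 0.01 -> 2.576, alpha = 0.05 -> 1.960, alpha = 0.10 -> 1.645
```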
03:22 And we said that there is a table for these things, 03:25 for common values of alpha. 03:28 Well, if you try to envision what this table will look like, 03:31 it's actually a pretty sad table, 03:34 because it's basically one list of numbers. 03:35 Why would I call it a table? 03:37 Because all I need to tell you is 03:38 something that looks like this. 03:40 If I tell you this is alpha and this is q alpha over 2 03:43 and then I say, OK, basically the three alphas 03:47 that I told you I care about are something like 1%, 5%, and 10%, 03:54 then my table will just give me q alpha over 2. 03:57 So that's alpha, and that's q alpha over 2. 03:59 And that's going to tell me that-- 04:01 I don't remember this one, but this guy is 1.96. 04:04 This guy is something like 2.45. 04:08 I think this one is like 1.65 maybe. 04:11 And maybe you can be a little finer, 04:15 but it's not going to be an entire page 04:16 at the back of the book. 04:18 And the reason is because I only need 04:19 to draw these things for the standard Gaussian 04:22 when the parameters are 0 for the mean 04:24 and 1 for the variance. 04:26 Now, if I'm actually doing this for the chi squared, 04:30 I basically have to give you one table per value 04:34 of the degrees of freedom, because those things 04:37 are different. 04:38 There is no way I can take-- 04:41 for Gaussians, if you give me a different mean, 04:43 I can subtract it and make it back to be a standard Gaussian. 04:46 For the chi squared, there is no such thing. 04:49 There is nothing that just takes 04:50 the chi squared with d degrees of freedom 04:53 and turns it into, say, a chi square 04:54 with one degree of freedom. 04:56 This just does not happen. 04:58 So the word is standardize-- 05:01 make it a standard chi squared. 05:02 There is no such thing as a standard chi squared. 05:04 So what it means is that I'm going 05:05 to need one row like that for each value of the number 05:09 of degrees of freedom. 05:11 So that will certainly fill a page at the back of a book-- 05:14 maybe even more. 05:16 I need one per sample size. 05:18 So if I want to go from sample size 1 to 1,000, 05:21 I need 1,000 rows. 05:24 So now the student distribution is 05:26 one that arises where it looks very much like the Gaussian 05:30 distribution, and there's a very simple reason for that, is 05:33 that I take a standard Gaussian and I divide it by something. 05:37 That's how I get the student. 05:39 What do I divide it with? 05:40 Well, I take an independent chi square-- 05:42 I'm going to call it v-- 05:44 and I want it to be independent from z. 05:47 And I'm going to divide z by root v over d. 05:52 So I start with a chi squared, v. 05:55 So this guy is chi squared d. 05:58 I start with z, which is n 0, 1. 06:02 I'm going to assume that those guys are independent. 06:06 In my t-distribution, I'm going to write 06:08 a T. Capital T is z divided by the square root of v over d. 06:17 Why would I want to do this? 06:18 Well, because this is exactly what 06:20 happens when I divide a Gaussian not by the true variance, 06:25 but by its empirical variance. 06:28 So let's see why in a second. 06:30 So I know that if you give me some random variable-- 06:34 let's call it x, which is N mu sigma squared-- 06:38 then I can do this. 06:40 x minus mu divided by sigma. 06:45 I'm going to call this thing z, because this thing actually 06:47 has some standard Gaussian distribution.
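A hedged sketch of those two points, assuming NumPy and SciPy: the chi squared quantile genuinely depends on the degrees of freedom d, and simulating T = Z / sqrt(V / d), with Z standard Gaussian independent of V chi squared with d degrees of freedom, reproduces the t-distribution with d degrees of freedom.

```python
# Sketch, assuming NumPy/SciPy: chi-squared quantiles need d, and the ratio
# Z / sqrt(V / d) matches scipy's t-distribution with d degrees of freedom.
import numpy as np
from scipy import stats

for d in [1, 2, 5, 10, 100]:
    print(d, stats.chi2.ppf(0.95, df=d))      # no standardization: depends on d

rng = np.random.default_rng(0)
d, reps = 5, 200_000
z = rng.standard_normal(reps)                                 # Z ~ N(0, 1)
v = stats.chi2.rvs(df=d, size=reps, random_state=rng)         # V ~ chi2_d, independent of Z
t = z / np.sqrt(v / d)
print(stats.kstest(t, stats.t(df=d).cdf).statistic)           # small: the ratio is t_d
```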
06:51 I have standardized x into something 06:54 that I can read the quintiles at the back of the book. 06:58 So that's this process that I want to do. 07:00 Now, to be able to do this, I need to know what mu is, 07:03 and I need to know what sigma is. 07:05 Otherwise I'm not going to be able to make this operation. 07:09 mu I can sort of get away with, because remember, 07:13 when we're doing confidence intervals 07:15 we're actually solving for mu. 07:17 So it was good that mu was there. 07:20 When we're doing hypothesis testing, 07:22 we're actually plugging in here the mu that shows up in h0. 07:26 So that was good. 07:27 We had this thing. 07:28 Think of mu as being p, for example. 07:31 But this guy here, we don't necessarily know what it is. 07:36 I just had to tell you for the entire first chapter, 07:40 assume you have Gaussian random variables 07:41 and that you know what the variance is. 07:44 And the reason why I said assume you 07:45 know it-- and I said sometimes you can read it 07:47 on the side of the box of measuring equipment in the lab. 07:52 That was just the way I justified it, 07:54 but the real reason why I did this is because I would not 07:57 be able to perform this operation if I actually did not 08:00 know what sigma was. 08:02 But from data, we know that we can form this estimator 08:07 Sn, which is 1 over n, sum from i equals 1 to n 08:11 of Xi, minus X bar squared. 08:15 And this thing is approximately equal to sigma squared. 08:18 That's the sample variance, and it's actually 08:21 a good estimator just by the law of large number, actually. 08:25 This thing, by the law of large number, as n goes to infinity-- 08:29 08:32 well, let's say it in probability 08:34 goes to sigma squared by the law of large number. 08:36 So it's a consistent estimator of sigma squared. 08:40 So now, what I want to do is to be 08:43 able to use this estimator rather than using sigma. 08:46 And the way I'm going to do it is 08:47 I'm going to say, OK, what I want to form 08:50 is x minus mu divided by Sn this time. 08:58 I don't know what the distribution of this guy is. 09:01 Sorry, it's square root of Sn. 09:02 This is sigma squared. 09:05 So this is what I would take. 09:07 And I could think of Slutsky, maybe, 09:10 something like this that would tell me, well, just use that 09:14 and pretend it's a Gaussian. 09:15 And we'll see how actually it's sort 09:18 of valid to do that, because Slutsky tells us 09:20 it is valid to do that. 09:22 But what we can also do is to say, 09:24 well, this is actually equal to x minus mu, divided by sigma, 09:28 which I knew what the distribution of this guy is. 09:31 And then what I'm going to do is I'm going to just-- 09:33 well, I'm going to cancel this effect, sigma over square root 09:38 Sn. 09:39 So I didn't change anything. 09:41 I just put the sigma here. 09:43 So now what I know what I know is that this is some z, 09:47 and it has some standard Gaussian distribution. 09:51 What is this guy? 09:54 Well, I know that Sn-- 09:57 we wrote this here. 09:59 Maybe I shouldn't have put those pictures, 10:01 because now I keep on skipping before and after. 10:04 We know that Sn times n divided by sigma squared 10:14 is actually chi squared n minus 1. 10:18 10:22 So what do I have here? 10:23 I have that chi squared-- 10:25 so here I have something that looks like 1 over square root 10:29 of Sn divided by sigma squared. 10:32 10:35 This is what this guy is if I just do some more writing. 
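Before continuing the algebra, a quick numerical check of that last fact, assuming NumPy and SciPy and Gaussian data with made-up mean and variance: n times S_n over sigma squared should behave like a chi squared with n minus 1 degrees of freedom.

```python
# Sketch, assuming NumPy/SciPy: for Gaussian data, n * S_n / sigma^2 ~ chi2_{n-1},
# where S_n is the (biased) sample variance that divides by n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu, sigma, reps = 8, 2.0, 3.0, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
s_n = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)   # biased sample variance S_n
stat = n * s_n / sigma ** 2
print(stat.mean())                                              # close to n - 1 = 7
print(stats.kstest(stat, stats.chi2(df=n - 1).cdf).statistic)   # small: chi2_{n-1} fits
```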
10:38 And maybe I actually want to make my life a little easier. 10:41 I'm actually going to plug in my n here, 10:45 and so I'm going to have to multiply by square root of n 10:48 here. 10:49 10:56 Everybody's with me? 10:59 So now what I end up with is something that 11:01 looks like this, where I have-- 11:06 here I started with x. 11:07 11:15 I should really start with Xn bar minus mu times 11:19 square root of n. 11:21 That's what the central limit theorem would tell me. 11:24 I need to work with the average rather than just one 11:26 observation. 11:27 So if I start with this, then I pick up a square root of n 11:30 here. 11:30 11:43 So if I had the sigma here, I would know 11:45 that this thing is actually-- 11:47 Xn bar minus mu divided by sigma times the square root of n 11:54 would be a standard Gaussian. 11:56 So if I put Xn bar here, I really 11:58 need to put this thing that goes around the Xn bar. 12:00 12:04 That's just my central limit theorem 12:06 that says if I average, then my variance has shrunk by a factor 12:10 1 over n. 12:12 Now, I can still do this. 12:15 That was still fine. 12:16 And now I said that this thing is basically this guy. 12:26 So what I know is that this thing 12:28 is a chi squared with n minus 1 degrees of freedom, 12:32 so this guy here is chi squared with n 12:37 minus 1 degrees of freedom. 12:40 Let me call this thing v in the spirit of what was used there 12:44 and in the spirit of what is written here. 12:49 So this guy was called v, so I'm going to call this v. 12:53 So what I can write is that square root of n Xn 12:57 bar minus mu divided by square root of Sn 13:02 is equal to z times square root of n 13:10 divided by square root of v. Everybody's with me here? 13:20 13:23 Which I can rewrite as z times square root of v divided by n 13:37 And if you look at what the definition of this thing is, 13:40 I'm almost there. 13:41 What is the only thing that's wrong here? 13:45 This is a student distribution, right? 13:48 So there's two things. 13:49 The first one was that they should be independent, 13:51 and they actually are independent. 13:53 That's what Cochran's theorem tells me, 13:55 and you just have to count on me for this. 13:57 I told you already that Sn was independent of Xn bar. 14:01 So those two guys are independent, 14:04 which implies that the numerator and denominator here 14:07 are independent. 14:08 That's what Cochran's theorem tells us. 14:12 But is this exactly what I should 14:14 be seeing if I wanted to have my sample variance, if I 14:17 want to have to write this? 14:19 Is this actually the definition of a student distribution? 14:23 Yes? 14:25 No. 14:25 14:28 So we see z divided by square root of v over d. 14:33 That looks pretty much like it, except there's 14:35 a small discrepancy. 14:36 What is the discrepancy? 14:38 14:47 There's just the square root of n minus 1 thing. 14:50 So here, v has n minus 1 degrees of freedom. 14:55 And in the definition, if the v has d degrees of freedom, 14:58 I divide it by d, not by d minus 1 or not by d plus 1, actually, 15:04 in this case. 15:06 So I have this extra thing. 15:07 Well, there's two ways I can address this. 15:09 15:13 The first one is by saying, well, 15:14 this is actually equal to z over square root 15:18 of v divided by n minus 1 times square root of n 15:27 over n minus 1. 15:28 15:32 I can always do that and say for n large enough 15:35 this thing is actually going to be pretty small, 15:37 or I can take account for it. 
15:39 Or for any n you give me, I can compute this number. 15:43 And so rather than having a t-distribution, 15:45 I'm going to have a t-distribution times 15:47 this deterministic number, which is just 15:49 a function of my number of observations. 15:52 But what I actually want to do instead 15:55 is probably use a slightly different normalization, 16:00 which is just to say, well, why do I have to define Sn-- 16:04 16:10 where was my Sn? 16:11 Yeah, why do I have to define Sn to be divided by n? 16:14 Actually, this is a biased estimator, 16:17 and if I wanted to be unbiased, I can actually just 16:20 put an n minus 1 here. 16:22 You can check that. 16:23 You can expand this thing and compute the expectation. 16:25 You will see that it's actually not sigma squared, 16:27 but n minus 1 over n sigma squared. 16:31 So you can actually just make it unbiased. 16:33 Let's call this guy tilde, and then 16:35 when I put this tilde here what I actually get is s tilde here 16:43 and s tilde here. 16:46 I need actually to have n minus 1 here 16:49 to have this s tilde be a chi squared distribution. 16:55 Yes? 16:56 AUDIENCE: [INAUDIBLE] defined this way so that you-- 17:02 PHILIPPE RIGOLLET: So basically, this is what the story did. 17:04 So the story was, well, rather than using always 17:08 the central limit theorem and just pretending 17:10 that my Sn is actually the true sigma squared, 17:13 since this is something I'm going to do a lot, 17:16 I might as well just compute the distribution, 17:19 like the quantiles for this particular distribution, 17:21 which clearly does not depend on any unknown parameter. 17:24 d is the only parameter that shows up here, 17:27 and it's completely characterized 17:28 by the number of observations that you have, 17:30 which you definitely know. 17:32 And so people said, let's just be slightly more accurate. 17:35 And in a second, I'll show you how the distribution of the T-- 17:38 so we know that if the sample size is large enough, 17:41 this should not have any difference with the Gaussian 17:43 distribution. 17:44 I mean, those two things should be 17:45 the same because we've actually not paid 17:48 attention to this discrepancy by using empirical variance rather 17:51 than true so far. 17:52 And so we'll see what the difference is, 17:55 and this difference actually manifests itself only 17:57 in small sample sizes. 17:59 So those are things that matter mostly 18:02 if you have less than, say, 50 observations. 18:04 Then you might want to be slightly more precise 18:06 and use t-distribution rather than Gaussian. 18:08 So this is just a matter of being slightly more precise. 18:12 If you have more than 50 observations, 18:14 just drop everything and just pretend 18:15 that this is the true one. 18:17 18:19 Any other questions? 18:22 So now I have this thing, and so I'm 18:25 on my way to changing this guy. 18:27 So here now, I have not root n but root n minus 1. 18:31 18:47 So I have a z. 18:48 So this guy here is S. Wait, where did I get my root 18:55 n from in the first place? 18:56 19:00 Yeah, because I wanted this guy. 19:02 And so now what I am left with is Xn bar minus mu 19:05 divided by the square root of Sn tilde, which is the new one, which is now 19:08 indeed of the form z divided by the square root of v over n minus 1, which is 19:14 exactly the form z over square root of v over d with d equals n minus 1. 19:16 And so now I have exactly what I want, 19:22 and so this guy is n 0, 1. 19:25 And this guy is chi squared with n minus 1 degrees of freedom. 19:30 And so now I'm back to what I want.
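A small simulation sketch of where this lands, assuming NumPy and SciPy and made-up Gaussian data: sqrt(n) times (Xn bar minus mu) over the square root of the unbiased sample variance matches the t-distribution with n minus 1 degrees of freedom, and for small n its quantiles are visibly heavier-tailed than the Gaussian ones.

```python
# Sketch, assuming NumPy/SciPy: the studentized mean of Gaussian data follows
# t_{n-1}, and t quantiles exceed the Gaussian 1.96 for small degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, mu, sigma, reps = 10, 5.0, 2.0, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
t_stat = np.sqrt(n) * (x.mean(axis=1) - mu) / np.sqrt(x.var(axis=1, ddof=1))  # ddof=1: divide by n-1
print(stats.kstest(t_stat, stats.t(df=n - 1).cdf).statistic)    # small: matches t_{n-1}

# Heavier tails than the Gaussian, hence wider confidence intervals for small n:
for d in [1, 2, 5, 10, 40, 50]:
    print(d, round(stats.t.ppf(0.975, df=d), 3), round(stats.norm.ppf(0.975), 3))
```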
19:33 So rather than using Sn to be the empirical variance where 19:37 I just divide my normalization by n, if I use n minus 1, 19:41 I'm perfect. 19:42 Of course, I can still use n and do this multiplying 19:45 by root n minus 1 over n at the end. 19:47 But that just doesn't make as much sense. 19:49 19:52 Everybody's fine with what this T n distribution is doing 19:54 and why this last line is correct? 19:58 So that's just basically because it's 20:01 been defined so that this is actually happening. 20:04 That was your question, and that's really what happened. 20:07 So what is this student t-distribution? 20:11 Where does the name come from? 20:13 Well, it does not come from Mr. T. And if you know who Mr. 20:18 T was-- you're probably too young for that-- 20:20 he was our hero in the 80s. 20:23 And it comes from this guy. 20:26 His name is William Sealy Gosset-- 20:29 1908. 20:29 So that was back in the day. 20:31 And this guy actually worked at the Guinness Brewery 20:33 in Dublin, Ireland. 20:35 And Mr. Guinness back then was a bit of a fascist, 20:38 and he didn't want him to actually publish papers. 20:41 And so what he had to do is to use a fake name to do that. 20:45 And he was not very creative, and he used the name "Student." 20:50 Because I guess he was a student of life. 20:52 And so here's the guy, actually. 20:55 So back in 1908, it was actually not 20:57 difficult to put your name or your pen name 21:01 on a distribution. 21:03 So what does this thing look like? 21:05 How does it compare to the standard normal distribution? 21:09 You think it's going to have heavier or lighter tails 21:12 compared to the standard distribution, 21:13 the Gaussian distribution? 21:17 Yeah, because they have extra uncertainty in the denominator, 21:21 so it's actually going to make things wiggle a little wider. 21:25 So let's start with a reference, which 21:26 is the standard normal distribution. 21:29 So that's my usual bell-shaped curve. 21:31 And this is actually the t-distribution 21:33 with 50 degrees of freedom. 21:35 So right now, that's probably where you should just 21:37 stand up and leave, because you're like, 21:39 why are we wasting our time? 21:40 Those are actually pretty much the same thing, and it is true. 21:43 If you have 50 observations, both the central limit 21:46 theorem-- so here one of the things that you need to know 21:49 is that if I want to talk about t-distribution for, say, eight 21:54 observations, I need those observations to be Gaussian 21:57 for real. 21:57 There's no central limit theorem happening 21:59 at eight observations. 22:00 But really, what this is telling me 22:02 is not that the central limit theorem kicks in. 22:04 It's telling me what are the asymptotics that kick in? 22:07 22:13 The law of large numbers, right? 22:15 This is exactly this guy. 22:19 That's here. 22:21 When I write this statement, what this picture is really 22:24 telling us is that for n is equal to 50, I'm at the limit 22:28 already almost. 22:29 There's virtually no difference between using 22:32 the left-hand side or using sigma squared. 22:36 And now I start reducing. 22:38 40, I'm still pretty good. 22:39 We can start seeing that this thing is actually 22:41 losing some mass on top, and that's 22:43 because it's actually pushing it to the left 22:44 and to the right in the tails. 22:46 And then we keep going, keep going, keep going. 22:49 So that's at 10. 22:50 When you're at 10, there's not much of a difference.
22:53 And so you can start seeing difference 22:54 when you're at five, for example. 22:57 You can see the tails become heavier. 22:59 And the effect of this is that when I'm going to build, 23:01 for example, a confidence interval to put the same amount 23:05 of mass to the right of some number-- 23:07 let's say I'm going to look at this q alpha over 2-- 23:09 I'm going to have to go much farther, which 23:11 is going to result in much wider confidence intervals 23:17 to 4, 3, 2, 1. 23:20 So that's the t1. 23:22 Obviously that's the worst. 23:24 And if you ever use the t1 distribution, 23:30 please ask yourself, why in the world are you doing statistics 23:33 based on one observation? 23:35 23:38 But that's basically what it is. 23:41 So now that we have this t-distribution, 23:44 we can define a more sophisticated test 23:48 than just take your favorite estimator 23:50 and see if it's far from the value you're currently testing. 23:53 That was our rationale to build a test before. 23:57 And the first test that's non-trivial 24:00 is a test that exploits the fact that the maximum likelihood 24:04 estimator, under some technical condition, 24:07 has a limit distribution which is Gaussian with mean 0 24:12 when properly centered and a covariance matrix given 24:18 by the Fisher information matrix. 24:19 Remember this Fisher information matrix? 24:21 24:26 And so this is the setup that we have. 24:29 So we have, again, an i.i.d. 24:31 sample. 24:32 Now I'm going to assume that I have a d-dimensional parameter 24:35 space, theta. 24:36 And that's why I talk about Fisher information matrix-- 24:39 and not just Fisher information. 24:41 It's a number. 24:42 And I'm going to consider two hypotheses. 24:45 So you're going to have h0, theta is equal to theta 0. 24:52 h1, theta is not equal to theta 0. 24:56 And this is basically what we had in mind 25:00 when we said, are we testing if a coin is fair or unfair. 25:05 So fair was p equals 1/2, and unfair was p different from 1/2. 25:09 And here I'm just making my life a bit easier. 25:13 So now, I have this maximum likelihood estimate 25:16 that I can construct. 25:17 Because let's say I know what p theta is, 25:20 and so I can build a maximum likelihood estimator. 25:23 And I'm going to assume that these technical conditions that 25:26 ensure that this maximum likelihood properly 25:29 standardized converges to some Gaussian are actually satisfied, 25:35 and so this thing is actually true. 25:38 So the theorem, the way I stated it-- 25:41 if you're a little puzzled, this is not the way I stated it. 25:44 And the first time, the way we stated it was that theta hat 25:47 mle minus theta not-- so here I'm 25:51 going to place myself under the null hypothesis, 25:53 so here I'm going to say under h0. 25:58 And honestly, if you have any exercise on tests, 26:01 that's the way that it should start. 26:03 What is the distribution under h0? 26:05 Because otherwise you don't know what this guy should be. 26:08 So you have this, and what we showed 26:10 is that this thing was going in distribution as n goes 26:12 to infinity to some normal with mean 0 26:15 and covariance matrix, which was i of theta, 26:19 which was here for the true parameter. 26:21 But here I'm under h0, so there's 26:22 only one true parameter, which is theta 0. 26:24 26:32 This was our central limit theorem for-- 26:36 I mean, it's not really a central limit theorem; 26:38 a limit theorem for the maximum likelihood estimator. 26:43 Everybody remembers that part?
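A hedged illustration of this theorem in the Bernoulli(p) model, assuming NumPy and SciPy, where the MLE is the sample mean and the Fisher information is I(p) = 1 / (p(1 - p)); the choice of p0 and the sample size below are made up.

```python
# Sketch, assuming NumPy/SciPy: under H0 (p = p0), sqrt(n) (p_hat - p0) is
# approximately N(0, 1 / I(p0)) with I(p0) = 1 / (p0 (1 - p0)) for Bernoulli.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p0, n, reps = 0.5, 400, 50_000
p_hat = rng.binomial(n, p0, size=reps) / n          # Bernoulli MLE = sample mean
stat = np.sqrt(n) * (p_hat - p0)
print(stat.var())                                   # ~ p0 (1 - p0) = 1 / I(p0) = 0.25
print(np.quantile(stat, 0.975), stats.norm.ppf(0.975, scale=0.5))   # both ~ 0.98
```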
26:47 The line before said, under technical conditions, I guess. 26:50 So now, it's not really stated in the same way. 26:53 If you look at what's on the slide, 26:54 here I don't have the Fisher information matrix, 26:57 but I really have the identity of rd. 26:59 27:02 If I have a random variable x, which 27:05 has some covariance matrix sigma, 27:10 how do I turn this thing into something that 27:12 has covariance matrix identity? 27:15 So if this was a sigma squared, well, the thing I would do 27:20 would be divide by sigma, and then I 27:21 would have a 1, which is also known 27:24 as the identity matrix of r1. 27:28 Now, what is this? 27:30 This was root of sigma squared. 27:32 So what I'm looking for is the equivalent 27:35 of taking sigma and dividing by the square root of sigma, 27:40 which-- 27:40 obviously those are matrices-- 27:42 I'm certainly not allowed to do. 27:43 And so what I'm going to do is I'm actually 27:45 going to do the following. 27:48 So 1 over root of sigma squared 27:51 can be written as sigma to the negative 1/2. 27:55 And this is actually the same thing here. 27:58 So I'm going to write it as sigma to the negative 1/2, 28:02 and now this guy is actually well-defined. 28:06 So this is a positive symmetric matrix, 28:08 and you can actually define the square root 28:10 by just taking the square root of its eigenvalues, 28:16 for example. 28:17 And so you get sigma to the negative 1/2 times x, and that follows n 0, identity. 28:23 28:26 And in general, I'm going to see something 28:30 that looks like sigma to the negative 1/2 times sigma 28:34 times sigma to the negative 1/2. 28:37 And I have minus 1/2 plus 1 minus 1/2. 28:40 This whole thing collapses to 0, and it's actually the identity. 28:45 So that's the actual rule. 28:47 So if you're not familiar, this is basic multivariate Gaussian 28:52 distribution computations. 28:54 Take a look at it. 28:57 If you feel like you don't need to look at it 28:59 but you know the basic maneuver, it's fine as well. 29:03 We're not going to go much deeper into that, 29:05 but those are part of the things that 29:07 are sort of standard manipulations 29:09 about standard Gaussian vectors. 29:11 Because obviously, standard Gaussian vectors 29:13 arise from this theorem a lot. 29:17 So now I pre-multiply by my sigma to the minus 1/2. 29:22 Now of course, I'm doing all of this in the asymptotics, 29:24 and so I have this effect. 29:26 So if I pre-multiply everything by sigma to the 1/2, 29:29 sigma being the Fisher information matrix at theta 0, 29:34 then this is actually equivalent to saying that square root 29:38 of n-- 29:39 29:43 so now i of theta now plays the role of sigma-- 29:51 times theta hat mle minus theta not goes in distribution 29:59 as n goes to infinity to some multivariate standard Gaussian, 30:06 n 0, identity of rd. 30:09 And here, to make sure that we're 30:10 talking about a multivariate distribution, 30:13 I can put a d here-- 30:16 so just so we know we're talking about the multivariate, 30:18 though it's pretty clear from the context, 30:20 since the covariance matrix is actually a matrix and not 30:23 a number. 30:23 Michael? 30:24 AUDIENCE: [INAUDIBLE]. 30:26 30:29 PHILIPPE RIGOLLET: Oh, yeah. 30:30 Right. 30:31 Thanks. 30:31 30:34 So yeah, you're right. 30:35 So that's a minus and that's a plus. 30:39 Thanks. 30:40 So yeah, anybody has a way to remember 30:47 whether it's inverse Fisher information or Fisher 30:49 information as a variance other than just learning it?
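A small sketch of this standardization, assuming NumPy: build sigma to the negative 1/2 from the eigendecomposition (square roots of the eigenvalues, as described above), check that sigma to the negative 1/2 times sigma times sigma to the negative 1/2 collapses to the identity, and check that it standardizes Gaussian vectors; the covariance matrix below is made up.

```python
# Sketch, assuming NumPy: Sigma^(-1/2) via eigendecomposition of a symmetric
# positive definite matrix, and Sigma^(-1/2) x ~ N(0, I) when x ~ N(0, Sigma).
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal((3, 3))
sigma = a @ a.T + 3 * np.eye(3)          # some made-up positive definite covariance

eigvals, eigvecs = np.linalg.eigh(sigma)
sigma_inv_half = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

print(np.allclose(sigma_inv_half @ sigma @ sigma_inv_half, np.eye(3)))   # True

x = rng.multivariate_normal(np.zeros(3), sigma, size=200_000)
z = x @ sigma_inv_half.T                  # standardized vectors
print(np.cov(z, rowvar=False).round(2))   # ~ identity
```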
30:54 It is called information, so it's really telling me 30:58 how much information I have. 31:00 So when a variance increases, I'm 31:02 getting less and less information, 31:04 and so this thing should actually be 1 over a variance. 31:08 The notion of information is 1 over a notion of variance. 31:10 31:13 So now I just wrote this guy like this, and the reason 31:19 why I did this is because now everything 31:21 on the right-hand side does not depend on any known parameter. 31:26 There's 0 and identity. 31:30 Those two things are just absolute numbers 31:33 or absolute quantities, which means that this thing-- 31:38 I call this quantity here-- 31:42 what was the name that I used? 31:44 Started with a "p." 31:47 Pivotal. 31:47 So this is a pivotal quantity, meaning 31:50 that its distribution, at least asymptotic distribution, 31:53 does not depend on any unknown parameter. 31:56 Moreover, it is indeed a statistic, 32:00 because I can actually compute it. 32:03 I know theta 0 and I know theta hat mle. 32:05 One thing that I did, and you should actually 32:08 complain about this, is on the board 32:11 I actually used i of theta not. 32:15 And on the slides, it says i of theta hat. 32:20 And it's exactly the same thing that we did before. 32:22 Do I want to use the variance as a way for me 32:26 to check whether I'm under the right assumption or not? 32:29 Or do I actually want to leave that part 32:31 and just plug in the theta hat mle, which should 32:33 go to the true one eventually? 32:36 Or do I actually want to just plug in the theta 0? 32:39 So this is exactly playing the same role 32:41 as whether I wanted to see square root of Xn bar 32:45 1 minus Xn bar in the denominator of my test 32:48 statistic for p, or if I wanted to see square root of 0.5, 32:55 1 minus 0.5 when I was testing if p was equal to 0.5. 32:59 So this is really a choice that's left up to you, 33:03 and that's something you can really choose the two. 33:06 And as we said, maybe this guy is slightly more precise, 33:09 but it's not going to extend to the case 33:11 where theta 0 is not reduced to one single number. 33:15 33:20 Any questions? 33:22 So now we have our pivotal distribution, so from there 33:26 this is going to be my test statistic. 33:29 I'm going to use this as a test statistic 33:31 and declare that if this thing is too large, 33:35 n absolute value-- 33:36 because this is really a way to quantify how far theta hat is 33:41 from theta 0. 33:41 And since theta hat should be close to the true one, when 33:44 this thing is large in absolute value, 33:45 it means that the true theta should be far from theta 0. 33:50 So this is my new test statistic. 33:56 Now, I said it should be far, but this is a vector. 33:59 So if I want a vector to be far, two vectors to be far, 34:02 I measure their norm. 34:04 And so I'm going to form the Euclidean norm of this guy. 34:07 So if I look at the Euclidean norm of n-- 34:10 34:14 and Euclidean norm is the one you know-- 34:16 34:22 I'm going to take its square. 34:25 Let me now put a 2 here. 34:26 So that's just the Euclidean norm, 34:28 and so the norm of vector x is just x transpose x. 34:36 In the slides, the transpose is denoted by prime. 34:40 Wow, that's hard to say. 34:41 Put prime in quotes. 34:42 34:48 That's a statistic standard that people do. 34:50 They put prime for transpose. 34:53 Everybody knows what the transpose is? 
34:56 So I just make it flat and I do it like this, 34:58 and then that means that's actually 34:59 equal to the sum of the coordinates Xi squared. 35:03 35:06 And that's what you know. 35:08 But here, I'm just writing it in terms of vectors. 35:10 And so when I want to write this, this is equivalent, 35:13 this is equal to-- 35:14 well, the square root of n is going to pick up the square. 35:17 So I get square root of n times square root of n. 35:20 So this guy is just 1/2. 35:23 So 1/2 times 1/2 is going to give me 1, 35:25 and so I get theta hat mle minus theta. 35:29 And then I have i of theta not. 35:32 And then I get theta hat mle minus theta not. 35:37 And so by definition, I'm going to say that this 35:41 is my test statistic Tn. 35:45 And now I'm going to have a test that rejects if Tn is large, 35:50 because Tn is really measuring the distance between theta hat 35:53 and theta 0. 35:55 So my test now is going to be psi, which rejects. 36:20 So it says 1 if Tn is larger than some threshold T. 36:27 And how do I pick this T? 36:30 Well, by controlling my type I error-- 36:32 sorry, the c by controlling my type I error. 36:35 So to choose c, what we have to check 36:44 is that p under theta not-- 36:47 so here it's theta not-- 36:49 that I reject so that psi is equal to 1. 36:55 I want this to be equal to alpha, right? 36:58 That's how I maximize my type I error 37:01 under the budget that's actually given to me, which is alpha. 37:04 So that's actually equivalent to checking whether p not of Tn 37:12 is larger than c. 37:13 37:19 And so if I want to find the c, all I need to know 37:23 is what is the distribution of Tn when 37:25 theta is equal to theta not? 37:28 Whatever this distribution is-- maybe it has some weird density 37:31 like this-- 37:32 whatever this distribution is, I'm 37:35 just going to be able to pick this number, 37:37 and I'm going to take this quantile alpha, here alpha, 37:41 and I'm going to reject if I'm larger than alpha-- 37:44 whatever this guy is. 37:45 So to be able to do that, I need to know 37:47 what is the distribution of Tn when theta is equal to theta 0. 37:56 What is this distribution? 38:00 What is Tn? 38:02 It's the norm squared of this vector. 38:08 What is this vector? 38:09 What is the asymptotic distribution of this vector? 38:12 38:17 Yes? 38:18 AUDIENCE: [INAUDIBLE]. 38:21 PHILIPPE RIGOLLET: Just look one board up. 38:23 What is this asymptotic distribution 38:24 of the vector for which we're taking the norm squared? 38:27 It's right here. 38:30 It's a standard Gaussian multivariate. 38:33 So when I look at the norm squared-- 38:36 so if z is a standard Gaussian multivariate, 38:45 then the norm of z squared, by definition of the norm squared, 38:51 is the sum of the Zi squared. 38:54 39:01 That's just the definition of the norm. 39:04 But what is this distribution? 39:06 AUDIENCE: Chi-squared. 39:07 PHILIPPE RIGOLLET: That's a chi-square, 39:09 because those guys are all of variance 1. 39:12 That's what the diagonal tells me-- 39:15 only ones. 39:15 And they're independent because they have all these zeros 39:18 outside of the diagonal. 39:20 So really, this follows some chi-squared distribution. 39:23 How many degrees of freedom? 39:25 Well, the number of them that I sum, d. 39:30 So now I have found the distribution of Tn 39:33 under this guy. 39:35 And that's true because this is true under h0. 39:41 If I was not under h0, again, I would 39:44 need to take another guy here.
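A hedged sketch of Wald's test as just described, assuming NumPy and SciPy; the estimate, the Fisher information matrix, and the sample size in the usage line are placeholders, not numbers from the lecture.

```python
# Sketch, assuming NumPy/SciPy: Wald statistic and the chi-squared threshold.
import numpy as np
from scipy import stats

def wald_test(theta_hat, theta_0, fisher_info, n, alpha=0.05):
    """T_n = n (theta_hat - theta_0)' I(theta_0) (theta_hat - theta_0); reject if T_n > q_alpha."""
    diff = np.asarray(theta_hat) - np.asarray(theta_0)
    t_n = n * diff @ np.asarray(fisher_info) @ diff
    q_alpha = stats.chi2.ppf(1 - alpha, df=len(diff))   # 1 - alpha quantile of chi2_d
    return t_n, q_alpha, t_n > q_alpha

# Toy usage with made-up numbers:
print(wald_test(theta_hat=[0.48, 1.10], theta_0=[0.5, 1.0],
                fisher_info=np.eye(2), n=200))
```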
39:46 39:49 How did I use the fact that theta is equal to theta 0 39:52 when I centered by theta 0? 39:54 And that was very important. 39:57 So now what I know is that this is really equal-- 40:01 why did I put 0 here? 40:02 40:05 So this here is actually equal. 40:10 So in the end, I need c such that the probability-- 40:23 and here I'm not going to put a theta 0. 40:25 I'm just talking about the distribution 40:26 of the random variable that I'm going to put in there. 40:29 It's a chi-square with d degrees of freedom [INAUDIBLE] 40:31 is equal to alpha. 40:32 40:35 I just replaced the fact that this guy, Tn, 40:39 under the distribution was just a chi-square. 40:41 And this distribution here is just 40:42 really referring to the distribution of a chi-square. 40:44 There's no parameters here. 40:46 And now, that means that I look at my chi-square distribution. 40:51 It sort of looks like this. 40:55 And I'm going to pick some alpha here, 40:59 and I need to read this number q alpha. 41:02 41:04 And so here what I need to do is to pick this q alpha here, 41:09 for c. 41:11 So take c to be q alpha, the quantile of order 1 minus 41:28 alpha of a chi-squared distribution 41:31 with d degrees of freedom. 41:32 And why do I say 1 minus alpha? 41:33 Because again, the quantiles are usually 41:36 referring to the area that's to the left of them by-- 41:41 well, actually, it's by a convention. 41:47 However, in statistics, we only care about the right tail 41:52 usually, so it's not very convenient for us. 41:55 And that's why rather than calling 41:56 this guy q sub 1 minus alpha all the time, I write it q alpha. 42:01 So now you have this q alpha, which 42:03 is the 1 minus alpha quantile, or quantile of order 1 minus 42:08 alpha of chi squared d. 42:10 And so now I need to use a table. 42:12 For each d, this thing is going to take a different value, 42:15 and this is why I cannot just spit out a number to you 42:18 like I spit out 1.96. 42:21 Because if I were able to do that, 42:24 that would mean that I would remember 42:25 an entire column of this table for each possible value of d, 42:30 and that I just don't know. 42:32 So you need just to look at tables, 42:34 and this is what it will tell you. 42:36 Often software will do that, too. 42:38 You don't have to search through tables. 42:41 And so just as a remark is that this test, Wald's test, 42:46 is also valid when I have this sort of other alternative 42:50 that I could see quite a lot-- 42:51 if I actually have what's called a one-sided alternative. 42:55 By the way, this is called Wald's test-- 42:58 so taking Tn to be this thing. 43:01 43:09 So this is Wald's test. 43:12 Abraham Wald was a famous statistician 43:15 in the early 20th century, who actually was at Columbia 43:22 for quite some time. 43:26 And that was actually at the time 43:27 where statistics were getting very popular in India, 43:33 so he was actually traveling all over India 43:35 in some dinky planes. 43:37 And one of them crashed, and that's how he died-- 43:41 pretty young. 43:42 But actually, there's a huge school of statistics 43:45 now in India thanks to him. 43:47 There's the Indian Statistical Institute, 43:49 which is actually a pretty big thing 43:51 and trains the best statisticians. 43:53 So this is called Wald's test, and it's actually 43:55 a pretty popular test. 43:56 Let's just look back a second.
43:59 So you can do the other alternatives, 44:01 as I said, and for the other alternatives 44:03 you can actually do this trick where you put theta 0 as 44:06 well, as long as you take the theta 0 that's 44:08 the closest to the alternative. 44:10 You just basically take the one that's the least favorable 44:13 to you-- 44:13 44:16 the alternative, I mean. 44:18 So what is this thing doing? 44:21 If you did not know anything about statistics and I told 44:25 you here's a vector-- 44:26 that's the mle vector, theta hat mle. 44:29 44:32 So let's say this theta hat mle takes the values, say-- 44:36 44:44 so let's say theta hat mle takes values, say, 1.2, 0.9, and 2.1. 44:57 And then testing h0, theta is equal to 1, 1, 2, versus theta 45:06 is not equal to the same number. 45:08 That's what I'm testing. 45:11 So you compute this thing and you find this. 45:13 If you don't know any statistics, 45:14 what are you going to do? 45:15 45:18 You're just going to check if this guy goes to that guy, 45:21 and probably what you're going to do is compute something that 45:24 looks like the norm squared between those guys-- so 45:27 the sum. 45:28 So you're going to do 1.2 minus 1 squared 45:31 plus 0.9 minus 1 squared plus 2.1 minus 2 squared 45:38 and check if this number is large or not. 45:41 Maybe you are going to apply some stats to try to understand 45:44 how those things are, but this is basically 45:46 what you are going to want to do. 45:49 What Wald's test is telling you is 45:52 that this average is actually not what you should be doing. 45:56 It's telling you that you should have some sort 45:59 of a weighted average. 46:00 Actually, it would be a weighted average 46:01 if I was guaranteed that my Fisher information 46:06 matrix was diagonal. 46:08 If my Fisher information matrix is diagonal, 46:10 looking at this number minus this guy, 46:13 transpose i, and then this guy minus this, 46:16 that would look like I have some weight here, some weight here, 46:19 and some weight here. 46:19 46:25 Sorry, that's only three. 46:29 So if it has non-zero numbers on all of its nine entries, 46:32 then what I'm going to see is weird cross-terms. 46:36 If I look at some vector pre-multiplying this thing 46:41 and post-multiplying this thing-- 46:42 so if I look at something that looks like this, 46:44 x transpose i of theta not, x transpose-- 46:51 think of x as being theta hat mle minus theta-- 46:56 so if I look at what this guy looks like, 46:58 it's basically a sum over i and j of Xi, Xj, i, theta not Ij. 47:08 And so if none of those things are 0, 47:11 you're not going to see a sum of three terms that are squares, 47:14 but you're going to see a sum of nine cross-products. 47:18 And it's just weird. 47:20 This is not something standard. 47:21 So what is Wald's test doing for you? 47:26 Well, it's saying, I'm actually going 47:29 to look at all the directions all at once. 47:32 Some of those directions are going 47:33 to have more or less variance, i.e., less or more information. 47:41 And so for those guys, I'm actually 47:43 going to use a different weight. 47:45 So what you're really doing is putting a weight 47:47 on all directions of the space at once. 47:51 So what this Wald's test is doing-- 47:53 by squeezing in the Fisher information matrix, 47:56 it's placing your problem into the right geometry. 
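A hedged illustration of that comparison with the toy numbers above, assuming NumPy; the non-diagonal Fisher information matrix below is made up purely to show how the cross-terms enter.

```python
# Sketch, assuming NumPy: naive squared distance versus the Wald quadratic form,
# using the lecture's toy estimate and a hypothetical non-diagonal I(theta_0).
import numpy as np

theta_hat = np.array([1.2, 0.9, 2.1])
theta_0 = np.array([1.0, 1.0, 2.0])
diff = theta_hat - theta_0

naive = diff @ diff                          # (1.2-1)^2 + (0.9-1)^2 + (2.1-2)^2 = 0.06
fisher = np.array([[4.0, 1.0, 0.5],          # hypothetical I(theta_0), not from the lecture
                   [1.0, 2.0, 0.3],
                   [0.5, 0.3, 1.0]])
wald_form = diff @ fisher @ diff             # sum over i, j of x_i x_j I(theta_0)_{ij}
print(naive, wald_form)
```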
48:00 It's a geometry that's distorted and where balls become ellipses 48:05 that are distorted in some directions 48:07 and shrunk in others, or depending 48:10 on if you have more variance or less variance in those 48:12 directions. 48:13 Those directions don't have to be 48:14 aligned with the axes of your coordinate system. 48:18 And if they were, then that would 48:19 mean you would have a diagonal information matrix, 48:24 but they might not be. 48:25 And so there's this weird geometry that shows up. 48:28 There is actually an entire field, 48:31 admittedly a bit dormant these days, 48:34 that's called information geometry, 48:36 and it's really doing differential geometry 48:39 on spaces that are defined by Fisher information matrices. 48:44 And so you can do some pretty hardcore-- 48:46 something that I certainly cannot do-- 48:50 differential geometry , just by playing around with statistical 48:53 models and trying to understand with the geometry of those 48:55 models are. 48:56 What does it mean for two points to be 48:58 close in some curved space? 49:01 So that's basically the idea. 49:02 So this thing is basically curving your space. 49:06 So again, I always feel satisfied 49:10 when my estimator on my test does not 49:12 involve just computing an average 49:14 and checking if it's big or not. 49:16 And that's not what we're doing here. 49:18 We know that this theta hat mle can be complicated-- 49:23 CF problem set, too, I believe. 49:26 And we know that this Fisher information matrix can also 49:29 be pretty complicated. 49:30 So here, your test is not going to be trivial at all, 49:33 and that requires understanding the mathematics behind it. 49:37 I mean, it all built upon this theorem 49:40 that I just erased, I believe, which 49:43 was that this guy here inside this norm 49:45 was actually converging to some standard Gaussian. 49:47 49:52 So there's another test that you can actually use. 49:55 So Wald's test is one option, and there's another option. 50:00 And just like maximum likelihood estimation and method 50:05 of moments would sometimes agree and sometimes disagree, 50:09 those guys are going to sometimes agree and sometimes 50:12 disagree. 50:13 And this test is called the likelihood ratio test. 50:17 So let's parse those words-- 50:21 "likelihood," "ratio," "test." 50:25 So at some point, I'm going to have 50:26 to take the likelihood of something divided 50:29 by the likelihood of some other thing and then work with this. 50:33 And this test is just saying the following. 50:36 Here's the simplest principle you can think of. 50:39 50:44 You're going to have to understand 50:45 the notion of likelihood in the context of statistics. 50:51 You just have to understand the meaning of the word 50:53 "likelihood." 50:54 This test is just saying if I want to test h0, 51:03 theta is equal to theta 0, versus theta is equal to theta 51:07 1, all I have to look at is whether theta 0 is more or less 51:13 likely than theta 1. 51:14 And I have an exact number that spits out. 51:18 Given a theta 0 or a theta 1 and given data, 51:24 I can put in this function called the likelihood, 51:26 and they tell me exactly how likely those things are. 51:31 And so all I have to check is whether one 51:33 is more likely than the other, and so what I can do 51:35 is form the likelihood of theta, say, 51:41 1 divided by the likelihood of theta 0 51:50 and check if this thing is larger than 1. 51:52 That would mean that this guy is more likely than that guy. 
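A minimal sketch of this naive likelihood ratio for two simple hypotheses about a Bernoulli parameter, assuming NumPy and SciPy; theta 0, theta 1, and the data below are made up, and c = 1 is the uncalibrated threshold discussed next.

```python
# Sketch, assuming NumPy/SciPy: is theta_1 more likely than theta_0 given the data?
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
theta_0, theta_1 = 0.5, 0.6
x = rng.binomial(1, 0.6, size=100)                 # toy data

def likelihood(theta, x):
    return np.prod(stats.bernoulli.pmf(x, theta))

ratio = likelihood(theta_1, x) / likelihood(theta_0, x)
print(ratio, ratio > 1)                            # reject H0 when the ratio exceeds c = 1
```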
51:57 That's a natural way to proceed. 52:00 Now, there's one caveat here, which 52:03 is that when I do hypothesis testing 52:05 and I have this asymmetry between h0 and h1, 52:10 I still need to be able to control what 52:13 my probably of type I error is. 52:15 And here I basically have no knob. 52:19 This is something if you give me data in theta 0 52:21 and theta 1 I can compute to you and spit out the yes/no answer. 52:24 But I have no way of controlling the type II and type I error, 52:29 so what we do is that we replace this 1 by some number c. 52:33 And then we calibrate c in such a way 52:35 that the type I error is exactly at level alpha. 52:37 52:40 So for example, if I want to make sure 52:44 that my type I error is always 0, all I have to do 52:50 is to say that this guy is actually never 52:52 more likely than that guy, meaning never reject. 52:55 And so if I let c go to infinity, 52:57 then this is actually going to make 52:59 my type I error go to zero. 53:02 But if I let c go to negative infinity, 53:05 then I'm always going to conclude 53:12 that h1 is the right one. 53:14 So I have this straight off, and I 53:16 can turn this knob by changing the values of c 53:19 and get different results. 53:22 And I'm going to be interested in the one that maximizes 53:25 my chances of rejecting the null hypothesis while staying 53:29 under my alpha budget of type I error. 53:33 So this is nice when I have two very simple hypotheses, 53:37 but to be fair, we've actually not seen 53:40 any tests that correspond to real-life example. 53:45 Where theta 0 was of the form am I equal to, say, 0.5 53:49 or am I equal to 0.41, we actually 53:51 sort of suspected that if somebody 53:53 asked you to perform this test, they've 53:54 sort of seen the data before and they're sort of cheating. 53:57 So it's typically something am I equal to 0.5 54:00 or not equal to 0.5 or am I equal to 0.5 54:02 or larger than 0.5. 54:03 But it's very rare that you actually get only two points 54:06 to test-- 54:07 am I this guy or that guy? 54:09 Now, I could go on. 54:11 There's actually a nice mathematical theory, 54:13 something called the Neyman-Pearson lemma 54:15 that actually tells me that this test, the likelihood ratio 54:18 test, is the test, given the constraint of type I error, 54:22 that will have the smallest type II error. 54:25 So this is the ultimate test. 54:27 No one should ever use anything different. 54:29 And we could go on and do this, but in a way, 54:32 it's completely irrelevant to practice because you will never 54:35 encounter such tests. 54:37 And I actually find students that they took my class 54:41 as sophomores and then they're still around a couple of years 54:44 later and they're doing, and they're like, 54:46 I have this testing problem and I want to use likelihood ratio 54:50 test, the Neyman-Pearson one, but I just can't because it 54:54 just never occurs. 54:56 This just does not happen. 54:57 So here, rather than going into details, 54:59 let's just look at what building on this principle 55:02 we can actually make a test that will work. 55:05 So now, for simplicity, I'm going 55:08 to assume that my alternatives-- so now, I still 55:11 have a d dimensional vector theta. 55:16 And what I'm going to assume is that the null hypothesis 55:20 is actually only testing if the last coefficients from r 55:26 plus 1 to d are fixed numbers. 55:31 So in this example, where I have theta was equal-- 55:35 so if I have d equals 3, here's an example. 
55:38 55:42 h0 is theta 2 equals 1, and theta 3 equals 2. 55:53 That's my h0, but I say I don't actually 55:56 care about what theta 1 is going to be. 55:58 56:02 So that's my null hypothesis. 56:04 I'm not going to specify right now what the alternative is. 56:07 That's what the null is. 56:08 And in particular, this null is actually not of this form. 56:13 It's not restricting it to one point. 56:15 It's actually restricting it to an infinite amount of points. 56:18 Those are all the vectors of the form theta 1 1, 56:22 2 for all theta 1 in, say, r. 56:29 That's a lot of vectors, and so it's certainly 56:31 not like it's equal to one specific vector. 56:34 56:36 So now, what I'm going to do is I'm actually 56:39 going to look at the maximum likelihood estimator, 56:43 and I'm going to say, well, the maximum likelihood estimator, 56:45 regardless of anything, is going to be close to. reality. 56:50 Now, if you actually tell me ahead of time 56:53 that the true parameter is of this form, 56:56 I'm not going to maximize over all three coordinates of theta. 56:59 I'm just going to say, well, I might as well just 57:01 set the second one to 1, the third one to 2, 57:06 and just optimize for this guy. 57:09 So effectively, you can say if you're telling me 57:11 that this is the reality, I can compute 57:14 a constrained maximum likelihood estimator 57:17 which is constrained to look like what you think reality is. 57:21 So this is what the maximum likelihood estimator is. 57:24 That's the one that's maximizing, say, 57:26 here the log likelihood over the entire space of candidate 57:30 vectors, of candidate parameters. 57:32 But this partial one, this is the constraint mle. 57:36 That's the one that's actually not maximizing our real thetas, 57:38 but only over the thetas that are plausible 57:41 under the null hypothesis. 57:44 So in particular, if I look at ln of this constraint thing 57:52 theta hat n c compared to ln, theta hat-- 57:59 let's say n mle, so we know which one-- 58:04 which one is bigger? 58:05 58:13 The first one is bigger. 58:15 So why? 58:17 AUDIENCE: [INAUDIBLE]. 58:18 58:20 PHILIPPE RIGOLLET: So the second one 58:22 is maximized over a larger space. 58:25 Right. 58:25 So I have this all of theta, which 58:28 are all the parameters I can take, 58:30 and let's say theta 0 is this guy. 58:32 I'm maximizing a function over all these things. 58:35 So if the true maximum is this here, 58:38 then the two things are equal, but if the maximum 58:41 is on this side, then the one on the right 58:43 is actually going to be larger. 58:45 They're maximizing over a bigger space, 58:48 so this guy has to be less than this guy. 58:51 So maybe it's not easy to see. 58:53 So let's say that this is theta and this is theta 0 59:01 and now I have a function. 59:04 The maximum over theta 0 is this guy here, 59:09 but the maximum over the entire space is here. 59:12 59:15 So the maximum over a larger space 59:17 has to be larger than the maximum over a smaller space. 59:20 It can be equal, but the one in the bigger space 59:26 can be even bigger. 59:28 However, if my true theta actually 59:33 did belong to theta 0-- 59:35 if h0 was true-- 59:38 what would happen? 59:39 Well, if theta 0 is true, then theta isn't theta 0, 59:45 and since the maximum likelihood should be close to theta, 59:49 it should be the case that those two things should 59:51 be pretty similar. 
59:52 I should be in a case not in this kind of thing, 59:56 but more in this kind of position, 59:58 where the true maximum is actually attained at theta 0. 60:00 And in this case, they're actually 60:02 of the same size, those two things. 60:05 If it's not true, then I'm going to see a discrepancy 60:08 between the two guys. 60:09 60:12 So my test is going to be built on this intuition 60:15 that if h0 is true, the values of the likelihood at theta hat 60:20 mle and at the constrained mle should be pretty much the same. 60:24 But if theta hat-- 60:25 if it's not true, then the likelihood of the mle 60:29 should be much larger than the likelihood 60:33 of the constrained mle. 60:34 60:37 And this is exactly what this test is doing. 60:40 So that's the likelihood ratio test. 60:42 So rather than looking at the ratio of the likelihoods, 60:46 we look at the difference of the log likelihoods, which 60:48 is really the same thing. 60:51 And there is some weird normalization factor, too, 60:54 that shows up here. 60:55 61:04 And this is what we get. 61:06 So if I look at the likelihood ratio test, 61:18 so it's looking at two times ln of theta hat mle 61:25 minus ln of theta hat mle constrained. 61:32 And this is actually the test statistic. 61:34 So we've actually decided that this statistic is what? 61:39 61:42 It's non-negative, right? 61:44 We've also decided that it should 61:45 be close to zero if h0 is true and of course 61:49 then maybe far from zero if h0 is not true. 61:52 So what should be the natural test based on Tn? 62:00 Let me just check that it's-- 62:03 well, it's already there. 62:05 So the natural test is something that looks like indicator 62:08 that Tn is larger than c. 62:12 And you should say, well, again? 62:13 I mean, we just did that. 62:15 I mean, it is basically the same thing that we just did. 62:19 Agreed? 62:20 But the Tn now is different. 62:22 The Tn is the difference of log likelihoods, 62:24 whereas before the Tn was this theta hat minus theta 62:29 not transpose, times the Fisher information matrix, times theta 62:35 hat minus theta not. 62:37 And this, there's no reason why this guy 62:39 should be of the same form. 62:41 Now, if I have a Gaussian model, you 62:43 can check that those two things are actually exactly the same. 62:45 62:49 But otherwise, they don't have any reason to be. 62:52 And now, what's happening is that 62:54 under some technical conditions-- 62:57 if h0 is true, now what happens is 62:59 that if I want to calibrate c, what I need to do 63:02 is to look at what is the c such that this guy is 63:08 equal to alpha? 63:10 And that's for the distribution of T under the null. 63:15 63:20 But there's not only one. 63:22 The null hypothesis here was actually 63:26 just a family of things. 63:28 It was not just one vector. 63:29 It was an entire family of vectors, 63:31 just like in this example. 63:33 So if I want my type I error to be constrained 63:35 over the entire space, what I need to make sure of 63:39 is that the maximum over all theta in theta not 63:44 is actually equal to alpha. 63:45 63:53 Agreed? 63:53 Yeah? 63:54 AUDIENCE: [INAUDIBLE]. 63:55 63:59 PHILIPPE RIGOLLET: So not equal. 64:04 In this case, it's going to be not equal. 64:06 I mean, it can really be anything you want. 64:08 It's just you're going to have a different type II error. 64:12 I guess here we're sort of stuck in a corner. 64:15 We built this T, and it has to be small under the null. 64:18 And whatever not the null is, we just 64:21 hope that it's going to be large.
64:22 64:25 So even if I tell you what the alternative is, 64:27 you're not going to change anything about the procedure. 64:31 So here, q alpha-- so what I need to know 64:33 is that if h0 is true, then Tn in this case 64:37 actually converges to some chi-square distribution. 64:41 And now here, the number of degrees of freedom 64:44 is kind of weird. 64:45 64:58 But actually, what it should tell you is, oh, finally, I 65:02 know why you call this parameter degrees of freedom 65:05 rather than dimension or just d parameter. 65:08 It's because here what we did is we actually pinned down 65:13 everything, but r-- 65:19 sorry, we pinned down everything but r 65:23 coordinates of this thing. 65:24 65:26 And so now I'm actually wondering why-- 65:30 65:34 did I make a mistake here? 65:36 65:40 I think this should be chi square 65:41 with r degrees of freedom. 65:43 65:46 Let me check and send you an update about this, 65:48 because the number of degrees of freedom, 65:53 if you talk to normal people they will tell you 65:55 that here the number of degrees of freedom is r. 65:59 This is what's allowed to move, and that's 66:01 what's called degrees of freedom. 66:03 The rest is pinned down to being something. 66:06 So here, this chi-square should be a chi-squared r. 66:10 And that's something you just have to believe me. 66:12 Anybody guess what theorem is going to tell me this? 66:15 66:19 In some cases, it's going to be Cochran's theorem-- 66:21 just something that tells me that thing's [INAUDIBLE]. 66:23 Now, here, I use the very specific form 66:27 of the null hypothesis. 66:29 And so for those of you who are sort 66:31 of familiar with linear algebra, what I did here is h0 66:35 consists in saying that theta belongs 66:39 to an r dimensional linear space. 66:43 It's actually here, the r dimensional linear space 66:45 of vectors, that have the first r coordinates that can move 66:49 and the last coordinates that are fixed to some number. 66:54 Actually, it's an affine space because it doesn't necessarily 66:57 go through zero. 66:58 And so I have this affine space that 67:00 has dimension r, and if I were to constrain it to any other r 67:05 dimensional space, that would be exactly the same thing. 67:08 And so to do that, essentially what you need to do is to say, 67:10 if I take any matrix that's, say, invertible-- let's call it u-- 67:15 and then so h0 is going to be something like of the form u 67:21 times theta and now I look only at the coordinates r plus 1 to d, 67:33 then I want to fix those guys to some numbers. 67:35 I don't want to call them theta, so let's call them tau. 67:39 So it's going to be tau r plus 1, all the way to tau d. 67:44 So this is not part of the requirements, 67:47 but just so you know, it's really not a matter 67:50 of keeping only some coordinates. 67:51 Really, what matters is the dimension 67:54 in the sense of linear subspaces of the problem, 67:56 and that's what determines what your degrees of freedom are. 67:59 68:03 So now that we know what the asymptotic distribution is 68:06 under the null, then we know basically 68:10 which table we need to pick our q alpha from. 68:17 And here, again, the table is a chi-squared table, 68:20 but here, the number of degrees of freedom 68:22 is this weird d minus r degrees of freedom thing. 68:26 68:29 I just said it was r. 68:31 68:34 I'm just checking, actually, if I'm-- 68:36 68:41 it's r. 68:42 It's definitely r. 68:42 68:51 So here we've made tests.
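A hedged numerical check in a toy model where everything is explicit, assuming NumPy and SciPy: X_i drawn from N(theta, I_3) with d = 3, and h0 pins theta 2 = 1 and theta 3 = 2 while theta 1 is free, so r = 1. For this model the statistic 2(l_n(MLE) minus l_n(constrained MLE)) reduces to n(X_2 bar minus 1) squared plus n(X_3 bar minus 2) squared, and the simulation comes out consistent with the chi squared with d minus r = 2 degrees of freedom quoted on the slide.

```python
# Sketch, assuming NumPy/SciPy: likelihood ratio statistic in a Gaussian toy model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps = 50, 20_000
theta_true = np.array([0.3, 1.0, 2.0])            # H0 holds: last two coordinates pinned
x = rng.standard_normal((reps, n, 3)) + theta_true
xbar = x.mean(axis=1)

# 2 * (log-lik at unconstrained MLE - log-lik at constrained MLE), simplified for N(theta, I_3):
t_n = n * (xbar[:, 1] - 1.0) ** 2 + n * (xbar[:, 2] - 2.0) ** 2
print(t_n.mean())                                              # ~ 2 = d - r
print(stats.kstest(t_n, stats.chi2(df=2).cdf).statistic)       # small: consistent with chi2_2
```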
68:54 We're testing if our parameter theta was explicitly 68:57 in some set or not. 69:00 By explicitly, I mean we're saying, is theta like this 69:03 or is theta not like this? 69:04 Is theta equal to theta naught, or is theta 69:06 not equal to theta naught? 69:07 Are the last coordinates of theta 69:10 equal to those fixed numbers, or are they not? 69:12 Those were things I was stating directly about theta. 69:15 But there's going to be some instances where you actually 69:17 want to test something about a function of theta, 69:21 not theta itself. 69:22 For example, is the difference between the first coordinate 69:27 of theta and the second coordinate of theta positive? 69:30 That's definitely something you might want to test, 69:32 because maybe theta 1 is-- 69:37 let me try to think of some good example. 69:39 69:44 I don't know. 69:45 Maybe theta 1 is your drawing accuracy with the right hand 69:49 and theta 2 is the drawing accuracy with the left hand, 69:52 and I'm actually collecting data on young children 69:56 to be able to test early on whether they're 69:58 going to be left-handed or right-handed, for example. 70:01 And so I want to just compare those two with respect 70:04 to each other, but I don't necessarily 70:06 need to know what the absolute scores for these handwriting 70:10 skills are. 70:12 So sometimes it's just interesting to look 70:14 at the difference of things or maybe the sum, 70:17 say the combined effect. 70:18 Maybe these are my two measurements of blood pressure, 70:22 and I just want to talk about the average blood pressure. 70:25 And so I can make a linear combination of those two, 70:28 and so those things implicitly depend on theta. 70:30 And so I can generically encapsulate them 70:36 in some test of the form g of theta is equal to 0 70:39 versus g of theta is not equal to 0. 70:42 And sometimes, in the first test that we saw, g of theta 70:46 was just the identity, or maybe the identity minus 0.5. 70:53 If g of theta is theta minus 0.5, 70:55 that's exactly what we've been testing. 70:57 If g of theta is theta minus 0.5 and theta 71:01 is p, the parameter of a coin, this is exactly of this form. 71:06 So this is a simple one, but then there's 71:08 more complicated ones we can think of, as sketched below. 71:11 71:14 So how can I do this? 71:20 Well, let's just follow a recipe. 71:22 71:24 So we traced back. 71:26 We were trying to build a test statistic which was pivotal. 71:31 We wanted to have this thing that 71:33 had nothing that depended on the parameter, 71:37 and the only thing we had for that, 71:39 the one we built our chi-square test on, 71:41 is basically some form of central limit theorem. 71:44 Maybe it's for the maximum likelihood estimator. 71:46 Maybe it's for the average, but it's basically 71:48 some form of asymptotic normality of the estimator. 71:52 And that's what we started from every single time. 71:55 So let's assume that I have this, 71:58 and I'm going to talk very abstractly. 72:00 Let's assume that I start with an estimator. 72:03 Doesn't have to be the mle. 72:04 It doesn't have to be the average, 72:06 but it's just something. 72:08 And I know that I have the estimator such that this guy 72:11 converges in distribution to some N(0, Sigma of theta), 72:15 for some covariance matrix Sigma of theta. 72:17 Maybe it's not the Fisher information. 72:20 Maybe that's something that's not as good as the mle, 72:23 meaning that this is going to give me 72:25 less information than the Fisher information, less accuracy.
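Just to ground the g(theta) = 0 reformulation, here is how the spoken examples fit the template; the particular functions below are my paraphrase of the examples in the lecture, each mapping into R, so k = 1.

```python
import numpy as np

# Each hypothesis about theta is rephrased as a statement about g(theta).

def g_difference(theta):            # right-hand vs left-hand accuracy: theta_1 - theta_2
    return theta[0] - theta[1]

def g_average(theta):               # combined effect, e.g. average of two blood pressures
    return 0.5 * (theta[0] + theta[1])

def g_coin(p):                      # coin example: H0 says p = 1/2, i.e. g(p) = p - 0.5 = 0
    return p - 0.5

theta = np.array([0.7, 0.4])
print(g_difference(theta), g_average(theta), g_coin(0.5))
```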
72:29 And now I can actually just say, OK, if I know this about theta, 72:34 I can apply the multivariate delta method, which tells me 72:43 that square root of n, g of theta hat, minus g of theta 72:50 goes in distribution to some N(0, ...). 72:56 And then the price to pay in one dimension 72:58 was multiplying by the square of the derivative, 73:01 and we know that in higher dimensions it's pre-multiplying 73:03 by the gradient and post-multiplying 73:05 by the gradient. 73:06 So I'm going to write nabla g of theta transpose, sigma, 73:14 nabla g of theta-- sorry, not delta; nabla-- 73:15 so, gradient. 73:19 And here, I assume that g takes values in R k. 73:25 That's what's written here. g goes from R d to R k, 73:28 but think of k as being 1 for now. 73:30 So the gradient is really just a vector and not a matrix. 73:33 That's your usual gradient for real-valued functions. 73:40 So effectively, if g takes values in dimension 1, 73:45 what is the size of this matrix? 73:47 73:58 I only ask trivial questions. 73:59 Remember, that's rule number one. 74:02 It's one by one, right? 74:04 And you can check it, because on this side 74:06 those are just differences between numbers. 74:08 And it would be kind of weird if they had 74:10 a covariance matrix at the end. 74:11 I mean, this is a random variable, not a random vector. 74:15 So I know that this thing happens. 74:17 And now, if I basically divide by the square root 74:21 of this thing-- 74:22 74:30 so on the board I'm working with k equal to 1-- divided by the square 74:35 root of nabla g of theta transpose, sigma, 74:41 nabla g of theta-- 74:43 74:45 then this thing should go to some standard normal random 74:51 variable, standard normal distribution. 74:56 I just divided by the square root of the variance here, 74:59 which is the usual thing. 75:01 Now, if you do not have a univariate thing, 75:05 you do the same thing we did before, 75:07 which is pre-multiplying by the covariance matrix 75:11 to the negative 1/2-- 75:12 so before this role was played by the inverse Fisher 75:16 information matrix. 75:18 That's why we ended up having I of theta to the 1/2, 75:22 and now we just have this gamma, which is just this function 75:25 that I wrote up there. 75:26 That could potentially be k by k if g takes values in R k. 75:31 Yes? 75:32 AUDIENCE: [INAUDIBLE]. 75:35 PHILIPPE RIGOLLET: Yeah, the gradient of a vector 75:37 is just the vector with all the derivatives with respect 75:41 to each component, yes. 75:42 75:45 So you know the word for the vector of derivatives, but not 75:48 the word gradient? 75:49 I mean, which word do you use in one dimension? 75:54 Yes, derivative in one dimension. 75:57 76:01 Now, of course, here, you notice there's something-- 76:03 I actually have a little caveat here. 76:06 I want this to have rank k. 76:08 I want this to be invertible. 76:10 I want this matrix to be invertible. 76:11 Even for the Fisher information matrix, 76:13 I sort of need it to be invertible. 76:15 Even for the original theorem, that 76:16 was part of my technical conditions, 76:18 just so that I could actually write the Fisher information matrix 76:21 inverse. 76:22 And so here, you can make your life easy and just assume 76:26 that it's true all the time, because I'm actually writing 76:28 in a fairly abstract way. 76:29 But in practice, we're going to have 76:31 to check whether this is going to be 76:33 true for specific distributions.
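A small sketch of the delta-method bookkeeping in the k = 1 case; the covariance matrix Sigma, the function g, and the numbers here are placeholders of mine, only to show where the gradient enters.

```python
import numpy as np

# Suppose sqrt(n) * (theta_hat - theta) -> N(0, Sigma) in R^2, and g: R^2 -> R (so k = 1).
# Delta method: sqrt(n) * (g(theta_hat) - g(theta)) -> N(0, grad_g(theta)' Sigma grad_g(theta)).

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])           # assumed asymptotic covariance of theta_hat

def g(theta):
    return theta[0] - theta[1]           # example: difference of the two coordinates

def grad_g(theta):
    return np.array([1.0, -1.0])         # gradient of g (constant in this example)

theta_hat = np.array([0.9, 0.6])         # some estimate
v = grad_g(theta_hat) @ Sigma @ grad_g(theta_hat)   # the "one by one matrix": just a number
print("asymptotic variance of g(theta_hat):", v)    # 2 - 0.5 - 0.5 + 1 = 2.0
```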
76:35 And we will see an example towards the end 76:37 of the chapter, the multinomial, where 76:39 it's actually not the case that the Fisher information 76:42 matrix exists. 76:43 76:46 The asymptotic covariance matrix is not invertible, 76:49 so it's not the inverse of a Fisher information matrix. 76:52 Because to be the inverse of someone, 76:54 you need to be invertible yourself. 76:55 76:58 And so now what I can do is apply Slutsky. 77:01 So here, what I needed to have is theta, the true theta. 77:06 So what I can do is just put some theta hat in there, 77:10 and so that's the gamma of theta hat that I see there. 77:16 And if h0 is true, then g of theta is equal to 0. 77:19 That's what we assume. 77:20 That was our h0: under h0, g of theta is equal to 0. 77:25 So the number I need to plug in here-- 77:29 I don't need to replace theta here. 77:31 What I need to replace here is 0. 77:33 77:36 Now let's go back to what you were saying. 77:38 Here you could say, let me try to replace 0 here, 77:41 but there is no such thing. 77:42 There is no g here. 77:43 It's only the gradient of g. 77:45 So this thing that says replace theta by theta 0 77:50 wherever you see it could not work here. 77:53 If g was invertible, I could just 77:57 say that theta is equal to g inverse of 0 under the null 78:02 and then I could plug in that value. 78:05 But in general, it doesn't have to be invertible. 78:08 And it might be a pain to invert g, even. 78:11 I mean, it's not clear how you can 78:13 invert all functions like that. 78:15 And so here you just go with Slutsky, and you say, 78:17 OK, I'm just going to put theta hat in there. 78:20 But this guy, I know I need to check whether it's 0 or not. 78:24 Same recipe we did for theta, except we do it for g of theta 78:27 now. 78:28 78:30 And now I have my asymptotic thing. 78:34 I know this is a pivotal distribution. 78:36 This might be a vector. 78:38 So rather than looking at the vector itself, 78:41 I'm going to actually look at the norm-- 78:43 rather than looking at the vectors, 78:44 I'm going to look at their square norm. 78:46 That gives me a chi square, and I 78:47 reject when my test statistic, which is the norm squared, 78:51 exceeds the quantile of a chi square-- 78:53 same as before, just do it on your own. 78:56 Before we part ways, I wanted to just mention one thing, which 79:00 is, look at this thing. 79:02 If g was of dimension 1, the Euclidean norm in dimension 1 79:08 is just the absolute value of the number, right? 79:10 79:13 Which means that when I am actually computing this, 79:19 I'm looking at the square, so it's the square of something. 79:22 So it means that this is the square of a Gaussian. 79:25 And it's true that, indeed, the chi 79:26 squared 1 is just the square of a Gaussian. 79:28 79:31 Sure, this is a tautology, but let's look at this test now. 79:36 This test was built using Wald's theory and some pretty heavy 79:40 stuff. 79:42 But now if I start looking at Tn and I think of it 79:44 as being just the absolute value of this quantity over there, 79:47 squared, what I'm really doing is 79:50 I'm looking at whether the square of some Gaussian 79:54 exceeds the quantile of a chi squared with 1 degree of freedom, 80:00 which means that this thing is actually equivalent-- 80:02 completely equivalent-- to the test.
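Putting the recipe together as just described: plug theta hat into the gradient and the covariance (the Slutsky step), replace g(theta) by its value 0 under H0, and compare the squared, standardized statistic to a chi-squared quantile. This is a sketch in a toy setting of my own; Sigma, g, and the simulated sample are not the lecture's.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Toy setting (my choice): theta_hat is a sample mean, so sqrt(n) (theta_hat - theta) -> N(0, Sigma).
# Test H0: g(theta) = theta_1 - theta_2 = 0, so k = 1.
n = 500
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
theta_true = np.array([1.0, 1.0])                        # H0 holds here
X = rng.multivariate_normal(theta_true, Sigma, size=n)
theta_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)                      # plug-in for Sigma (Slutsky step)

def g(theta):
    return np.array([theta[0] - theta[1]])               # values in R^k, k = 1

def grad_g(theta):
    return np.array([[1.0, -1.0]])                       # k x d Jacobian

Gamma_hat = grad_g(theta_hat) @ Sigma_hat @ grad_g(theta_hat).T   # k x k, must be invertible
# Squared norm of Gamma_hat^{-1/2} * sqrt(n) * g(theta_hat), i.e. n * g' Gamma_hat^{-1} g:
Tn = float(n * g(theta_hat) @ np.linalg.inv(Gamma_hat) @ g(theta_hat))

alpha = 0.05
print("T_n =", round(Tn, 3), "reject H0:", Tn > chi2.ppf(1 - alpha, df=1))
```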
80:04 So if k is equal to 1, this is completely 80:10 equivalent to looking at the absolute value of something 80:15 and checking whether it's larger than, say, q over 2-- 80:19 well, than q alpha-- 80:22 well, that's q alpha over 2-- 80:24 so that the probability of this thing 80:26 is actually equal to alpha. 80:27 And that's exactly what we've been doing before. 80:29 When we introduced tests in the first place, 80:31 we just took absolute values and said, well, 80:33 this is the absolute value of a Gaussian in the limit. 80:36 And so it's the same thing. 80:37 So this is actually equivalent to checking 80:40 whether the norm squared-- which is the square 80:44 of some normal-- 80:45 is larger than the q alpha of some chi squared 80:52 with one degree of freedom. 80:53 Those are exactly the same two tests. 80:58 So in one dimension, those things just 81:00 collapse into being one little thing, 81:03 and that's because there's no geometry in one dimension. 81:05 It's just one dimension, whereas if I'm in a higher dimension, 81:08 then things get distorted and things can become weird. 81:12
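A quick numerical check of the equivalence just stated: squaring the two-sided Gaussian threshold q alpha over 2 gives exactly the level-alpha quantile of a chi squared with one degree of freedom.

```python
from scipy.stats import chi2, norm

alpha = 0.05
q_gauss = norm.ppf(1 - alpha / 2)       # two-sided Gaussian threshold, about 1.96
q_chi2 = chi2.ppf(1 - alpha, df=1)      # chi-squared(1) quantile, about 3.84
print(q_gauss ** 2, q_chi2)             # equal: |Z| > q_gauss  <=>  Z^2 > q_chi2
```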