https://www.youtube.com/watch?v=4HRhg4eUiMo&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=8 Transcript 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high-quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:19 PHILIPPE RIGOLLET: We're talking about tests. 00:21 And to be fair, we spend most of our time 00:24 talking about new jargon that we're using. 00:28 The main goal is to take a binary decision, yes or no. 00:31 So just so that we're clear and we make sure that we all 00:36 speak the same language, let me just 00:37 remind you what the key words are for tests. 00:42 So the first thing is that we split theta 00:48 into theta 0 and theta 1. 00:53 Both are included in theta, and they are disjoint. 00:57 01:00 So I have my set of possible parameters. 01:04 And then I have theta 0 is here, theta 1 is here. 01:10 And there might be something that I leave out. 01:14 And so what we're doing is, we have two hypotheses. 01:18 So here's our hypothesis testing problem. 01:20 And it's h0, theta belongs to theta 0, versus h1, theta 01:27 belongs to theta 1. 01:29 This guy was called the null, and this guy 01:32 was called the alternative. 01:36 And why we give them special names 01:37 is because we saw that they have an asymmetric role. 01:41 The null represents the status quo, 01:43 and data is here to bring evidence against this guy. 01:46 And we can really never conclude that h0 is true 01:49 because all we could conclude is that h1 is not true, or may not 01:56 be true. 01:59 So that was the first thing. 02:00 The second thing was the hypotheses. 02:03 The third thing is, what is a test? 02:05 Well, psi, it's a statistic, and it takes the data, 02:17 and it maps it into 0 or 1. 02:21 And I didn't really mention it, but there's something 02:24 called randomized tests, which is, well, 02:27 if I cannot really make a decision, 02:28 I might as well flip a coin. 02:30 That tends to be biased, but that's really-- 02:32 I mean, think about it in practice. 02:34 You probably don't want to make decisions 02:36 based on flipping a coin. 02:37 And so what people typically do-- 02:39 this is happening, typically, at one specific value. 02:41 So rather than flipping a coin for this very specific value, 02:44 what people typically do is they say, 02:45 OK, I'm going to side with h0 because that's the most 02:48 conservative choice I can make. 02:50 So in a way, they think of flipping this coin, 02:52 but always falling on heads, say. 02:55 So associated to this test was something called, well, 02:58 the rejection region r psi, which 03:05 is just the set of data x1 xn such that psi of x1 xn 03:15 is equal to 1. 03:16 So that means we rejected h0 when the test is 1. 03:19 And those are the set of data points 03:21 that actually are going to lead me to reject the null. 03:25 03:28 And then the things that are actually, slightly, 03:30 a little more important and really peculiar to tests, 03:35 specific to tests, were the type I and type II errors. 03:40 So the type I error arises when-- 03:44 so type I error is when you reject, whereas h0 is correct. 04:01 And the type II error is the opposite, 04:06 so it's failing to reject, whereas h1 is correct-- 04:17 h1 is correct, yeah. 04:20 So those are the two types of errors you can make.
04:23 And we quantified the probability of type I error. 04:26 So alpha psi is the probability-- 04:31 so that's the probability of type I error. 04:38 04:41 So alpha psi is just the probability under theta that psi rejects, 04:49 and that's defined for theta in theta 0, 04:54 so for different values of theta in theta 0. 04:56 So h0 being correct means there exists a theta in theta 0 05:00 for which that actually is the right distribution. 05:03 So for different values of theta, 05:05 I might make different errors. 05:07 So if you think, for example, about the coin example, 05:12 I'm testing if the coin is biased towards heads 05:16 or biased towards tails. 05:18 So if I'm testing whether p is larger 05:21 than 1/2 or less than 1/2, then when the true p-- let's 05:25 say our h0 is larger than 1/2. 05:27 When p is equal to 1, it's actually very difficult for me 05:29 to make a mistake, because I only see heads. 05:33 So when p is getting closer to 1/2, 05:35 I'm going to start making more and more probability of error. 05:38 And so the type II error-- so that's the probability of type 05:42 II-- 05:43 is denoted by beta psi. 05:46 And it's the function, well, that does the opposite 05:50 and, this time, is defined for theta in theta 1. 05:58 And finally, we define something called the power, pi of psi. 06:13 And this time, this is actually a number. 06:16 And so this number is equal to the maximum over theta in theta 06:23 0. 06:23 I mean, that could be a supremum, but think of it 06:25 as being a maximum of p theta of psi is equal to 1-- 06:32 sorry, that's theta 0, right? 06:37 Give me one sec. 06:39 No, sorry, that's the min, over theta in theta 1. 06:42 06:46 So this is not making a mistake. 06:48 Theta is in theta 1. So if theta is in theta 1 06:52 and I conclude 1, so this is a good thing. 06:55 I want this number to be large. 06:56 And I'm looking at the worst case-- 06:58 what is the smallest value this number can be? 07:02 So what I want to show you a little bit is a picture. 07:06 07:09 So now I'm going to take theta, and think of it as being a p. 07:12 So I'm going to take p as the parameter in the coin experiment. 07:18 So p can range between 0 and 1, that's for sure. 07:20 07:23 And what I'm going to try to test 07:24 is whether p is less than 1/2 or larger than 1/2. 07:30 So this is going to be, let's say, theta 0. 07:34 And this guy here is theta 1. 07:37 Just trying to give you a picture of what those guys are. 07:40 So I have my y-axis, and now I'm going to start drawing numbers. 07:46 All these things-- this function, 07:48 this function, and this number-- are 07:51 all numbers between 0 and 1. 07:52 07:56 So now I'm claiming that-- 07:59 so when I move from left to right, 08:03 what is my probability of rejecting going to do? 08:08 So what I'm going to plot is the probability under theta. 08:11 The first thing I want to plot is the probability under theta 08:14 that psi is equal to 1. 08:19 And let's say psi-- 08:20 think of psi as being just this indicator 08:25 that square root of n times xn bar minus p over square root of xn 08:35 bar 1 minus xn bar is larger than some constant c, 08:40 for an appropriately chosen c. 08:43 So we choose c in such a way that, at 1/2, 08:48 when we're testing for 1/2, what we 08:50 wanted was this number to be equal to alpha, basically. 08:56 So we fix this alpha number so that this guy-- 09:00 so if I want alpha of psi of theta less than alpha 09:09 given in advance-- 09:12 so think of it as being equal to, say, 5%.
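For readers following along, here is a small simulation sketch of the curve being described, the probability of rejection as a function of the true p, for the one-sided coin test. It is not part of the lecture; the sample size n = 100, the number of Monte Carlo repetitions, and the one-sided 5% threshold 1.645 are all assumptions made for illustration.

```python
# Sketch: estimate p |-> P_p(psi = 1) for the one-sided coin test H0: p <= 1/2 vs H1: p > 1/2.
# On theta 0 this curve is alpha_psi(p); on theta 1 it is 1 - beta_psi(p).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, alpha = 100, 0.05                    # assumed sample size and type I error budget
c = norm.ppf(1 - alpha)                 # ~1.645, chosen so that alpha_psi(1/2) ~ 5%

def psi(x):
    """Reject H0 when the studentized mean exceeds c."""
    xbar = x.mean()
    if xbar == 0.0 or xbar == 1.0:      # degenerate denominator: all tails / all heads
        return xbar == 1.0
    return np.sqrt(n) * (xbar - 0.5) / np.sqrt(xbar * (1 - xbar)) > c

for p in [0.0, 0.3, 0.45, 0.5, 0.55, 0.7, 1.0]:
    freq = np.mean([psi(rng.binomial(1, p, size=n)) for _ in range(2000)])
    label = "alpha_psi(p)" if p <= 0.5 else "1 - beta_psi(p)"
    print(f"p = {p:4.2f}   P_p(psi = 1) ~ {freq:5.3f}   ({label})")
```

The printout should show a curve that starts near 0, stays below roughly 5% on theta 0, sits at about 5% at p = 1/2, and climbs toward 1 on theta 1, which is exactly the white curve drawn on the board.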
09:15 So I'm fixing this number, and I want 09:16 this to be controlled for all theta and theta 0. 09:19 09:23 So if you're going to give me this budget, 09:26 well, I'm actually going to make it equal where I can. 09:29 If you're telling me you can make it equal to alpha, 09:31 we know that if I increase my type I error, 09:34 I'm going to decrease my type II error. 09:36 If I start putting everyone in jail 09:39 or if I start letting everyone go free, 09:41 that's what we were discussing last time. 09:43 So since we have this trade-off and you're 09:45 giving me a budget for one guy, I'm just going to max it out. 09:49 And where am I going to max it out? 09:50 Exactly at 1/2 at the boundary. 09:53 So this is going to be 5%. 09:54 10:00 So what I know is that since alpha of theta 10:03 is less than alpha for all theta in theta 10:06 0-- sorry, that's for theta 0, that's where alpha is defined. 10:12 So for theta and theta 0, I knew that my function 10:14 is going to look like this. 10:15 It's going to be somewhere in this rectangle. 10:18 Everybody agrees? 10:20 So this function for this guy is going to look like this. 10:22 When I'm at 0, when p is equal to 0, 10:25 which means I only observe 0's, then I 10:29 know that p is going to be 0, and I will certainly not 10:32 conclude that p is equal to 1. 10:34 This test will never conclude that p is equal to 1-- 10:39 10:42 that p is larger than 1/2, just because xn bar 10:44 is going to be equal to 0. 10:46 Well, this is actually not well-defined, 10:48 so maybe I need to do something-- put it equal to 0 10:51 if xn bar is equal to 0. 10:52 So I guess, basically, I get something which is negative, 10:55 and so it's never going to be larger than what I want. 10:58 And so here, I'm actually starting at 0. 11:00 So now, this is this function here that increases-- 11:04 I mean, it should increase smoothly. 11:06 This function here is alpha psi of theta-- 11:11 or alpha psi of p, let's say, because we're talking about p. 11:15 Then it reaches alpha here. 11:17 Now, when I go on the other side, 11:19 I'm actually looking at beta. 11:21 When I'm on theta 1, the function that matters 11:23 is the probability of type II error, which is beta of psi. 11:28 And this beta of psi is actually going to increase. 11:30 11:34 So beta of psi is what? 11:35 Well, beta of psi should also-- 11:37 sorry, that's the probability of being equal to alpha. 11:39 So what I'm going to do is I'm going 11:41 to look at the probability of rejecting. 11:43 So let me draw this functional all the way. 11:46 It's going to look like this. 11:48 Now here, if I look at this function here or here, 11:52 this is the probability under theta that psi is equal to 1. 11:57 And we just said that, in this region, 11:59 this function is called alpha of psi. 12:02 In that region, it's not called alpha of psi. 12:06 It's not called anything. 12:08 It's just the probability of rejection. 12:11 So it's not any error, it's actually 12:12 what you should be doing. 12:14 What we're looking at in this region is 1 minus this guy. 12:19 We're looking at the probability of not rejecting. 12:21 So I need to actually, basically, look at the 1 12:23 minus this thing, which here is going to be 95%. 12:27 So I'm going to do 95%. 12:31 12:34 And this is my probability. 12:36 Ability And I'm just basically drawing 12:38 the symmetric of this guy. 12:40 So this here is the probability under theta 12:44 that psi is equal to 0, which is 1 minus p theta 12:50 that psi is equal to 1. 
12:52 So it's just 1 minus the white curve. 12:56 And it's actually, by definition, equal 12:59 to beta of psi of theta. 13:00 13:06 Now, where do I read pi psi? 13:09 13:20 What is pi psi on this picture? 13:22 13:26 Is pi psi a number or a function? 13:28 13:32 AUDIENCE: Number. 13:32 PHILIPPE RIGOLLET: It's a number, right? 13:33 It's the minimum of a function. 13:35 What is this function? 13:36 It's the probability under theta that psi is equal to 1. 13:39 I drew this entire function for theta between theta 0 and theta 1. 13:44 I drew-- this is this entire white curve. 13:46 This is this probability. 13:48 Now I'm saying, look at the smallest value this probability 13:50 can take on the set theta 1. 13:54 What is this? 13:55 14:00 This guy. 14:02 This is where my pi-- 14:03 this thing here is pi psi, and so it's equal to 5%. 14:08 14:11 So that's for this particular test, 14:13 because this test has a continuous curve for this psi. 14:19 And so if I want to make sure that I'm 14:20 at 5% when I come to the right of the theta 0, 14:24 if it touches theta 1, then I'd better 14:26 have 5% on the other side if the function is continuous. 14:30 So basically, if this function is 14:33 increasing, which will be the case for most tests, 14:38 and continuous, then what's going to happen 14:39 is that the level of the test, which is alpha, 14:42 is actually going to be equal to the power of the test. 14:44 14:48 Now, there's something I didn't mention, 14:50 and I'm just mentioning it in passing. 14:52 Here, I defined the power itself. 14:55 This function, this entire white curve here, 14:59 is actually called the power function-- 15:01 15:06 this thing. 15:07 That's the entire white curve. 15:09 And what you could have is tests that 15:12 have the entire curve which is dominated by another test. 15:16 So here, if I look at this test-- 15:18 and let's assume I can build another test that 15:21 has this curve. 15:23 Let's say it's the same here, but then here, it 15:29 looks like this. 15:29 15:34 What is the power of this test? 15:38 AUDIENCE: It's the same. 15:39 PHILIPPE RIGOLLET: It's the same. 15:40 It's 5%, because this point touches here exactly 15:43 at the same point. 15:44 However, for any other value than the worst possible, 15:48 this guy is doing better than this guy. 15:51 Can you see that? 15:52 Having a curve higher on the right-hand side 15:55 is a good thing because it means that you 15:57 tend to reject more when you're actually in h1. 16:03 So this guy is definitely better than this guy. 16:06 And so what we say, in this case, 16:07 is that the test with the dashed line 16:09 is uniformly more powerful than the other test. 16:13 But we're not going to go into those details 16:15 because, basically, all the tests that we will describe 16:18 are already the most powerful ones. 16:22 In particular, this guy is-- 16:24 there's no such thing. 16:25 All the other guys you can come up with 16:26 are going to actually be below. 16:27 16:33 So we saw a couple of tests, then we 16:36 saw how to pick this threshold, and we defined those two 16:40 things. 16:41 AUDIENCE: Question. 16:42 PHILIPPE RIGOLLET: Yes? 16:43 AUDIENCE: But in that case, the dashed line, 16:45 if it were also higher in the region of theta 0, 16:48 do you still consider it better? 16:50 PHILIPPE RIGOLLET: Yeah. 16:51 AUDIENCE: OK. 16:52 PHILIPPE RIGOLLET: Because you're given this budget of 5%.
16:55 So in this paradigm where you're given the-- 16:58 actually, if the dashed line was this dashed line, 17:01 I would still be happy. 17:03 I mean, I don't care what this thing does here, 17:05 as long as it's below 5%. 17:06 But here, I'm going to try to discover. 17:08 Think about, again, the drug discovery example. 17:11 You're trying to find-- let's say you're a scientist 17:14 and you're trying to prove that your drug works. 17:17 What do you want to see? 17:18 Well, FDA puts on you this constraint 17:22 that your probability of type I error should never exceed 5%. 17:26 You're going to work under this assumption. 17:28 But what you're going to do is, you're 17:30 going to try to find a test that will make you find something 17:33 as often as possible. 17:35 And so you're going to max this constraint of 5%. 17:38 And then you're going to try to make this curve, that means-- 17:41 this is, basically, this number here, for any point 17:45 here, is the probability that you publish your paper. 17:47 That's the probability that you can 17:50 release to market your drug. 17:51 That's the probability that it works. 17:53 And so you want this curve to be as high as possible. 17:56 You want to make sure that if there's evidence in the data 18:02 that h1 is the truth, you want to squeeze as much 18:05 of this evidence as possible. 18:07 And the test that has the highest possible curve 18:09 is the most powerful one. 18:11 Now, you have to also understand that having two curves that 18:15 are on top of each other completely, everywhere, 18:19 is a rare phenomenon. 18:22 It's not always the case that there 18:24 is a test that's uniformly more powerful than any other test. 18:27 It might be that you have some trade-off, 18:29 that it might be better here, but then you're 18:31 losing power here. 18:32 Maybe it's-- I mean, things like this. 18:33 Well, actually, maybe it should not go down. 18:35 But let's say it goes like this, and then, maybe, this guy 18:37 goes like this. 18:39 Then you have to, basically, make an educated guess 18:43 whether you think that the theta you're going to find is here 18:46 or is here, and then you pick your test. 18:47 18:51 Any other question? 18:51 Yes? 18:52 AUDIENCE: Can you explain the green curve again? 18:53 That's just the type II error? 18:55 PHILIPPE RIGOLLET: So the green curve is-- exactly. 18:57 So that's beta psi of theta. 18:58 So it's really the type II error. 19:00 And it's defined only here. 19:02 So here, it's not a definition, it's 19:05 really I'm just mapping it to this point. 19:08 So it's defined only here, and it's the probability 19:10 of type II error. 19:11 19:15 So here, it's pretty large. 19:17 I'm making it, basically, as large 19:19 as I could because I'm at the boundary, 19:22 and that means, at the boundary, since the status quo is h0, 19:26 I'm always going to go for h0 if I 19:29 don't have any evidence, which means that what's going to pay 19:31 is the type II error that's going to basically pay this. 19:34 19:38 Any other question? 19:38 19:41 So let's move on. 19:43 So did we do this? 19:47 No, I think we stopped here, right? 19:50 I didn't cover that part. 19:53 So as I said, in this paradigm, we're 19:55 going to actually fix this guy to be something. 19:58 And this thing is actually called the level of the test. 20:01 I'm sorry, this is, again, more words. 20:03 Actually, the good news is that we split it into two lectures. 20:06 So we have, what is a test? 20:09 What is a hypothesis? 20:11 What is the null? 
20:11 What is the alternative? 20:14 What is the type I error? 20:15 What is the type II error? 20:16 And now, I'm telling you there's another thing. 20:18 So we define the power, which was some sort of a lower bound 20:22 on the-- 20:24 or it's 1 minus the upper bound on the type II 20:26 error, basically. 20:28 And so in the alternative-- so the power 20:32 is the smallest probability of rejecting 20:34 when you're in the alternative, 20:36 when you're in theta 1, so that's my power. 20:41 I looked here, and I looked at the smallest value. 20:43 And I can look at this side and say, well, 20:45 what is the largest probability that I make a type I error? 20:48 Again, this largest probability is the level of the test. 20:51 20:58 So this is alpha equal, by definition, 21:03 to the maximum for theta in theta 0 of alpha psi of theta. 21:15 So here, I just put the level itself. 21:18 As you can see, here, it essentially says 21:20 that if I'm of level of 5%, I'm also of level 10%, 21:23 I'm also of level 15%. 21:25 So here, it's really an upper bound. 21:27 Whatever you guys want to take, this is what it is. 21:29 But as we said, if this number is 4.5%, 21:34 you're losing in your type II error. 21:36 So if you're allowed to have-- 21:38 if this maximum here is 4.5% and FDA told you you can go to 5%, 21:43 you're losing in your type II error. 21:44 So you actually want to make sure 21:46 that this is the 5% that's given to you. 21:48 So the way it works is that you give me the alpha, 21:51 then I'm going to go back, pick c that depends on alpha here, 21:56 so that this thing is actually equal to 5%. 21:58 22:01 And so of course, in many instances, 22:04 we do not know the probability. 22:06 We do not know how to compute the probability of type I 22:09 error. 22:10 This is a maximum value for the probability of type I error. 22:12 We don't know how to compute it. 22:13 I mean, it might be a very complicated random variable. 22:15 Maybe it's a weird binomial. 22:17 We could compute it, but it would be painful. 22:19 But what we know how to compute is its asymptotic value. 22:21 Just because of the central limit theorem, convergence 22:24 in distribution tells me that the probability of type I error 22:28 is basically going towards the probability 22:30 that some Gaussian is in some region. 22:33 And so we're going to compute, not the level itself, 22:36 but the asymptotic level. 22:37 22:43 And that's basically the limit as n 22:48 goes to infinity of alpha psi of theta. 22:56 And then I'm going to take the max here. 22:58 23:06 So how am I going to compute this? 23:08 Well, if I take a test that has rejection region of the form 23:13 tn-- 23:14 because it depends on the data, that's tn of x1 xn-- 23:17 my observations-- larger than some number c. 23:23 Of course, I can almost always write 23:26 tests like that, except that sometimes, 23:28 it's going to be an absolute value, which essentially means 23:30 I'm going away from some value. 23:32 Maybe, actually, I'm less than something, 23:34 but I can always put a negative sign in front of everything. 23:37 So this is without much loss of generality. 23:39 So this includes something that looks like-- 23:47 23:51 something is larger than the constant, so that means-- 23:56 which is equivalent to-- well, let me write that as tn, 24:02 because then that means that-- 24:05 so that's tn. 24:07 But this actually encompasses the fact 24:10 that qn is larger than c or qn is less than minus c. 24:21 So that includes this guy.
24:22 That also includes qn less than c, 24:26 because this is equivalent to qn is larger than minus c. 24:32 And minus qn is-- 24:33 and so that's going to be my tn. 24:35 24:37 So I can actually encode several type of things-- 24:42 rejection regions. 24:44 So here, in this case, I have a rejection region 24:47 that looks like this, or a rejection region 24:50 that looks like this, or a rejection 24:53 region that looks like this. 24:54 24:57 And here, I don't really represent it 24:58 for the whole data, but maybe for the average, for example, 25:02 or the normalized average. 25:04 25:17 So if I write this, then-- 25:23 yeah. 25:25 And in this case, this tn that shows up 25:32 is called test statistic. 25:35 25:41 I mean, this is not set in stone. 25:43 Here, for example, q could be the test statistic. 25:46 It doesn't have to be minus q itself 25:48 that's the test statistic. 25:50 So what is the test statistic? 25:52 Well, it's what you're going to build from your data 25:55 and then compare to some fixed value. 25:57 So in the example we had here, what is our test statistic? 26:01 Well, it's this guy. 26:02 26:05 This was our test statistic. 26:09 And is this thing a statistic? 26:12 What are the criteria for a statistic? 26:14 What is the statistic? 26:15 26:21 I know you know the answer. 26:23 AUDIENCE: Measurable function. 26:25 PHILIPPE RIGOLLET: Yeah, it's a measurable function 26:26 of the data that does not depend on the parameter. 26:29 Is this guy a statistic? 26:32 AUDIENCE: It's not. 26:33 26:35 PHILIPPE RIGOLLET: Let's think again. 26:37 26:40 When I implemented the test, what did I do? 26:45 I was able to compute my test. 26:47 My test did not depend on some unknown parameter. 26:49 How did we do it? 26:52 We just plugged in 0.5 here, remember? 26:57 That was the value for which we computed it, 26:59 because under h0, that was the value we're seeing. 27:02 And if theta 0 is actually an entire set, 27:05 I'm just going to take the value that's the closest to h1. 27:09 We'll see that in a second. 27:11 I mean, I did not guarantee that to you. 27:13 But just taking the worst type I error and bounded by alpha 27:18 is equivalent to taking p and taking the value of p that's 27:22 the closest to theta 1, which is completely intuitive. 27:26 The worst type I error is going to be attained for the p that's 27:29 the closest to the alternative. 27:32 So even if the null is actually just an entire set, 27:36 it's as if it was just the point that's 27:38 the closest to the alternative. 27:41 So now we can compute this, because there's 27:44 no unknown parameters that shows up. 27:46 We replace p by 0.5. 27:48 And so that was our test statistic. 27:50 27:53 So when you're building a test, you 27:55 want to first build a test statistic, 27:58 and then see what threshold you should be getting. 28:01 So now, let's go back to our example where we want to have-- 28:08 we have x1 xn, their IID [INAUDIBLE] p. 28:16 And I want to test if p is 1/2 versus p not equal to 1/2, 28:25 which, as I said, is what you want to do if you 28:27 want to test if a coin is fair. 28:33 And so here, I'm going to build a test statistic. 28:36 And we concluded last time that-- 28:39 what do we want for this statistic? 28:41 We want it to have a distribution which, 28:44 under the null, does not depend on the parameters, 28:49 a distribution that I can actually compute quintiles of. 
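As a sanity check of this "plug in 0.5 under the null" idea, here is a short simulation sketch (not from the lecture; n = 50 and the number of repetitions are arbitrary choices) showing that once p is replaced by 0.5, the statistic is computable from the data alone, and under the null its distribution is close to the standard Gaussian whose quantiles we can look up.

```python
# Sketch: under H0 (p = 0.5), Tn = sqrt(n)(Xbar - 0.5)/sqrt(Xbar(1 - Xbar)) depends on
# nothing unknown, and by the CLT (plus Slutsky) it is approximately N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50

def t_stat(x):
    xbar = x.mean()
    return np.sqrt(n) * (xbar - 0.5) / np.sqrt(xbar * (1 - xbar))

draws = np.array([t_stat(rng.binomial(1, 0.5, size=n)) for _ in range(20000)])
for q in [0.90, 0.95, 0.975]:
    print(f"{q:.3f}-quantile: simulated {np.quantile(draws, q):5.2f},  N(0,1) {norm.ppf(q):5.2f}")
```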
28:54 So what we did is, we said, well, 28:56 if I look at-- the central limit theorem tells me that square 28:59 root of n xn bar minus p divided by-- 29:03 so if I do central limit theorem plus Slutsky, for example, 29:06 I'm going to have square root. 29:08 29:12 And we've had this discussion whether we want 29:13 to use Slutsky or not here. 29:15 But let's assume we're taking Slutsky wherever we can. 29:17 So this thing tells me that, by the central limit 29:20 theorem, as n goes to infinity, this thing converges 29:23 in distribution to some n01. 29:25 29:28 Now, as we said, this guy is not something we know. 29:31 But under the null, we actually know it. 29:34 And we can actually replace it by 1/2. 29:37 So this thing holds under h0. 29:41 When I write under h0, it means when this is the truth. 29:44 29:47 So now I have something that converges 29:49 to something that has no dependence on anything I 29:52 don't know. 29:53 And in particular, if you have any statistics textbook, which 29:56 you don't because I didn't require one-- 29:59 and you should be thankful, because these things cost $350. 30:04 Actually, if you look at the back, 30:05 you actually have a table for a standard Gaussian. 30:12 I could have anything else here. 30:13 I could have an exponential distribution. 30:15 I could have a-- 30:17 I don't know-- well, we'll see the chi squared 30:20 distribution in a minute. 30:22 Any distribution from which you can actually 30:24 see a table that somebody actually 30:25 computed this thing for which you can actually 30:27 draw the pdf and start computing whatever probability you want 30:30 on them, then this is what you want 30:32 to see at the right-hand side. 30:35 This is any distribution. 30:36 It's called pivotal. 30:38 I think we've mentioned that before. 30:39 Pivotal means it does not depend on anything 30:41 that you don't know. 30:43 And maybe it's easy to compute those things. 30:45 Probably, typically, you need a computer to simulate them 30:47 for you because computing probabilities for Gaussians 30:50 is not an easy thing. 30:51 We don't know how to solve those integrals exactly, 30:53 we have to do it numerically. 30:56 So now I want to do this test. 31:08 My test statistic will be declared to be what? 31:12 Well, I'm going to reject if what 31:17 is larger than some number? 31:18 31:24 The absolute value of this guy. 31:27 So my test statistic is going to be 31:29 square root of n minus 0.5 divided by square root of xn 31:35 bar 1 minus xn bar. 31:38 31:41 That's my test statistic, absolute value of this guy, 31:43 because I want to reject either when this guy is too large 31:45 or when this guy is too small. 31:47 31:50 I don't know ahead whether I'm going 31:51 to see p larger than 1/2 or less than 1/2. 31:55 So now I need to compute c such that the probability 31:59 that tn is larger than c. 32:05 So that's the probability under p, which is unknown. 32:11 I want this probability to be less than some level alpha, 32:17 asymptotically. 32:18 So I want the limit of this guy to be less than alpha, 32:24 and that's the level of my test. 32:26 So that's the given level. 32:32 So I want this thing to happen. 32:33 Now, what I know is that this limit-- 32:35 32:38 actually, I should say given asymptotic level. 32:40 32:48 So what is this thing? 32:50 32:54 Well, OK, that's the probability that something 33:00 that looks like under p. 
33:03 So under p, this guy-- 33:05 so what I know is that tn is square root of n 33:08 minus xn bar minus 0.5 divided by square root of xn bar 33:15 1 minus xn bar exceeds. 33:18 33:23 Is this true that as n to infinity, 33:26 this probability is the same as the probability 33:28 that the absolute value of a Gaussian 33:30 exceeds c of a standard Gaussian? 33:33 Is this true? 33:34 33:37 AUDIENCE: The absolute value of the standard Gaussian. 33:39 PHILIPPE RIGOLLET: Yeah, the absolute. 33:41 So you're saying that this, as n becomes large enough, this 33:43 should be the probability that some absolute value of n01 33:48 exceeds c, right? 33:49 AUDIENCE: Yes. 33:51 PHILIPPE RIGOLLET: So I claim that this is not correct. 33:54 Somebody tell me why. 33:56 AUDIENCE: Even in the limit it's not correct? 33:57 PHILIPPE RIGOLLET: Even in the limit, it's not correct. 33:59 34:03 AUDIENCE: OK. 34:04 PHILIPPE RIGOLLET: So what do you see? 34:05 AUDIENCE: It's because, at the beginning, 34:07 we picked the worst possible true parameter, 0.5. 34:11 So we don't actually know that this 0.5 is the mean. 34:13 PHILIPPE RIGOLLET: Exactly. 34:15 So we pick this 0.5 here, but this is for any p. 34:19 But what is the only p I can get? 34:21 So what I want is that this is true for all p in theta 0. 34:26 But the only p that's in theta 0 is actually p is equal to 0.5. 34:31 So yes, what you said was true, but it 34:33 required to specify p to be equal to 0.5. 34:38 So this, in general, is not true. 34:40 But it happens to be true if p belongs to theta 0, which 34:47 is strictly equivalent to p is equal to 0.5, 34:53 because theta 0 is really just this one point, 0.5. 34:59 So now, this becomes true. 35:01 And so what I need to do is to find c such 35:03 that this guy is equal to what? 35:05 35:11 I mean, let's just follow. 35:14 So I want this to be less than alpha. 35:16 But then we said that this was equal to this, 35:19 which is equal to this. 35:21 So all I want is that this guy is less than alpha. 35:24 But we said we might as well just make it equal to alpha 35:28 if you allow me to make it as big as I want, 35:30 as long as it's less than alpha. 35:32 AUDIENCE: So this is a true statement. 35:33 PHILIPPE RIGOLLET: So this is a true statement. 35:35 But it's under this condition. 35:38 AUDIENCE: Exactly. 35:39 35:43 PHILIPPE RIGOLLET: So I'm going to set it equal to alpha, 35:48 and then I'm going to try to solve for c. 35:52 36:10 So what I'm looking for is a c such that 36:13 if I draw a standard Gaussian-- 36:17 so that's pdf of some n01-- 36:20 I want the probability that the absolute value of my Gaussian 36:23 exceeding this guy-- 36:25 so that means being either here or here. 36:29 So that's minus c and c. 36:31 I want the sum of those two things to be equal to alpha. 36:36 So I want the sum of these areas to equal alpha. 36:53 So by symmetry, each of them should 36:56 be equal to alpha over 2. 36:58 37:02 And so what I'm looking for is c such that the probability 37:08 that my n01 exceeds c, which is just this area to the right, 37:15 now, equals alpha, which is equivalent to taking c, which 37:20 is q equals alpha over 2, and that's q alpha over 2 37:26 by definition of q alpha over 2. 37:28 That's just what q alpha over 2 is. 37:30 And that's what the tables at the back of the book give you. 37:34 Who has already seen a table for Gaussian probabilities? 37:42 What it does, it's just a table. 37:44 I mean, it's pretty ancient. 
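Today the table at the back of the book is one line of code. A sketch, assuming scipy is available: q alpha over 2 is just the (1 - alpha/2)-quantile of the standard Gaussian.

```python
# The Gaussian "table", computed numerically.
from scipy.stats import norm

for alpha in [0.10, 0.05, 0.01]:
    print(f"alpha = {alpha:4.2f}   q_(alpha/2) = {norm.ppf(1 - alpha / 2):.3f}")
# alpha = 0.05 gives 1.960, the threshold used in the examples below.
```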
37:45 I mean, of course, you can actually ask 37:47 Google to do it for you now. 37:49 I mean, it's basically standard issue. 37:52 But back in the day, they actually had to look at tables. 37:56 And since the values alphas were pretty standard, 37:59 the values alpha that people were requesting 38:01 were typically 1%, 5%, 10%, all you 38:04 could do is to compute these different values 38:07 for different values of alpha. 38:08 That was it. 38:10 So there's really not much to give you. 38:13 So for the Gaussian, I can tell you 38:15 that alpha is equal to-- if alpha is equal to 5%, 38:20 then q alpha over 2, q 2.5% is equal to 1.96, for example. 38:27 So those are just fixed numbers that 38:28 are functions of the Gaussian. 38:31 So everybody agrees? 38:32 We've done that before for our confidence intervals. 38:37 38:40 And so now we know that if I actually 38:42 plug in this guy to be q alpha over 2, then 38:48 this limit is actually equal to alpha. 38:51 And so now I've actually constrained this. 38:53 39:01 So q alpha over 2 here for alpha equals 5%, as I said, is 1.96. 39:07 So in the example 1, the number that we found was 3.54, 39:13 I think, or something like that, 3.55 for t. 39:18 So if we scroll back very quickly, 3.45-- 39:29 that was example 1. 39:30 Example two-- negative 0.77. 39:33 So if I look at tn in example 1, tn 39:40 was just the absolute value of 3.45, which-- 39:46 don't pull out your calculators-- is equal to 3.45. 39:50 Example 2, absolute value of negative 0.77 39:54 was equal to 0.77. 39:57 And so all I need to check is, is this number 39:59 larger or smaller than 1.96? 40:01 That's what my test ends up being. 40:06 So in example 1, 3.45 being larger 40:12 than 1.96, that means that I reject. 40:18 Fairness of my coins, in example 2, 40:22 0.77 being smaller than 1.96-- 40:27 what do I do? 40:29 I fail to reject. 40:30 40:44 So here is a question. 40:45 40:47 In example 1, for what level alpha would psi alpha-- 40:54 40:57 OK, so here, what's going to happen 41:00 if I start decreasing my level? 41:04 When I decrease my level, I'm actually 41:07 making this area smaller and smaller, 41:09 which means that I push this c to the right. 41:13 So now I'm asking, what is the smallest c 41:17 I should pick so that now, I actually do not reject h0? 41:22 What is the smallest c I should be taking here? 41:29 What is the smallest c? 41:30 41:37 So c here, in the example I gave you for 5%, was 1.96. 41:43 What is the smallest c I should be taking so that now, 41:49 this inequality is reversed? 41:50 41:54 3.45. 41:55 I ask only trivial questions, don't be worried. 41:58 So 3.45 is the smallest c that I'm actually 42:02 willing to tolerate. 42:04 So let's say this was my 5%. 42:07 If this was 2.5-- 42:09 if here, let's say, in this picture, 42:11 alpha is 5%, that means maybe I need to push here. 42:16 And this number should be what? 42:18 So this is going to be 1.96. 42:20 And this number here is going to be 3.45, clearly to scale. 42:26 And so now, what I want to ask you is, 42:30 well, there's two ways I can understand this number 3.45. 42:33 It is the number 3.45, but I can also 42:36 try to understand what is the area to the right of this guy. 42:40 And if I understand what the area to the right of this guy 42:42 is, this is actually some alpha prime over 2. 42:47 And that means that if I actually 42:49 fix this level alpha prime, that would 42:53 be exactly the tipping point at which I would 42:57 go from accepting to rejecting. 
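Here is the whole decision rule applied to the two realized statistics quoted in the lecture, 3.45 for example 1 and negative 0.77 for example 2; the only assumption is that we compare against the asymptotic 5% threshold q 2.5% = 1.96 as above.

```python
# Reject H0 at asymptotic level 5% iff |Tn| > q_(2.5%) ~ 1.96.
from scipy.stats import norm

q = norm.ppf(0.975)
for name, tn in [("example 1", 3.45), ("example 2", -0.77)]:
    decision = "reject H0" if abs(tn) > q else "fail to reject H0"
    print(f"{name}: |Tn| = {abs(tn):.2f} vs {q:.2f} -> {decision}")
```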
43:01 So I knew, in terms of absolute thresholds, 43:04 3.45 is the trivial answer to the question. 43:07 That's the tipping point, because I'm 43:09 comparing a number to 3.45. 43:11 But now, if I try to map this back 43:13 and understand what level would have been giving me 43:16 this particular tipping point, that's 43:18 a number between 0 and 1. 43:21 The smaller the number, the larger this number here, 43:25 which means the more evidence I have in my data 43:28 against h0. 43:30 And so this number is actually something called the p-value. 43:36 And the same for example 2: there's 43:38 the tipping point alpha at which I 43:40 go from failing to reject to rejecting. 43:44 And that's exactly the number, the area under the curve, 43:47 such that here, I see 0.77. 43:53 And this is this alpha prime prime over 2. 43:56 43:59 Alpha prime prime is clearly larger than 5%. 44:04 So what's the advantage of thinking and mapping back 44:06 these numbers? 44:08 Well, now, I'm actually going to spit out some number which 44:11 is between 0 and 1. 44:12 And that should be the only scale you should have in mind. 44:18 Remember, we discussed that last time. 44:20 I was like, well, if I actually spit out 44:22 a number which is 3.45, maybe you can try to think, 44:26 is 3.45 a large number for a Gaussian? 44:29 That's a number. 44:29 But if I had another random variable that was not Gaussian, 44:32 maybe it was a double exponential, 44:33 you would have to have another scale in your mind. 44:36 Is 3.45 so large that it's unlikely for it 44:42 to come from a double exponential? 44:44 If I had a gamma distribution-- 44:46 I can think of any distribution, and then that means, 44:48 for each distribution, you would have to have a scale in mind. 44:51 So of course, you can have the Gaussian scale in mind. 44:53 I mean, I have the Gaussian scale in mind. 44:55 But then, if I map it back into this number between 0 and 1, 44:59 all the distributions play the same role. 45:02 So whether I'm talking about if my limiting distribution is 45:05 normal or exponential or gamma, or whatever you want, 45:09 for all these guys, I'm just going 45:11 to map it into one number between 0 and 1. 45:13 Small number means lots of evidence against h0. 45:16 Large number means very little evidence against h0. 45:25 And this is the only number you need to keep in mind. 45:27 And the question is, am I willing 45:29 to tolerate this number between 5%, 6%, or maybe 10%, 12%? 45:34 And this is the only scale you have to have in mind. 45:37 And this scale is the scale of p-values. 45:41 So the p-value is the tipping point in terms of alpha. 45:48 In words, I can make it formal, because tipping point, 45:52 as far as I know, is not a mathematical term. 45:54 So a p-value of a test is the smallest, 45:58 potentially asymptotic level if I talk about an asymptotic 46:01 p-value-- 46:02 and that's what we do when we talk about the central limit 46:05 theorem-- at which the test rejects h0. 46:09 If I were to go any smaller-- 46:10 46:14 sorry, it's the smallest level-- 46:17 yeah, if I were to go any smaller, 46:19 I would fail to reject. 46:21 The smaller the level, the less likely it is for me to reject. 46:25 And if I were to go any smaller, I 46:26 would start failing to reject. 46:31 And so it is a random number. 46:33 It depends on what I actually observe. 46:35 So here, of course, I instantiated those two numbers, 46:39 3.45 and 0.77, as realizations of random variables.
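For this two-sided Gaussian test, the tipping point just described has a closed form: the asymptotic p-value is the probability that the absolute value of a standard Gaussian exceeds the observed |Tn|, that is 2(1 - Phi(|Tn|)). A sketch applying it to the two examples from the lecture:

```python
# p-value of the two-sided test: P(|N(0,1)| > |Tn|) = 2 * (1 - Phi(|Tn|)).
# Reject at level alpha exactly when the p-value is below alpha.
from scipy.stats import norm

def p_value(tn):
    return 2 * (1 - norm.cdf(abs(tn)))

print(f"example 1: p-value ~ {p_value(3.45):.5f}")   # ~ 0.00056, reject at 5% (and at 1%)
print(f"example 2: p-value ~ {p_value(-0.77):.5f}")  # ~ 0.44, fail to reject at 5%
```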
46:44 But if you think of those as being the random numbers 46:46 before I see my data, this was a random number, 46:50 and therefore, the area under the curve to the right of it 46:53 is also a random area. 46:55 If this thing fluctuates, then the area under the curve 46:58 fluctuates. 47:00 And that's what the p-value is. 47:02 That's what-- what is his name? 47:05 I forget. 47:06 John Oliver talks about when he talks about p-hacking. 47:10 And so we talked about this in the first lecture. 47:14 So p-hacking is, how do I do-- oh, if I'm a scientist, 47:18 do I want to see a small p-value or a large p-value? 47:20 AUDIENCE: Small. 47:21 PHILIPPE RIGOLLET: Small, right? 47:22 Scientists want to see small p-values because small p-values 47:24 equals rejecting, which equals discovery, 47:28 which equals publications, which equals promotion. 47:31 So that's what people want to see. 47:34 So people are tempted to see small p-values. 47:37 And what's called p-hacking is, well, find a way to cheat. 47:41 Maybe look at your data, formulate your hypothesis 47:44 in such a way that you will actually have a smaller 47:49 p-value than you should have. 47:51 So here, for example, there's one thing 47:53 I did not insist on because, again, this is not 47:54 a particular course on statistical thinking, 47:57 but one thing that we implicitly did 47:59 was set those theta 0 and theta 1 ahead of time. 48:04 I fixed them, and I'm trying to test this. 48:08 This is to be contrasted with the following approach. 48:11 I draw my data. 48:13 So I draw-- 48:15 I run this experiment, which is probably 48:16 going to get me a publication in nature. 48:18 I'm trying to test if a coin is fair. 48:23 And I draw my data, and I see that there's 48:24 13 out of 30 of my observations that are heads. 48:31 That means that, from this data, it 48:32 looks like p is less than 1/2. 48:36 So if I look at this data and then 48:38 decide that my alternative is not p not equal to 1/2, 48:42 but rather p less than 1/2, that's p-hacking. 48:47 I'm actually making my p-value strictly smaller 48:50 by first looking at the data, and then deciding what 48:53 my alternative is going to be. 48:54 And that's cheating, because all the things we did, 48:58 we're assuming that this 0.5, or the alternative, 49:02 was actually a fixed-- everything was deterministic. 49:05 The only randomness came from the data. 49:07 But if I start looking at the data 49:08 and designing my experiment or my alternatives 49:11 and null hypothesis based on the data, 49:13 it's as if I started putting randomness all over the place. 49:15 And then I cannot control it because I don't know how it 49:18 just intermingles with each other. 49:22 So that was for the John Oliver moment. 49:26 49:29 So the p-value is nice. 49:32 So maybe I mentioned that, before, my wife 49:35 works in market research. 49:36 And maybe every two years, she seems 49:40 to run into a statistician in the hallway, 49:42 and she comes home and says, what is a p-value again? 49:45 And for her, a p-value is just the number 49:48 in an Excel spreadsheet. 49:50 And actually, small equals good and large equals bad. 49:55 And that's all she needs to know at this point. 49:57 Actually, they do the job for her-- small is green, 50:01 large is red. 50:02 And so for her, a p-value is just green or red. 50:06 But so what she's really implicitly doing 50:08 with this color code is just applying the golden rule. 
50:12 What the statisticians do for her in the Excel spreadsheet 50:16 is that they take the numbers for the p-values that 50:18 are less than some fixed level. 50:20 So depending on the field in which she works-- 50:22 so she works for pharmaceutical companies-- 50:24 so the p-values are typically compared-- 50:26 the tests are usually performed at level 1%, rather than 5%. 50:31 So 5% is maybe your gold standard 50:33 if you're doing sociology or trying to-- 50:36 I don't know-- release a new blueberry flavor 50:39 for your toothpaste. 50:40 Something that's not going to change the life of people, 50:43 maybe you're going to run at 5%. 50:45 It's OK to make a mistake. 50:46 See, people are just going to feel gross, 50:47 but that's about it, whereas here, 50:50 if you have this p-value which is less than 1%, 50:53 it might be more important for some drug discovery, 50:55 for example. 50:56 And so let's say you run at 1%. 50:59 And so what they do in this Excel spreadsheet is 51:02 that all the numbers that are below 1% show up in green 51:05 and all the numbers that are above 1% show up in red. 51:09 And that's it. 51:09 That's just applying the golden rule. 51:11 If the number is green, reject. 51:13 If the number is red, fail to reject. 51:18 Yeah? 51:18 AUDIENCE: So going back to example 2 51:20 where the prior example where you 51:23 want to cheat by looking after beta 51:26 and then formulating, say, theta 1 to be p less than 1/2. 51:32 PHILIPPE RIGOLLET: Yeah. 51:33 AUDIENCE: So how would you achieve your goal 51:38 by changing the theta-- 51:40 PHILIPPE RIGOLLET: By achieving my goal, 51:42 you mean letting ethics aside, right? 51:45 AUDIENCE: Yeah, yeah. 51:46 PHILIPPE RIGOLLET: Ah, you want to be published. 51:47 AUDIENCE: Yeah. 51:48 PHILIPPE RIGOLLET: [LAUGHS] So let me teach you how, then. 51:54 So well, here, what do you do? 51:58 You want to-- at the end of the day, 52:03 a test is only telling you whether you found evidence 52:06 in your data that h1 was more likely than h0, basically. 52:11 How do you make h1 more likely? 52:12 Well, you just basically target h1 to be what it is-- 52:18 what the data is going to make it more likely to be. 52:21 So if, for example, I say h1 can be on both sides, 52:26 then my data is going to have to take into account fluctuations 52:29 on both sides, and I'm going to lose a factor or two somewhere 52:31 because things are not symmetric. 52:33 Here is the ultimate way of making this work. 52:38 I'm going back to my example of flipping coins. 52:42 And now, so here, what I did is, I said, 52:45 oh, this number 0.43 is actually smaller than 0.5, 52:54 so I'm just going to test whether I'm 0.5 52:56 or I'm less than 0.5. 52:58 But here is something that I can promise you 53:01 I did not make the computation will reject. 53:04 So here, this one actually-- 53:06 yeah, this one fails to reject. 53:08 So here is one that will certainly reject. 53:11 h0 is 0.5, p is 0.5, h1p is 0.43. 53:24 Now, you can try, but I can promise you 53:27 that your data will tell you that h1 is the right one. 53:32 I mean, you can check very quickly that this is really 53:36 extremely likely to happen. 53:37 53:40 Actually, what am I-- 53:41 53:45 no, actually, that's not true, because here, 53:52 the test that I derive that's based on this kind of stuff, 53:56 here at some point, somewhere under some layers, 53:59 I assume that all our tests are going to have this form. 
54:04 But here, this is only when you're 54:06 trying to test one region versus another region next to it, 54:09 or one point versus a region around it, 54:11 or something like this, whereas for this guy, 54:13 there's another test that could come up with, 54:15 which is, what is the probability that I get 0.43, 54:18 and what is the probability that I get 0.5? 54:21 Now, what I'm going to do is, I'm 54:23 going to just conclude it's whichever 54:25 has the largest probability. 54:27 Then maybe I'm going to have to make some adjustments so 54:29 that the level is actually 5%. 54:32 But I can make this happen. 54:33 I can make the level be 5% and always conclude this guy, 54:36 but I would have to use a different test. 54:38 Now, the test that I described, again, 54:40 those tn larger than c are built in 54:42 to be tests that are resilient to these kind of manipulations 54:46 because they're oblivious towards what 54:48 the alternative looks like. 54:50 I mean, they're just saying it's either to the left 54:51 or to the right, but whether it's 54:53 a point or an entire half-line doesn't matter. 54:55 54:59 So if you try to look at your data 55:01 and just put the data itself into your hypothesis testing 55:05 problem, then you're failing the statistical principle. 55:10 And that's what people are doing. 55:12 I mean, how can I check? 55:13 I mean, of course, here, it's going 55:15 to be pretty blatant if you publish 55:16 a paper that looks like this. 55:17 But there's ways to do it differently. 55:19 For example, one way to do it is to just do mult-- 55:21 so typically, what people do is they 55:23 do multiple hypothesis testing. 55:24 They're doing 100 tests at a time. 55:27 Then you have random fluctuations every time. 55:30 And so they just pick the one that 55:32 has the random fluctuations that go their way. 55:34 I mean, sometimes it's going in your way, 55:36 and sometimes it's going the opposite way, 55:37 so you just pick the one that works for you. 55:39 We'll talk about multiple hypothesis testing soon 55:41 if you want to increase your publication count. 55:44 55:49 There's actually papers-- 55:50 I think it was a big news that some papers, 55:53 I think, in psychology or psychometrics 55:54 papers that actually refused to publish p-values now. 55:57 56:03 Where were we? 56:05 Here's the golden rule. 56:07 So one thing that I like to show is this thing, 56:11 just so you know how you apply the golden rule 56:14 and how you apply the standard tests. 56:16 So the standard paradigm is the following. 56:25 You have a black box, which is your test. 56:29 For my wife, this is the 4th floor of the building. 56:32 That's where the statisticians sit. 56:33 What she sends there is data-- 56:35 56:38 let's say x1 xn. 56:41 And she says, well, this one is about toothpaste, 56:43 so here's a level-- 56:45 let's say 5%. 56:47 What the 4th floor brings back is that answer-- yes, 56:50 no, green, red, just an answer. 56:53 56:58 So that's the standard testing. 56:59 You just feed it the data and the level at which you 57:02 want to perform the test, maybe asymptotic, 57:04 and it spits out a yes, no answer. 57:06 What p-value does, you just feed it the data itself. 57:15 57:18 And what it spits out is the p-value. 57:22 And now it's just up to you. 57:23 I mean, hopefully your brain has the computational power 57:27 of deciding whether a number is larger or smaller than 5% 57:31 without having to call a statistician for this. 57:33 And that's what it does. 57:35 So now we're on 1 scale. 
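The two black boxes just described can be written down explicitly for the coin example. This is only an illustration; the function names and the sample data are made up.

```python
# Two interfaces to the same coin test (hypothetical helper names).
import numpy as np
from scipy.stats import norm

def t_stat(x):
    xbar = x.mean()
    return np.sqrt(len(x)) * (xbar - 0.5) / np.sqrt(xbar * (1 - xbar))

def test_at_level(x, alpha):
    """Standard paradigm: data and a level go in, a yes/no answer comes out."""
    return abs(t_stat(x)) > norm.ppf(1 - alpha / 2)

def p_value(x):
    """p-value paradigm: only data goes in, a number between 0 and 1 comes out."""
    return 2 * (1 - norm.cdf(abs(t_stat(x))))

x = np.random.default_rng(2).binomial(1, 0.5, size=200)   # some made-up coin flips
print("reject at 5%?", test_at_level(x, 0.05), "   p-value:", round(p_value(x), 3))
```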
57:37 Now, I see some of you nodding when I talk about p-hacking, 57:41 so that means you've seen p-values. 57:43 If you've seen more than 100 p-values in your life, 57:45 you have an entire scale. 57:47 A good p-value is less than 10 to the minus 4. 57:50 That's the ultimate sweet spot. 57:53 Actually, statistical software spits out 57:56 an output which says less than 10 to the minus 4. 58:01 But then maybe you want a p-val-- 58:02 58:05 if you tell me my p-value was 4.65%, then I will say, 58:08 you've been doing some p-hacking until you found 58:10 a number that was below 5%. 58:12 That's typically what people will do. 58:14 But if you tell me-- 58:16 if you're doing the test, if you're saying, 58:18 I published my result, my test at 5% 58:21 said yes, that means that maybe your p-value was 4.99%, 58:27 or your p-value was 10 to the minus 4, I will never know. 58:29 I will never know how much evidence 58:31 you had against the null. 58:34 But if you tell me what the p-value is, 58:36 I can make my own decision. 58:37 You don't have to tell me whether it's a yes or a no. 58:39 You tell me it's 4.99%, I'm going to say, well, maybe yes, 58:42 but I'm going to take it with a grain of salt. 58:45 And so that's why p-values are good numbers to have in mind. 58:48 Now, I said it as if it was like an old trick 58:51 that you start mastering when you're 45 years old. 58:54 No, it's just, how small is the number between 0 and 1? 58:57 That's really what you need to know. 59:00 Maybe on the log scale-- if it's 10 to the minus 1, 59:03 10 to the minus 2, 10 to the minus 3, et cetera-- 59:07 that's probably the extent of the mastery here. 59:09 59:12 So this traditional standard paradigm that I showed 59:16 is actually commonly referred to as the Neyman-Pearson paradigm. 59:21 So here, it says Neyman-Pearson's theory, 59:23 so there's an entire theory that comes with it. 59:25 But it's really a paradigm. 59:27 It's a way of thinking about hypothesis testing that 59:29 says, well, if I'm not going to be able to optimize both 59:32 my type I and type II error, I'm actually 59:34 going to lock in my type I error below some level 59:37 and just minimize the type II error under this constraint. 59:42 That's what the Neyman-Pearson paradigm is. 59:45 And it sort of makes sense for hypothesis testing problems. 59:48 Now, if you were doing some other applications 59:50 with multi-objective optimization, 59:52 you would maybe come up with something different. 59:54 For example, machine learning is typically not performing 59:58 under the Neyman-Pearson paradigm. 60:01 So if you do spam filtering, you could say, well, 60:05 I want to constrain the probability as much as I can 60:08 of taking somebody's important emails 60:10 and throwing them out as spam, and under this constraint, 60:14 not send too much spam to that person. 60:17 That sort of makes sense for spam. 60:19 Now, if you're labeling cats versus dogs, it's probably 60:23 not like you want to make sure that no more than 5% 60:27 of the dogs are labeled cat because, I mean, 60:30 it doesn't matter. 60:31 So what you typically do is, you just 60:33 sum up the two types of errors you can make, 60:34 and you minimize the sum without putting any more 60:36 weight on one or the other. 60:38 So here's an example where, making a binary decision with two types 60:42 of errors you can make, you don't 60:45 have to actually be like that. 60:47 So this example here, I did not.
60:50 The trivial test psi is equal to 0, what was it 60:55 in the US trial court example? 61:00 What is psi equals 0? 61:03 That was concluding always to the null. 61:05 What was the null? 61:08 AUDIENCE: Innocent. 61:08 PHILIPPE RIGOLLET: Innocent, right? 61:10 That's the status quo. 61:11 So that means that this guy never rejects h0. 61:14 Everybody's going away free. 61:16 So you're sure you're not actually 61:18 going against the constitution because alpha is 0%, which 61:25 is certainly less than 5%. 61:26 But the power, the fact that a lot of criminals 61:30 go back outside in the free world 61:34 is actually formulated in terms of low power, which, 61:37 in this case, is actually 0. 61:39 Again, the power is a number between 0 and 1. 61:41 Close to 1, good. 61:43 Close to 0, bad. 61:45 Now, what is the definition of the p-value? 61:51 That's going to be something-- it's a mouthful. 61:54 The definition of the p-value is a mouthful. 61:58 It's the tipping point. 62:00 It is the smallest level at which blah, blah, blah, blah, 62:02 blah. 62:03 It's complicated to remember it. 62:05 Now, I think that by my 6th explanation to my wife, 62:09 after saying, oh, so it's the probability of making an error, 62:12 I said, yeah, that's the probability of making 62:14 an error because, of course, she can 62:16 think probability of making an error small, good, large, bad. 62:22 So that's actually a good way to remember. 62:24 I'm pretty sure that at least 50% 62:26 of people who are using p-values out there 62:28 think that the p-value is the probability of making an error. 62:31 Now, for all intents and purposes, 62:33 if your goal is to just threshold the p-value, 62:35 this is OK to have in mind. 62:37 But when it comes, at least until December 22, 62:42 I would recommend trying to actually memorize 62:44 the right definition of the p-value. 62:46 62:53 So the idea, again, is fix the level 62:55 and try to optimize the power. 62:57 63:01 So we're going to try to compute some p-values from now on. 63:05 How do you compute the p-value? 63:06 Well, you can actually see it from this picture over there. 63:10 63:14 One thing I didn't show on this picture-- so here, 63:16 it was my q alpha over 2 that had alpha over 2 here, 63:19 alpha over 2 here. 63:21 That was my q alpha over 2. 63:22 And I said, if tn is to the right of this guy, 63:26 I'm going to reject. 63:27 If tn is to the left of this guy, 63:29 I'm going to fail to reject. 63:31 Pictorially, you can actually represent the p-value. 63:34 It's when I replace this guy by tn itself. 63:36 63:41 Sorry, that's p-value over 2. 63:44 No, actually, that's the p-value. 63:47 So let me just keep it like that and put the absolute value 63:51 here. 63:51 63:54 So if you replace the role of q alpha over 2 by your test 63:58 statistic, the area under the curve 64:01 is actually the p-value itself, up 64:03 to a scale because of the symmetric thing. 64:06 So there's a good way to see, pictorially, 64:09 what the p-value is. 64:10 It's just the probability that some Gaussians-- 64:13 it's just the probability that some absolute value of n01 64:17 exceeds tn. 64:18 64:22 That's what the p-value is. 64:24 Now, this guy has nothing to do with this guy, 64:26 so this is really just 1 minus the Gaussian cdf of tn, 64:32 and that's it. 64:34 So that's how I would compute p-values. 64:36 Now, as I said, the p-value is a beauty 64:40 because you don't have to understand 64:43 the fact that your limiting distribution is a Gaussian.
64:47 It's already factored in this construction. 64:49 The fact that I'm actually looking 64:50 at this cumulative distribution function of a standard Gaussian 64:54 makes my p-value automatically adjust to what 64:57 the limiting distribution is. 64:58 And if this was the cumulative distribution 65:00 function of a exponential, I would just 65:03 have a different function here denoted by f, for example, 65:06 and I would just compute a different value. 65:07 But in the end, regardless of what the limiting value is, 65:10 my p-value would still be a number between 0 and 1. 65:13 And so to illustrate that, let's look 65:16 at other weird distributions that we could get in place 65:20 of the standard Gaussian. 65:22 And we're not going to see many, but we'll see one. 65:24 And it's not called the chi squared distribution. 65:27 It's actually called the Student's distribution, 65:29 but it involves the chi squared distribution 65:31 as a building block. 65:34 So I don't know if my phonetics are not really right there, 65:38 so I try to say, well, it's chi squared. 65:43 Maybe it's "kee" squared above, in Canada, who knows. 65:47 So for a positive integer, so there's only 1 parameter. 65:50 So for the Gaussian, you have 2 parameters, 65:52 which are mu and sigma squared. 65:54 Those are real numbers. 65:55 Sigma squared's positive. 65:57 Here, I have 1 integer parameter. 65:59 66:03 Then the chi squared distribution 66:05 with d degrees of freedom-- 66:07 so the parameter is called a degree of freedom, 66:09 just like mu is called the expected value and sigma 66:11 squared is called the variance. 66:12 Here, we call it degrees of freedom. 66:14 You don't have to really understand why. 66:17 So that's the law that you would get-- 66:19 that's the random variable you would 66:21 get if you were to sum d squares of independent standard 66:26 Gaussians. 66:26 66:29 So I take the square of an independent random Gaussian. 66:33 I take another one. 66:34 I sum them, and that's a chi squared 66:36 with 2 degrees of freedom. 66:39 That's how you get it. 66:40 Now, I could define it using its probability density function. 66:46 I mean, after all, this is the sum 66:49 of positive random variables, so it 66:51 is a positive random variable. 66:53 It has a density on the positive real line. 66:56 And the pdf of chi squared with d degrees of freedom is what? 67:03 Well, it's fd of x is-- 67:07 what is it?-- x to the d/2 minus 1 e to the minus x/2. 67:13 And then here, I have a gamma of d/2. 67:16 And the other one is, I think, 2 to the d/2 minus 1. 67:20 67:23 No, 2 to the d/2. 67:26 That's what it is. 67:28 That's the density. 67:30 If you are very good at probability, 67:32 you can make the change of variable 67:33 and write your Jacobian and do all this stuff 67:35 and actually check that this is true. 67:37 I do not recommend doing that. 67:40 So this is the density, but it's better understood like that. 67:44 I think it was just something that you 67:46 built from standard Gaussian. 67:48 So for example, an example of a chi 67:50 squared with 2 degrees of freedom 67:52 is actually the following thing. 67:54 Let's assume I have a target like this. 67:56 68:00 And I don't aim very well. 68:02 And I'm trying to hit the center. 68:05 And I'm not going to have, maybe, 68:07 a deviation, which is standard Gaussian left, right 68:10 and standard Gaussian north, south. 
68:16 So I'm throwing, and then I'm here, 68:18 and I'm claiming that this number here, by Pythagoras' 68:22 theorem, the square distance here 68:24 is the sum of this square distance 68:25 here, which is the square of a Gaussian by assumption, 68:30 plus the square of this distance, 68:31 which is the square of another independent Gaussian. 68:34 I assume those are independent. 68:35 And so the square distance from this point to this point 68:37 is a chi squared with 2 degrees of freedom. 68:40 So this guy here is n01 squared. 68:45 This is n01 squared. 68:48 And so this guy here, this distance here, 68:50 is chi squared with 2 degrees of freedom. 68:53 I mean the square distance. 68:54 I'm talking about square distances here. 68:58 So now you can see that, actually, Pythagoras 69:02 is basically why the chi squared arises. 69:05 That's why it has its own name. 69:07 I mean, I could define this random variable. 69:10 I mean, it's actually a gamma distribution. 69:13 It's a special case of something called the gamma distribution. 69:15 The fact that the special case has its own name 69:17 is because there are many times when 69:19 we're going to take sums of squares 69:20 of independent Gaussians, because for Gaussians, the sum of squares 69:23 is really the norm, the Euclidean norm squared, 69:25 just by Pythagoras' theorem. 69:26 If I'm in higher dimension, I can 69:28 start to sum more squared coordinates, 69:30 and I'm going to measure the norm squared. 69:32 69:34 So if you want to draw this picture, it looks like this. 69:37 Again, it's the sum of positive numbers, 69:39 so it's going to be on 0 to plus infinity. 69:43 That's fd. 69:44 And so f1 looks like this, f2 looks like this. 69:52 So the tails become heavier and heavier as d increases. 69:57 And then at d equal to 3, it starts 70:00 to have a different shape. 70:01 It starts from 0 and it looks like this. 70:04 And then, as d increases, it's basically 70:06 as if you were to push this thing to the right. 70:09 It's just like, psh, so it's just falling like a big blob. 70:14 Everybody sees what's going on? 70:16 So there's just this fat thing that's just going there. 70:19 What is the expected value of a chi squared? 70:21 70:28 So it's the expected value of the sum 70:30 of squared Gaussian random variables. 70:37 I know I said that. 70:40 AUDIENCE: So it's the sum of their second moments, right? 70:42 PHILIPPE RIGOLLET: Which is? 70:43 70:46 Those are n01. 70:47 AUDIENCE: It's like-- oh, I see, 1. 70:50 PHILIPPE RIGOLLET: Yeah. 70:51 AUDIENCE: So n times 1 or d times 1. 70:53 PHILIPPE RIGOLLET: Yeah, which is d. 70:55 So one thing you can check quickly 70:56 is that the expected value of a chi squared is d. 71:00 And so you see, that's why the mass is shifting to the right 71:04 as d increases. 71:05 It's just going there. 71:06 Actually, the variance is also increasing. 71:08 The variance is 2d. 71:10 71:14 So this is one thing. 71:16 And so why do we care about this? 71:19 In basic statistics, it's not like we actually 71:22 do much statistics about throwing darts 71:25 at high-dimensional boards. 71:28 So what's happening is that if I look at the sample variance, 71:31 the average of the squared observations centered by their mean, 71:36 then I can actually expand this as the average 71:38 of the squares minus the average, squared. 71:42 It's just the same trick that we have 71:44 for the variance-- second moment minus first moment squared.
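A minimal sketch, assuming Python with numpy, of the dart picture and of the moments just computed: the squared distance of a throw whose left-right and north-south errors are independent standard Gaussians is a chi squared with d = 2 degrees of freedom, so its mean should come out near d = 2 and its variance near 2d = 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Dart throws: independent standard Gaussian errors left-right and north-south.
x = rng.standard_normal(n)
y = rng.standard_normal(n)
sq_dist = x**2 + y**2          # squared distance to the bullseye: chi squared with d = 2

print(sq_dist.mean())          # ~2, the expected value d
print(sq_dist.var())           # ~4, the variance 2d
```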
71:49 And then I claim that Cochran's theorem-- 71:53 and I will tell you in a second what Cochran's theorem tells me-- 71:56 is that this sample variance is actually-- 71:58 so if I had only this-- 72:01 look at those guys. 72:04 Those guys are Gaussian with mean mu and variance 72:07 sigma squared. 72:08 Think for one second of mu being 0 and sigma squared being 1. 72:13 Now, this part would be a chi squared with n degrees 72:16 of freedom divided by n. 72:19 Now I get another thing here, which 72:21 is the square of something that looks like a Gaussian as well. 72:24 So it looks like I have something else here, which 72:27 looks also like a chi squared. 72:29 Now, Cochran's theorem is essentially telling you 72:31 that those things are independent, 72:35 and so that, in a way, you can think of those guys as being, 72:39 here, n degrees of freedom minus 1 degree of freedom. 72:43 Now, here, as I said, this does not have mean 0 and variance 1. 72:47 The fact that it's not mean 0 is not a problem, 72:50 because I can remove the mean here and remove the mean here. 72:54 And so this thing has the same distribution, 72:57 regardless of what the actual mean is. 72:59 So without loss of generality, I can 73:00 assume that mu is equal to 0. 73:02 Now, the variance, I'm going to have to pay, 73:03 because if I multiply all these numbers by 10, 73:06 then this sn is going to be multiplied by 100. 73:09 So this thing is going to scale with the variance. 73:11 And not surprisingly, it's scaling like the square 73:13 of the scale-- like sigma squared. 73:15 So if I look at sn, it's distributed 73:18 as sigma squared times a chi squared 73:21 with n minus 1 degrees of freedom, divided by n. 73:25 And we don't really write that, because a chi squared 73:28 times sigma squared divided by n is not a distribution, 73:30 so we put everything to the left, 73:32 and we say that n times sn over sigma squared is actually a chi squared with n 73:34 minus 1 degrees of freedom. 73:36 So here, I'm actually dropping a fact on you, 73:40 but you can see the building block. 73:43 What is the thing that's fuzzy at this point, 73:46 but the rest should be crystal clear to you? 73:48 The thing that's fuzzy is that removing this squared guy 73:52 here is actually removing 1 degree of freedom. 73:55 That might seem weird, but that's what Cochran's theorem tells us. 73:59 It's essentially stating something 74:00 about orthogonality of subspaces with the span 74:04 of the constant vector, something like that. 74:07 So you don't have to think about it too much, 74:09 but that's what it's telling me. 74:11 But the rest, if you plug in-- so the scaling in sigma squared 74:15 and in n-- that should be completely clear to you. 74:18 So in particular, if I remove that part, 74:20 it should be clear to you that this thing, if the mean is 0, 74:24 this thing is actually distributed-- 74:27 well, if mu is 0, what is the distribution of this guy? 74:30 74:35 So I remove that part, just this part. 74:37 74:46 So I have xi, which are n0 sigma squared. 74:50 And I'm asking, what is the distribution of 1/n sum from i 74:53 equal 1 to n of xi squared? 74:57 So these are IID. 75:00 So it's the sum of squares of independent Gaussians, but not standard ones. 75:03 So the first thing to make them standard 75:05 is that I divide all of these squares by sigma squared. 75:07 75:10 Now, this guy is of the form zi squared where zi is n01. 75:17 75:20 So now, this thing here has what distribution? 75:25 AUDIENCE: Chi squared n. 75:27 PHILIPPE RIGOLLET: Chi squared n.
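A minimal simulation sketch, assuming Python with numpy and that the data really are Gaussian as stated: rescaling the sample variance by n over sigma squared should produce something distributed like a chi squared with n minus 1 degrees of freedom, hence with mean n minus 1 and variance 2(n minus 1). The values of mu, sigma, and n below are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 3.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
sn = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # sample variance with 1/n
scaled = n * sn / sigma**2                                     # claimed: chi squared, n - 1 df

print(scaled.mean())  # ~ n - 1 = 9
print(scaled.var())   # ~ 2(n - 1) = 18
```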
75:30 And now, sigma squared over n times chi squared n-- 75:33 so if I have sigma squared divided by n times chi 75:35 squared-- 75:37 sorry, so times n divided by sigma squared. 75:41 So if I take this thing and I multiply it 75:45 by n divided by sigma squared, it means I remove this term, 75:48 and now I am left with a chi squared 75:49 with n degrees of freedom. 75:51 Now, the effect of centering with the sample mean here 75:55 is only to lose 1 degree of freedom. 75:57 That's it. 75:58 76:01 So if I want to do a test about the variance, since this 76:05 is supposedly a good estimator of the variance, 76:08 this could be my pivotal distribution. 76:10 This could play the role of a Gaussian. 76:12 If I want to know if my variance is equal to 1 or larger than 1, 76:16 I could actually build a test based on this statement alone 76:21 and test if the variance is larger than 1 or not. 76:23 Now, this is not asymptotic, because I 76:25 started with the very assumption that my data was 76:28 Gaussian itself. 76:29 76:32 Now, just a side remark-- you can 76:33 check that this chi squared with 2 degrees of freedom is an exponential with parameter 76:37 1/2, which is certainly not 76:38 clear from the fact that z1 squared plus z2 squared 76:42 is a chi squared with 2 degrees of freedom. 76:44 If I give you the sum of the squares 76:46 of 2 independent Gaussians, this is actually an exponential. 76:50 That's not super clear, right? 76:53 But if you look at what was here-- 77:00 I don't know if you took notes, but let me rewrite it for you. 77:03 So it was x to the d/2 minus 1, e to the minus x/2, divided 77:08 by 2 to the d/2 gamma of d/2. 77:14 So if I plug in d is equal to 2, gamma of 2/2 77:18 is gamma of 1, which is 1. 77:21 It's factorial of 0. 77:23 So it's 1, so this guy goes away. 77:26 2 to the d/2 is 2 to the 1, so that's just 1. 77:33 No, that's just 2. 77:36 Then x to the d/2 minus 1 is x to the 0, goes away. 77:40 And so I have 1/2 e to the minus x/2, which is really, indeed, 77:47 of the form lambda e to the minus lambda 77:50 x for lambda equal to 1/2, which was 77:53 our exponential distribution. 77:54 77:59 Well, next week is, well, Columbus Day? 78:05 So not next Monday-- 78:08 so next week, we'll talk about Student's distribution. 78:12 And so that was discovered by a guy 78:15 who pretended his name was Student, but was not Student. 78:19 And I challenge you to find out why in the meantime. 78:23 So I'll see you next week. 78:24 Your homework is going to be outside 78:28 so we can release the room. 78:31
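A quick numerical companion, assuming Python with numpy and scipy, to the side remark above: the chi squared density with 2 degrees of freedom coincides with the exponential density with parameter lambda = 1/2, and a sample of z1 squared plus z2 squared has mean 1/lambda = 2.

```python
import numpy as np
from scipy.stats import chi2, expon

# Density check: chi squared with 2 df vs exponential with rate 1/2 (scale = 2 in scipy).
x = np.linspace(0.1, 10, 100)
print(np.allclose(chi2.pdf(x, df=2), expon.pdf(x, scale=2)))  # True

# Sample check: z1^2 + z2^2 has mean 2 = 1/lambda for lambda = 1/2.
rng = np.random.default_rng(3)
z = rng.standard_normal((100_000, 2))
print((z**2).sum(axis=1).mean())  # ~2
```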