https://www.youtube.com/watch?v=a66tfLdr6oY&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=10 Transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:20 PROFESSOR: So we've been talking about this chi square test. 00:22 And the name chi square comes from the fact 00:26 that we build a test statistic that 00:28 has an asymptotic distribution given 00:31 by the chi square distribution. 00:36 Let's just give it another shot. 00:37 00:44 OK. 00:47 This test. 00:48 Who has actually ever encountered the chi square test 00:50 outside of a stats classroom? 00:54 All right. 00:54 So some people have. 00:55 It's a fairly common test that you might encounter. 00:59 And it was essentially used to test, given 01:01 some data with a fixed probability mass function, so 01:06 a discrete distribution, whether 01:08 the PMF was equal to a set value, p0, 01:12 or whether it was different from p0. 01:15 And the way the chi square arose here 01:18 was by looking at Wald's test. 01:22 And essentially if you write-- so Wald's is the one that 01:25 has the chi square as the limiting distribution, 01:27 and if you invert the covariance matrix, 01:31 the asymptotic covariance matrix, so you compute 01:33 the Fisher information, which in this particular case 01:36 does not exist for the multinomial distribution, 01:39 but we found the trick on how to do this. 01:41 We removed the part that prevented it from being invertible, 01:44 and then we found this chi square distribution. 01:46 In a way we have this test statistic, 01:47 which you might have learned as a black box, a laundry-list recipe, 01:50 but going through the math, which might have been slightly 01:53 unpleasant, I acknowledge, really told you 01:56 why you should do this particular normalization. 01:59 So since some of you requested a few more practical examples 02:04 of how those things work, let me show you a couple. 02:07 The first one is you want to answer the question, well, 02:12 you know, when should I be born to be successful? 02:16 Some people believe in the zodiac, and so Fortune magazine 02:20 actually collected the signs of 256 heads of the Fortune 500. 02:24 Those were taken randomly. 02:26 And they were collected there, and you 02:27 can see the count of the number of CEOs that 02:31 have a particular zodiac sign. 02:33 And if this was completely uniformly distributed, 02:35 you should actually get a number that's 02:37 around 256 divided by 12, which in this case is 21.33. 02:42 And you can see that there are numbers 02:45 that are probably in the vicinity, but look at this guy. 02:49 Pisces, that's 29. 02:51 So who's Pisces here? 02:53 All right. 02:55 All right, so give me your information 02:57 and we'll meet again in 10 years. 02:59 And so basically you might want to test 03:02 whether the fact that it's uniformly distributed 03:04 is a valid assumption. 03:06 Now this is clearly a random variable. 03:09 I pick a random CEO and I measure 03:13 what his zodiac sign is. 03:16 And I want to know, so it's a probability over, I don't know, 03:19 12 zodiac signs. 03:20 And I want to know if it's uniform or not. 03:23 Uniform sounds like it should be the status 03:25 quo, if you're reasonable. 03:27 And maybe there's actually something that moves away from it.
03:31 So we could do this, in view of these data is there evidence 03:34 that one is different. 03:36 Here is another example where you might want 03:38 to apply the chi square test. 03:40 So as I said, the benchmark distribution 03:44 was the uniform distribution for the zodiac sign, 03:46 and that's usually the one I give you. 03:47 1 over k, 1 over k, because well that's 03:49 sort of the zero, the central point for all distributions. 03:53 That's the point, the center of what we call the simplex. 03:57 But you can have another benchmark 03:58 that sort of makes sense. 03:59 So for example this is an actual dataset where 275 jurors were 04:04 identified, racial group were collected, 04:09 and you actually might want to know 04:10 if you know juries in this country 04:12 are actually representative of the actual population. 04:17 And so here of course, the population 04:19 is not uniformly distributed according to racial group. 04:23 And the way you actually do it is you 04:24 actually go on Wikipedia, for example, 04:26 and you look at the demographics of the United States, 04:28 and you find that the proportion of white is 72%, black is 7%, 04:33 Hispanic is 12, and other is about 9%. 04:41 So that's a total of 1. 04:43 And this is what we actually measured for some jurors. 04:46 So for this guy, you can actually 04:48 run the chi square test. 04:49 You have the estimated proportion, which 04:51 comes from this first line. 04:53 You have the tested proportion, p0, 04:55 that comes from the second line, and you 04:56 might want to check if those things actually 04:58 correspond to each other. 04:59 OK, so I'm not going to do it for you, 05:01 but I sort of invite you to do it 05:03 and test, and see how this compares 05:05 to the quantiles of the appropriate chi 05:07 square distribution and see what you can conclude from those two 05:10 things. 05:12 All right. 05:12 So this was the multinomial case. 05:15 So this is essentially what we did. 05:17 We computed the MLE under the right constraint, 05:19 and that was our test statistic that converges 05:20 to the chi square distribution. 05:22 So if you've seen it before, that's 05:23 all that was given to you. 05:24 Now we know why the normalization here 05:27 is p0 j and not p0 j squared or square root of p0 j, or even 1. 05:33 I mean it's not clear that this should 05:35 be the right normalization, but we 05:36 know that's what comes from taking 05:38 the right normalization, which comes from the Fisher 05:41 information. 05:42 All right? 05:43 OK. 05:43 05:47 The thing I wanted to move onto, so we've basically covered 05:50 chi square test. 05:51 Are there any questions about chi square test? 05:53 And for those of you who were not here on Thursday, 05:56 I'm really just-- 05:57 do not pretend I just did it. 05:59 That's something we did last Thursday. 06:01 But are there any questions that arose 06:03 when you were reading your notes, things 06:04 that you didn't understand? 06:06 Yes. 06:06 AUDIENCE: Is there like a formal name? 06:09 Before we had talked about how what we call the Fisher 06:12 information [INAUDIBLE],, still has the same [INAUDIBLE] 06:17 because it's the same number. 06:21 PROFESSOR: So it's not the Fisher. 06:23 The Fisher information does not exist in this case. 06:25 And so there's no appropriate name for this. 06:27 It's the pseudoinverse of the asymptotic covariance matrix, 06:30 and that's what it is. 
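To make the juror example concrete, here is a minimal Python sketch of the statistic the lecture builds, T_n = n Σ_j (p̂_j − p0_j)² / p0_j, compared against a chi square quantile with K − 1 degrees of freedom. The null proportions are the ones quoted above; the observed counts below are hypothetical placeholders, since the counts from the slide are not reproduced in this transcript.

```python
# A minimal sketch of the chi-square goodness-of-fit test for the juror
# example. The null proportions p0 are the ones quoted in the lecture;
# the observed counts are hypothetical placeholders (the slide with the
# actual counts is not reproduced in this transcript).
import numpy as np
from scipy.stats import chi2

p0 = np.array([0.72, 0.07, 0.12, 0.09])   # white, black, Hispanic, other
counts = np.array([205, 26, 25, 19])      # hypothetical counts, n = 275
n = counts.sum()
p_hat = counts / n

# T_n = n * sum_j (p_hat_j - p0_j)^2 / p0_j, the normalization from the lecture
T_n = n * np.sum((p_hat - p0) ** 2 / p0)

alpha = 0.05
df = len(p0) - 1                          # K - 1 degrees of freedom
q = chi2.ppf(1 - alpha, df)
print(f"T_n = {T_n:.2f}, chi2({df}) 95% quantile = {q:.2f}, p-value = {chi2.sf(T_n, df):.3f}")
print("reject H0" if T_n > q else "fail to reject H0")
```

Written with counts instead of proportions, this is the familiar Pearson statistic Σ (observed − expected)² / expected.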
06:32 I don't know if I mentioned it last time, 06:34 but there's this entire field that uses-- 06:36 you know, for people who really aspire to differential geometry 06:39 but are stuck in the stats department, 06:41 and there's this thing called information geometry, which 06:43 is essentially studying the manifolds associated 06:47 to the Fisher information metric, the metric that's 06:50 associated to Fisher information. 06:52 And so those of course can be lower dimensional manifolds, 06:55 not only distorts the geometry but forces everything 06:58 to live on a lower dimension, which 06:59 is what happens when your Fisher information does not exist. 07:02 And so there's a bunch of things that you 07:04 can study, what this manifold looks like, et cetera. 07:06 But no, there's no particular terminology here 07:09 about going here. 07:12 To be fair, within the scope of this class, 07:14 this is the only case where you-- 07:18 multinomial case is the only case 07:19 where you typically see a lack of a Fisher information matrix. 07:26 And that's just because we have these extra constraints 07:28 that the sum of the parameters should be 1. 07:30 And if you have an extra constraint that 07:31 seems like it's actually remove one degree of freedom, 07:34 this will happen inevitably. 07:36 And so maybe what you can do is reparameterize. 07:40 So if I actually reparameterize everything function of p1 07:44 to p k minus 1, and then 1 minus the sum, 07:46 this would not have happened. 07:48 Because I have only a k-dimensional space. 07:51 So there's tricks around this to make it 07:53 exist if you want it to exist. 07:56 Any other question? 07:58 All right. 07:59 So let's move on to Student's t-test. 08:02 We mentioned it last time. 08:03 So essentially you've probably done it 08:06 more even in the homework than you've done it in lectures, 08:09 but just quickly this is essentially the test. 08:12 That's the test when we have an actual data that comes 08:15 from a normal distribution. 08:16 There is no Central Limit Theorem that exists. 08:18 This is really to account for the fact 08:21 that for smaller sample sizes, it 08:24 might be the case that it's not exactly true that when 08:27 I look at xn bar minus mu divided by-- so if I look 08:33 at xn bar minus mu divided by sigma times square root of n, 08:37 then this thing should have N 0, 1 distribution approximately. 08:41 Right? 08:42 By the Central Limit Theorem. 08:45 So that's for n large. 08:47 But if n is small, then it's still true 08:53 when the data is N mu, sigma squared, 09:00 then it's true that square root of n-- 09:02 09:09 so here it's approximately. 09:12 And this is always true. 09:14 But I don't know sigma in practice, right? 09:16 Maybe mu, it comes from my, maybe mu comes from my mu 09:20 0, maybe something from the test statistic 09:23 where mu actually is here. 09:25 But for this guy I'm going to have inevitably 09:27 to find an estimator. 09:29 And now in this case, for small n, this is no longer true. 09:32 And what the t statistic is doing 09:34 is essentially telling you what the distribution of this guy 09:36 is. 09:37 So what you should say is that now this guy 09:41 has a t distribution with n minus 1 degrees of freedom. 09:44 That's basically the laundry list stats 09:47 that you would learn. 09:48 It says just look at a different table, that's what it is. 09:50 But we actually defined what a t distribution was. 
09:55 And a t distribution is basically 09:58 something that has the same distribution as some N 0, 1, 10:03 divided by the square root of a chi square 10:06 with d degrees of freedom divided by d. 10:08 And that's a t distribution with d degrees of freedom. 10:12 And those two have to be independent. 10:14 10:20 And so what I need to check is that this guy over there 10:24 is of this form. 10:25 10:37 OK? 10:39 So let's look at the numerator. 10:41 Well, square root of n, xn bar minus mu. 10:45 What is the distribution of this thing? 10:47 Is it an N 0, 1? 10:50 AUDIENCE: N 0, sigma squared? 10:52 PROFESSOR: N 0, sigma squared, right. 10:54 10:58 So I'm not going to put it here. 11:00 So if I want this guy to be N 0, 1, 11:01 I need to divide by sigma, that's what we have over there. 11:04 11:06 So that's my N 0, 1 that's going to play the role of this guy 11:09 here. 11:11 So if I want to go a little further, 11:13 I need to just say, OK, now I need to have square root of n, 11:21 and I need to find something here 11:23 that looks like my square root of chi square divided 11:27 by-- yeah? 11:28 AUDIENCE: Really quick question. 11:29 The equals sign with the d on top, that's just defined as? 11:32 PROFESSOR: No, that's just the distribution. 11:35 So, I don't know. 11:37 AUDIENCE: Then never mind. 11:38 PROFESSOR: Let's just write it like that, if you want. 11:41 I mean, that's not really appropriate to have. 11:44 Usually you write only one distribution 11:46 on the right-hand inside of this little thing. 11:48 So not just this complicated function of distributions. 11:51 This is more like to explain. 11:53 OK, and so usually the thing you should 11:54 say that t is equal to this X divided by square root of Z 11:58 divided by d where X has normal distribution, 12:01 Z has chi square distribution with d degrees of freedom. 12:06 So what do we need here? 12:07 Well I need to have something which looks like my sigma hat, 12:10 right? 12:10 So somehow inevitably I'm going to need to have sigma hat. 12:13 12:16 Now of course I need to divide this by my sigma 12:18 so that my sigma goes away. 12:19 12:22 And so now this thing here-- 12:25 sorry, I should move on to the right, OK. 12:27 And so this thing here, so sigma hat is square root of Sn. 12:33 And now I'm almost there. 12:35 So this thing is actually equal to square root of n. 12:38 12:47 But this thing here is actually not a-- 12:51 12:55 so this thing here follows a distribution 12:57 which is actually a chi square, square root 13:00 of a chi square distribution divided by n. 13:11 13:15 Yeah, that's the square root chi square distribution 13:18 with n minus 1 degrees of freedom divided 13:20 by n, because sigma hat is equal to 1 over n sum 13:25 from i equal 1 to n, xi minus x bar squared. 13:30 And we just said that this part here 13:32 was a chi square distribution. 13:34 We didn't just say it, we said it a few lectures years back, 13:36 that this thing was a chi square distribution, and the fact 13:39 that the presence of this x bar here 13:42 was actually removing one degree of freedom from this sum. 13:46 OK, so this guy here has the same distribution 13:48 as a chi square n minus 1 divided by n. 13:52 So I need to actually still arrange this thing a little bit 13:56 to have a t distribution. 13:58 I should not see n here, but I should n minus 1. 14:01 The d is the same as this d here. 14:06 And so let me make the correction 14:07 so that this actually happens. 
14:09 Well, if I actually write this to be equal to-- 14:14 so if I write square root of n minus 1, as on the slide, 14:19 times xn bar minus mu divided by-- 14:25 well let me write it as square root of Sn, 14:27 which is my sigma hat. 14:29 Then what this thing is actually equal to, 14:33 it follows a N 0, 1, divided by the square root 14:39 of my chi square distribution with n 14:40 minus 1 degrees of freedom. 14:42 And here the fact that I multiply 14:43 by square root of n minus 1, and I 14:45 have the square root of n here, is essentially the same 14:47 as dividing here by n minus 1. 14:51 And that's my tn distribution. 14:54 My t distribution with n minus 1 degrees of freedom. 14:58 Just by definition of what this thing is. 15:00 OK? 15:00 15:22 All right. 15:22 Yes? 15:23 AUDIENCE: Where'd you get the square root from? 15:26 PROFESSOR: This guy? 15:27 Oh sorry, that's sigma squared. 15:28 Thank you. 15:30 That's the estimator of the variance, not the estimator 15:32 of the standard deviation. 15:33 And when I want to divide it I divide by standard deviation. 15:35 Thank you. 15:38 Any other question or remark? 15:40 AUDIENCE: Shouldn't you divide by sigma squared? 15:42 The actual. 15:45 The estimator for the variance is 15:47 equal to sigma squared times chi square, right? 15:52 PROFESSOR: The estimator for the variance. 15:55 Oh yes, you're right. 15:56 So there's a sigma squared here. 15:59 Is that what you're asking? 16:00 AUDIENCE: Yeah. 16:00 PROFESSOR: Yes, absolutely. 16:01 And that's where, it get cancels here. 16:03 It gets canceled here. 16:04 16:10 OK? 16:10 16:13 So this is really a sigma squared times chi square. 16:15 16:20 OK. 16:21 So the fact that it's sigma squared 16:22 is just because I can pull out sigma 16:24 squared and just think those guys N 0, 1. 16:26 16:32 All right. 16:33 So that's my t distribution. 16:34 Now that I actually have a pivotal distribution, what I do 16:37 is that I form the statistic. 16:40 Here I called it Tn tilde. 16:42 16:52 OK. 16:53 And what is this thing? 16:54 I know that this has a pivotal distribution. 16:56 So for example, I know that the probability 16:59 that Tn tilde in absolute value exceeds some number that I'm 17:05 going to call q alpha over 2 for the t n minus 1, 17:11 is equal to alpha. 17:13 So that's basically, remember the t distribution 17:16 has the same shape as the Gaussian distribution. 17:19 What I'm finding is, for this t distribution, 17:21 some number q alpha over 2 of t n minus 1 17:26 and minus q alpha over 2 of t minus 1. 17:29 So those are different from the Gaussian one. 17:31 Such that the area under the curve 17:33 here is alpha over 2 on each side 17:36 so that the probability that my absolute value exceeds 17:39 this number is equal to alpha. 17:43 And that's what I'm going to use to reject the test. 17:46 So now my test becomes, for H0, say mu is equal to some mu 0, 17:59 versus H1, mu is not equal to mu 0. 18:05 18:08 The rejection region is going to be equal to the set on which 18:13 square root of n minus 1 times xn bar minus mu 0 this time, 18:19 divided by square root of Sn exceeds, in absolute value, 18:25 exceeds q-- sorry that's already here-- 18:28 exceeds q alpha over 2 of t n minus 1. 18:34 So I reject when this thing increases. 18:36 The same as the Gaussian case, except that rather than reading 18:39 my quantiles from the Gaussian table 18:41 I read them from the Student table. 18:44 It's just the same thing. 18:45 So they're just going to be a little bit farther. 
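As a concrete sketch (not from the lecture itself), here is a small implementation of this two-sided test, using exactly the normalization on the board, where S_n is the biased (1/n) sample variance. Since √(n−1)/√S_n equals √n divided by the unbiased sample standard deviation, the statistic and p-value should agree with scipy.stats.ttest_1samp.

```python
# A minimal sketch of the two-sided one-sample t test as written on the
# board: reject when |sqrt(n-1) (xbar - mu0) / sqrt(S_n)| exceeds the
# alpha/2 quantile of t_{n-1}, where S_n is the biased (1/n) variance.
import numpy as np
from scipy.stats import t

def t_test(x, mu0, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    S_n = np.mean((x - xbar) ** 2)          # (1/n) * sum (x_i - xbar)^2
    T = np.sqrt(n - 1) * (xbar - mu0) / np.sqrt(S_n)
    q = t.ppf(1 - alpha / 2, df=n - 1)      # q_{alpha/2} of t_{n-1}
    p_value = 2 * t.sf(abs(T), df=n - 1)
    return T, q, p_value, abs(T) > q        # reject if the last entry is True

# Example with a small, hypothetical Gaussian sample of size 15
rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=15)
print(t_test(x, mu0=0.0))
```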
18:48 So this guy here is just going to be a little bigger 18:52 than the one for the Gaussian one, 18:54 because it's going to require me a little more evidence 18:57 in my data to be able to reject because I 18:59 have to account for the fluctuations of sigma hat. 19:01 19:09 So of course Student's test is used everywhere. 19:12 People use only t tests, right? 19:15 If you look at any data point, any output, 19:19 even if you had 500 observations, 19:21 if you look at the statistical software output 19:23 it's going to say t test. 19:25 And the reason why you see t test 19:26 is because somehow it's felt like it's not asymptotic. 19:29 You don't need to actually do, you 19:31 know, to be particularly careful. 19:33 And anyway, if n is equal to 500, 19:35 since the two curves are above each other 19:37 it's basically the same thing. 19:39 So it doesn't really change anything. 19:40 So why not use the t test? 19:43 So it's not asymptotic. 19:44 It doesn't require Central Limit Theorem to kick in. 19:47 And so in particular it be run if you have 15 observations. 19:50 Of course, the drawback of the Student test 19:52 is that it relies on the assumption 19:54 that the sample is Gaussian, and that's something 19:56 we really need to keep in mind. 19:57 If you have a small sample size, there is no magic going on. 20:01 It's not like Student t test allows you to get rid 20:04 of this asymptotic normality. 20:06 It sort of assumes that it's built in. 20:08 It assumes that your data has a Gaussian distribution. 20:14 So if you have 15 observations, what are you going to do? 20:18 You want to test if the mean is equal to 0 or not equal to 0, 20:21 but you have only 15 observations. 20:24 You have to somehow assume that your data is Gaussian. 20:27 But if the data is given to you, this is not math, 20:30 you actually have to check that it's Gaussian. 20:32 And so we're going to have to find 20:33 a test that, given some data, tells us whether it's Gaussian 20:38 or not. 20:39 If I have 15 observations, 8 of them 20:42 are equal to plus 1 and 7 of them are equal to minus 1, 20:46 then it's pretty unlikely that you're 20:47 going to be able to conclude that your data has a Gaussian 20:50 distribution. 20:51 However, if you see some sort of spread around some value, 20:54 you form a histogram maybe and it sort of 20:56 looks like it's a Gaussian, you might 20:57 want to say it's Gaussian. 20:59 And so how do we make this more quantitative? 21:01 Well, the sad answer to this question 21:05 is that there will be some tests that make it quantitative, 21:08 but here, if you think about it for one second, what is going 21:11 to be your null hypothesis? 21:13 Your null hypothesis, since it's one point, 21:15 it's going to be that it's Gaussian, 21:17 and then the alternative is going 21:19 to be that it's not Gaussian. 21:21 So what it means is that, for the first time 21:23 in your statistician life, you're 21:26 going to want to conclude that H0 is the true one. 21:30 You're definitely not going to want 21:31 to say that it's not Gaussian, because then everything you 21:34 know is sort of falling apart. 21:36 And so it's kind of a weird thing where 21:39 you're sort of going to be seeking tests 21:41 that have no power basically. 21:43 You're going to want to test that, and that's the nature. 21:46 The amount of alternatives, the number 21:49 of ways you can be not Gaussian, is so huge 21:52 that all tests are sort of bound to have very low power. 
21:56 And so that's why people are pretty happy with the idea 21:58 that things are Gaussian, because it's 22:00 very hard to find a test that's going 22:01 to reject this hypothesis. 22:04 And so we're even going to find some tests that are visual, 22:08 where you're going to be able to say, 22:10 well, it sort of looks Gaussian to me. 22:12 That allows you to deal with the borderline cases 22:16 pretty efficiently. 22:17 We'll actually see a particular example. 22:19 All right, so this theory of testing 22:22 whether data comes from a particular distribution 22:24 is called goodness of fit. 22:26 Is this distribution a good fit for my data? 22:31 That's the goodness of fit test. 22:33 We have just seen a goodness of fit test. 22:36 What was it? 22:36 22:41 Yeah. 22:44 The chi square test, right? 22:46 In the chi square test, we were given a candidate PMF 22:49 and we were testing if this was a good fit for our data. 22:52 That was a goodness of fit test. 22:54 So of course the multinomial is one example, 22:57 but really what we have in the back of our mind is 22:59 I want to test if my data is Gaussian. 23:01 That's basically the usual thing. 23:03 And just like you always see the t test as the standard output 23:06 from statistical software whether you ask for it or not, 23:09 there will be a test for normality, 23:11 whether you ask for it or not, from any statistical software package. 23:16 All right. 23:17 So a goodness of fit test looks as follows. 23:19 There's a random variable X and you're 23:21 given i.i.d. copies of X, X1 to Xn; 23:23 they come from the same distribution. 23:25 And you're going to ask the following question: does X have 23:28 a standard normal distribution? 23:31 So for the t test, that's definitely 23:33 the kind of question you may want to ask. 23:35 Does X have a uniform distribution on 0, 1? 23:39 That's different from the distribution 1 23:41 over k, 1 over k; it's the continuous notion 23:44 of uniformity. 23:47 And for example, you might want to test that-- 23:49 so there's actually a nice exercise, which 23:51 is if you look at the p-values. 23:53 So we've defined what the p-values were. 23:55 And the p-value's a number between 0 and 1, right? 23:59 And you could actually ask yourself, 24:01 what is the distribution of the p-value under the null? 24:04 So the p-value is a random number. 24:08 It's the probability-- so the p-value-- let's look 24:10 at the following test. 24:13 24:17 H0, mu is equal to 0, versus H1, mu is not equal to 0. 24:25 And I know that the p-value is-- 24:28 so I'm going to form what? 24:29 I'm going to look at Xn bar minus mu 24:34 times square root of n divided by-- let's say that we 24:37 know sigma for one second. 24:40 Then the p-value is the probability 24:43 that this is larger than square root of n little xn 24:48 bar minus mu, minus 0 actually in this case, 24:54 divided by sigma, where this guy is the observed value. 24:59 25:04 OK. 25:05 So now you could say, well, how is that a random variable? 25:09 It's just a number. 25:11 It's just a probability of something. 25:13 But then I can view this as a function of this guy 25:17 here when I plug a random variable back in. 25:23 So what I mean by this is that if I look at this value 25:26 here, if I say that phi is the CDF of N 0, 1, 25:34 then the p-value is the probability 25:36 that it exceeds this. 25:37 So that's the probability that I'm either here or here. 25:41 25:44 AUDIENCE: [INAUDIBLE] 25:47 PROFESSOR: No, it's not, right?
25:49 AUDIENCE: [INAUDIBLE] 25:52 PROFESSOR: This is a big X and this is a small x. 25:55 This is just where you plug in your data. 25:57 The p-value is the probability that you 25:59 have more evidence against your null 26:03 than what you already have. 26:05 OK, so now I can write it in terms 26:06 of cumulative distribution functions. 26:09 So this is what? 26:09 This is phi of this guy, which is minus this thing here. 26:14 26:17 Well it's basically 2 times this guy, 26:19 phi of minus square root of n, Xn bar divided by sigma. 26:27 26:30 That's my p-value. 26:31 If you give me data, I'm going to compute the average 26:33 and plug it in there, and it can spit out the p-value. 26:36 Everybody agrees? 26:37 26:39 So now I can view this, if I start now looking back I say, 26:42 well, where does this data come from? 26:45 Well, it could be a random variable. 26:48 It came from the realization of this thing. 26:51 So I can try to, I can think of this value, 26:54 where now this is a random variable because I just plugged 26:57 in a random variable in here. 26:59 So now I view my p-value as a random variable. 27:04 So I keep switching from small x to large X. Everybody 27:06 agrees what I'm doing here? 27:08 So I just wrote it as a deterministic function 27:11 of some deterministic number, and now the function 27:14 stays deterministic but the number becomes random. 27:17 And so I can think of this as some statistic of my data. 27:21 And I could say, well, what is the distribution 27:23 of this random variable? 27:26 Now if my data is actually normally distributed, 27:29 so I'm actually under the null, so 27:31 under the null, that means that Xn bar times square root of n 27:37 divided by sigma has what distribution? 27:40 27:48 Normal? 27:48 27:56 Well it was sigma, I assume I knew it. 27:59 So it's N 0, 1, right? 28:00 I divided by sigma here. 28:02 OK? 28:03 So now I have this random variable. 28:04 28:15 And so my random variable is now 2 phi of minus absolute value 28:24 of a Gaussian. 28:24 28:34 And I'm actually interested in the distribution of this thing. 28:40 I could ask that. 28:41 Anybody has an idea of how you would 28:43 want to tackle this thing? 28:45 If I ask you, what is the distribution 28:46 of a random variable, how do you tackle this question? 28:48 28:53 There's basically two ways. 28:54 One is to try to find something that 28:55 looks like the expectation of h of x for all h. 29:02 And you try to write this using change of variables 29:04 and something that looks like integral of h of x p of x dx. 29:09 And then you say, well, that's the density. 29:12 If you can read this for any h, then that's 29:15 the way you would do it. 29:16 But there's a simpler way that does not 29:19 involve changing variables, et cetera, 29:21 you just try to compute the cumulative distribution 29:23 function. 29:25 So let's try to compute the probability 29:26 that 2 phi minus N 0, 1, is less than t. 29:34 And maybe we can find something we know. 29:38 OK. 29:38 Well that's equal to what? 29:39 That's the probability that a minus N 0, 29:43 well let's say that an N 0, 1-- 29:45 sorry, N 0, 1 absolute value is greater than minus phi inverse 29:57 of t over 2. 29:58 30:04 And that's what? 30:05 Well, it's just the same thing that we had before. 30:07 It's equal to-- so if I look again, 30:12 this is the probability that I'm actually on this side 30:15 or that side of this number. 30:17 And this number is what? 30:18 It's minus phi of t over 2. 30:25 Why do I have a minus here? 
30:27 30:32 That's fine, OK. 30:33 So it's actually not this, it's actually the probability 30:36 that my absolute value-- 30:39 oh, because phi inverse. 30:41 OK. 30:42 Because phi inverse is-- 30:44 so I'm going to look at t between 0 30:48 and-- so this number is ranging between 0 and 1. 30:52 So it means that this number is ranging between 0-- 30:55 well, the probability that something is less than t 30:58 should be ranging between the numbers that this guy takes, 31:03 so that's between 0 and 2. 31:04 31:11 Because this thing takes values between 0 and 2. 31:14 I want to see 0 and 1, though. 31:16 31:21 AUDIENCE: Negative absolute value is always less 31:23 than [INAUDIBLE]. 31:24 31:29 PROFESSOR: Yeah. 31:29 You're right, thank you. 31:30 So this is always some number which is less than 0, 31:34 so the probability that the Gaussian is less 31:36 than this number is always less than the probability 31:38 it's less than 0, which is 1/2, so t only 31:40 has to be between 0 and 1. 31:41 Thank you. 31:43 And so now for t between 0 and 1, then 31:47 this guy is actually becoming something which is positive, 31:50 for the same reason as before. 31:52 And so that's what? 31:53 That's just basically 2 times phi of phi inverse of t over 2. 32:04 32:07 That's just playing with the symmetry a little bit. 32:09 You can look at the areas under the curve. 32:11 And so what it means is that those two guys cancel. 32:13 This is the identity. 32:15 And so this is equal to t. 32:18 So which distribution has a density-- 32:23 sorry, which distribution has a cumulative distribution 32:27 function which is equal to t for t between 0 and 1? 32:32 That's the uniform distribution, right? 32:34 So it means that this guy follows a uniform distribution 32:37 on the interval 0, 1. 32:39 32:44 And you could actually check that. 32:45 For any test you're going to come up with, 32:47 this is going to be the case. 32:48 Your p-value under the null will have a distribution 32:52 which is uniform. 32:54 So now if somebody shows up and says, here's my test, 32:58 it's awesome, it just works great. 33:00 I'm not going to explain to you how I built it, 33:02 it's a complicated statistics that 33:03 involve moments of order 27. 33:06 And I'm like, OK, you know, how am I 33:08 going to test that your test statistic actually makes sense? 33:11 Well one thing I can do is to run a bunch of data, 33:16 draw a bunch of samples, compute your test statistic, 33:18 compute the p-value, and check if my p-value has 33:22 a uniform distribution on the interval 0, 1. 33:27 But for that I need to have a test that, 33:29 given a bunch of observations, can tell me 33:31 whether they're actually distributed uniformly 33:33 on the interval 0, 1. 33:34 And again one thing I could do is build a histogram 33:36 and see if it looks like that of a uniform, 33:40 but I could also try to be slightly more quantitative 33:42 about this. 33:43 AUDIENCE: Why does the [INAUDIBLE] have 33:44 to be for a [INAUDIBLE]? 33:47 PROFESSOR: For two tests? 33:48 AUDIENCE: For each test. 33:51 Why does the p-value have to be normal? 33:54 I mean, uniform. 33:55 PROFESSOR: It's uniform under the null. 33:57 So because my test statistic was built under the null, 34:00 and so I have to be able to plug in the right value in there, 34:03 otherwise it's going to shift everything 34:04 for this particular test. 34:06 AUDIENCE: At the beginning while your probabilities 34:08 were of big Xn, that thing. 34:09 That thing is the p-value. 
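Here is a minimal simulation sketch of that idea, with arbitrary illustrative choices of n and the number of replications: draw many datasets under the null with known sigma, compute the p-value 2Φ(−√n |X̄_n| / σ) for each, and check that the resulting sample of p-values looks uniform on [0, 1].

```python
# A minimal simulation sketch of the claim just derived: under H0 (mu = 0,
# known sigma), the p-value 2*Phi(-sqrt(n)|xbar|/sigma), viewed as a random
# variable, is uniform on [0, 1]. Sample size and number of replications
# below are arbitrary illustrative choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps, sigma = 50, 10_000, 1.0

xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)   # data drawn under H0
pvals = 2 * norm.cdf(-np.sqrt(n) * np.abs(xbar) / sigma)

# Empirical quantiles of the p-values should sit close to the quantile
# levels themselves if the distribution is uniform on [0, 1].
levels = np.arange(0.1, 1.0, 0.1)
print(np.round(np.quantile(pvals, levels), 3))
print(np.round(levels, 3))
```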
34:11 PROFESSOR: That's the p-value, right? 34:13 That's the definition of the p-value. 34:15 AUDIENCE: OK. 34:15 34:17 PROFESSOR: So it's the probability 34:19 that my test statistic exceeds what I've actually observed. 34:23 AUDIENCE: So how you run the test is basically 34:26 you have your observations and plug them 34:29 into the cumulative distribution function for a normal, 34:33 and then see if it falls under the given-- 34:35 PROFESSOR: Yeah. 34:36 So my p-value is just this number 34:39 when I just plug in the values that I observe here. 34:42 That's one number. 34:43 For every dataset you're going to give me, 34:45 it's going to be one number. 34:46 Now what I can do is generate a bunch of datasets of size n, 34:51 like 200 of them. 34:53 And then I'm going to have a new sample 34:55 of say 200, which is just the sample of 200 p-values. 34:59 And I want to test if those p-values have 35:00 a uniform distribution. 35:02 OK? 35:02 Because that's the distribution they should be having. 35:05 All right? 35:06 35:11 OK. 35:12 This one we've already seen. 35:13 Does x have a PMF with 30%, 50%, and 20%? 35:18 That's something I could try to test. 35:21 That looks like your grade point distribution for this class. 35:27 Well not exactly, but that looks like it. 35:30 So all these things are known as goodness of fit tests. 35:33 The goodness of fit test is something 35:34 that you want to know if the data that you have at hand 35:38 follows the hypothesized distribution. 35:41 So it's not a parametric test. 35:43 It's not a test that says, is my mean equal to 25 or not. 35:46 Is my proportion of heads larger than 1/2 or not? 35:51 It's something that says, my distribution 35:53 this particular thing. 35:54 35:57 So I'm going to write them as goodness of fit, G-O-F here. 36:00 You don't need to have parametric modeling to do that. 36:02 36:05 So how do I work? 36:06 So if I don't have any parametric modeling, 36:09 I need to have something which is somewhat non-parametric, 36:12 something that goes beyond computing the mean 36:14 and the standard deviation, something 36:16 that computes some intrinsic non-parametric aspect 36:19 of my data. 36:21 And just like here we made this computation, what we did 36:24 is we said well, if I actually check 36:28 that the CDF of my data, that my p-value is uniform, 36:34 then I know it's uniform. 36:35 So it means that the cumulative distribution function 36:37 has an intrinsic value about it that captures 36:39 the entire distribution. 36:41 Everything I need to know about my distribution 36:44 is captured by the cumulative distribution function. 36:47 Now I have an empirical way of computing, 36:49 I have a data-driven way of computing 36:52 an estimate for the cumulative distribution function, which 36:54 is using the old statistical trick which 36:57 consists of replacing expectations by averages. 37:00 So as I said, the cumulative distribution function 37:04 for any distribution, for any random variable, is-- 37:08 37:12 so F of t is the probability that X 37:17 is less than or equal to t, which 37:19 is equal to the expectation of the indicator 37:22 that X is less than or equal to t. 37:26 That's the definition of a probability. 37:28 And so here I'm just going to replace expectation 37:31 by the average. 37:34 That's my usual statistical trick. 37:37 And so my estimator Fn for-- 37:42 the distribution is going to be 1 over n sum from i 37:45 equal 1 to n of these indicators. 37:48 37:53 And this is called the empirical CDF. 
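Here is a small sketch of the empirical CDF just defined; the helper name ecdf is mine, not notation from the lecture.

```python
# The empirical CDF: F_n(t) is the fraction of observations <= t,
# i.e. an average of indicators 1{X_i <= t}.
import numpy as np

def ecdf(x):
    """Return a function t -> F_n(t) built from the sample x."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    def F_n(t):
        # number of observations <= t, divided by n
        return np.searchsorted(x, t, side="right") / n
    return F_n

rng = np.random.default_rng(0)
sample = rng.standard_normal(20)
F_n = ecdf(sample)
for t in (-1.0, 0.0, 1.0):
    print(t, F_n(t))   # proportion of the 20 points that are <= t
```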
37:58 It's just the data version of the CDF. 38:01 38:04 So I just replaced this expectation here by an average. 38:08 38:13 Now when I sum indicators, I'm actually 38:17 counting the number of them that satisfy something. 38:20 So if you look at what this guy is, 38:24 this is the number of X i's that is less than t, right? 38:32 And so if I divide by n, it's the proportion of observations 38:35 I have that are less than t. 38:36 38:41 That's what the empirical distribution is. 38:43 38:46 That's what's written here, the number of data points 38:50 that are less than t. 38:52 And so this is going to be something 38:53 that's sort of trying to estimate one or the other. 38:57 And the law of large number actually 38:59 tells me that for any given t, if n is large enough, Fn of t 39:03 should be close to F of t. 39:05 Because it's an average. 39:07 And this entire thing, this entire statistical trick, 39:10 which consists of replacing expectations by averages, 39:13 is justified by the law of large number. 39:16 Every time we used it, that was because the law of large number 39:19 sort of guaranteed to us that the average was 39:21 close to the expectation. 39:23 39:26 OK. 39:27 So law of large numbers tell me that Fn of t converges, 39:30 so that's the strong law, says that almost surely actually 39:34 Fn of t goes to F of t. 39:35 39:40 And that's just for any given t. 39:43 Is there any question about this? 39:46 That averages converge to expectation, 39:48 that's the law of large number. 39:49 39:52 And almost surely we could say in probability 39:54 it's the same, that would be the weak law of large number. 39:57 40:00 Now this is fine. 40:01 For any given t, the average converges to the true. 40:05 It just happens that this random variable is indexed by t, 40:09 and I could do it for t equals 1 or 2 or 25, 40:12 and just check it again. 40:14 But I might want to check it for all t's at once. 40:18 And that's actually a different result. 40:19 That's called a uniform result. I 40:21 want this to hold for all t at the same time. 40:25 And it may be the case that it works for each t individually 40:28 but not for all t's at the same time. 40:31 What could happen is that for t equals 1 40:33 it converges at a certain rate, and for t equals 2 40:36 it converges at a bit of a slower rate, 40:37 and for t equals 3 at a slower rate and slower rate. 40:41 And so as t goes to infinity, the rate is going to vanish 40:43 and nothing is going to converge. 40:45 That could happen. 40:46 I could make this happen at a finite point. 40:48 There's many ways where it could make this happen. 40:50 Let's see how that could work. 40:52 I could say, well, actually no. 40:54 I still need to have this at infinity for some reason. 40:59 It turns out that this is still true uniformly, 41:01 and this is actually a much more complicated result 41:03 than the law of large number. 41:05 It's called Glivenko-Cantelli Theorem. 41:07 And the Glivenko-Cantelli Theorem 41:09 tells me that, for all t's at once, Fn converges to F. 41:14 So let me just show you quickly why 41:18 this is just a little bit stronger than the one 41:22 that we had. 41:25 If sup is confusing you, think of max. 41:29 It's just the max over an infinite set. 41:31 And so what we know is that Fn of t goes to F of t 41:40 as n goes to infinity. 41:43 And that's almost surely. 41:45 And that's the law of large numbers. 41:48 Which is equivalent to saying that Fn of t minus F of t as n 41:54 goes to infinity converges almost surely to 0, right? 
41:59 This is the same thing. 42:01 Now I want this to happen for all t's at once. 42:07 So what I'm going to do-- oh, and this is actually 42:09 equivalent to this. 42:11 And so what I'm going to do is I'm going 42:12 to make it a little stronger. 42:14 So here the arrow only goes one way. 42:16 And this is where the sup for t in R of Fn of t. 42:20 42:26 And you could actually show that this happens also 42:28 almost surely. 42:29 42:35 Now maybe almost surely is a bit more 42:37 difficult to get a grasp on. 42:39 42:43 Does anybody want to see, like why this statement for this sup 42:48 is strictly stronger than the one that holds individually 42:51 for all t's? 42:52 You want to see that? 42:54 OK, so let's do that. 42:54 So forget about it almost surely for one second. 42:57 Let's just do it in probability. 42:59 The fact that Fn of t converges to F of t for all t, 43:09 in probability means that this goes to 0 as n goes 43:12 to infinity for any epsilon. 43:13 43:17 For any epsilon in t we know we have this. 43:19 That's the convergence in probability. 43:22 Now what I want is to put a sup here. 43:24 43:28 The probability that the sup is lower than epsilon, 43:32 might be actually always larger than, never go to 0 43:38 in some cases. 43:39 It could be the case that for each given t, 43:42 I can make n large enough so that this probability becomes 43:46 small. 43:47 But then maybe it's an n of t. 43:49 So this here means that for any-- 43:53 maybe I shouldn't put, let me put a delta here. 43:56 So for any epsilon, for any t and for any epsilon, 44:02 there exists n, which could depend on both epsilon 44:09 and t, such that the probability that Fn t 44:15 minus F of t exceeding delta is less than epsilon t. 44:25 There exists an n and a delta. 44:29 No, that's for all delta, sorry. 44:30 44:34 So this is true. 44:36 That's what this limit statement actually means. 44:40 But it could be the case that now when I take the sup over t, 44:43 maybe that n of t is something that looks like t. 44:47 44:50 Or maybe, well, integer part of t. 44:54 It could be, right? 44:56 I don't say anything. 44:57 It's just an n that depends on t. 44:59 So if this n is just t, maybe t over epsilon, 45:04 because I want epsilon. 45:05 Something like this. 45:07 Well that means that if I want this 45:09 to hold for all t's at once, I'm going 45:11 to have to go for the n that works for all t's at once. 45:15 But there's no such n that works for all t's at once. 45:19 The only n that works is infinity. 45:21 And so I cannot make this happen for all of them. 45:24 What Glivenko-Cantelli tells you, 45:26 it's actually this is not something that holds like this. 45:29 That the n that depends on t, there's actually one largest n 45:33 that works for all the t's at once, and that's it. 45:37 45:39 OK. 45:39 So just so you know why this is actually a stronger statement, 45:44 and that's basically how it works. 45:48 Any other question? 45:50 Yeah. 45:51 AUDIENCE: So what's the position for this 45:53 to have, because the random variable have 45:54 a finite mean, finite variance? 45:57 PROFESSOR: No. 45:58 Well the random variable does have finite mean 46:00 and finite variance, because the random variable 46:02 is an indicator. 46:03 So it has everything you want. 46:04 This is one of the nicest random variables, 46:06 this is a Bernoulli random variable. 46:08 So here when I say law of large number, that this holds. 46:11 Where did I write this? 46:12 I think I erased it. 46:14 Yeah, the one over there. 
46:15 This is actually the law of large numbers 46:16 for Bernoulli random variables. 46:17 They have everything you want. 46:18 They're bounded. 46:21 Yes. 46:21 AUDIENCE: So I'm having trouble understanding 46:23 the first statement. 46:25 So it says, for all epsilon and all t, 46:27 the probability of that-- 46:29 PROFESSOR: So you mean this one? 46:31 AUDIENCE: Yeah. 46:31 PROFESSOR: For all epsilon and all t. 46:34 So you fix them now. 46:36 Then the probability that, sorry, that was delta. 46:39 I changed this epsilon to delta at some point. 46:41 AUDIENCE: And then what's the second line? 46:44 PROFESSOR: Oh, so then the second line says that, 46:49 so I'm just rewriting in terms of epsilon delta 46:53 what this n goes to infinity means. 46:56 So it means that for any a t and delta, 47:01 so that's the same as this guy here, 47:04 then here I'm just going back to rewriting this. 47:06 It says that for any epsilon there exists an n large 47:08 enough such that, well, n larger than this thing 47:11 basically, such that this thing is less than epsilon. 47:14 47:18 So Glivenko-Cantelli tells us that not only is this thing 47:21 a good idea pointwise, but it's also a good idea uniformly. 47:25 And all it's saying is if you actually 47:27 were happy with just this result, you should 47:30 be even happier with that result. 47:32 And both of those results only tell you one thing. 47:34 They're just telling you that the empirical CDF 47:36 is a good estimator of the CDF. 47:38 47:41 Now since those indicators are Bernoulli distributions, 47:47 I can actually do even more. 47:50 So let me get this guy here. 47:52 48:00 OK so, those guys, Fn of t, this guy 48:14 is a Bernoulli distribution. 48:16 What is the parameter of this Bernoulli distribution? 48:20 What is the probability that it takes value 1? 48:22 48:26 AUDIENCE: F of t. 48:26 PROFESSOR: F of t, right? 48:28 It's just the probability that this thing happens, 48:30 which is F of t. 48:31 48:34 So in particular the variance of this guy 48:40 is the variance of this Bernoulli. 48:42 So it's F of t 1 minus F of t. 48:46 And I can use that in my Central Limit Theorem. 48:50 And Central Limit Theorem is just 48:51 going to tell me that if I look at the average 48:53 of random variables, I remove their mean, 48:56 so I look at square root of n Fn of t, 49:01 which I could really write as xn bar, right? 49:04 That's really just an xn bar. 49:06 Minus the expectation, which is F 49:08 of t, that comes from this guy. 49:11 Now if I divide by square root of the variance, that's 49:16 my square root p1 minus p. 49:18 Then this guy, by the Central Limit Theorem, 49:22 goes to some N 0, 1. 49:23 49:27 Which is the same thing as you see there, 49:28 except that the variance was put on the other side. 49:30 49:34 OK. 49:36 Do I have the same thing uniformly in t? 49:42 49:46 Can I write something that holds uniformly in t? 49:48 Well, if you think about it for one second 49:50 it's unlikely it's going to go too well. 49:53 In the sense that it's unlikely that the supremum 49:55 of those random variables over t is going to also be a Gaussian. 49:58 50:02 And the reason is that, well actually the reason 50:08 is that this thing is actually a stochastic process indexed 50:10 by t. 50:11 A stochastic process is just a sequence in random variables 50:14 that's indexed by, let's say time. 50:17 The one that's the most famous is Brownian motion, 50:20 and it's basically a bunch of Gaussian increments. 
50:24 So when you go from t to just t a little after that, 50:27 you have add some Gaussian into the thing. 50:30 And here it's basically the same thing that's happening. 50:33 And you would sort of expect, since each of this guy 50:35 is Gaussian, you would expect to see 50:37 something that looks like a Brownian motion at the end. 50:40 But it's not exactly a Brownian motion, 50:41 it's something that's called the Brownian bridge. 50:43 So if you've seen the Brownian motion, if I make 50:45 it start at 0 for example, so this is the value 50:49 of my Brownian motion. 50:50 Let's write it. 50:52 So this is one path, one realization of Brownian motion. 50:56 Let's call it w of t as t increases. 50:59 So let's say it starts at 0 and looks like something like this. 51:04 So that's what Brownian motion looks like. 51:06 It's just something that's pretty nasty. 51:11 I mean it looks pretty nasty, it's not continuous et cetera, 51:13 but it's actually very benign in some average way. 51:19 So Brownian motion is just something, 51:21 you should view this as if I sum some random variable that 51:25 are Gaussian, and then I look at this from farther and farther, 51:29 it's going to look like this. 51:31 And so here I cannot have a Brownian motion in the n, 51:34 because what is the variance of Fn of t minus F of t at t is 51:40 equal to 1? 51:40 51:43 Sorry, at t is equal to infinity. 51:47 AUDIENCE: 0. 51:48 PROFESSOR: It's 0, right? 51:49 The variance goes from 0 at t is negative infinity, 51:52 because at negative infinity F of t is going to 0. 51:56 And as t goes to plus infinity, F of t 51:59 is going to 1, which means that the variance of this guy as t 52:03 goes from negative infinity to plus infinity 52:06 is pinned to be 0 on each side. 52:09 And so my Brownian motion cannot, 52:12 when I describe a Brownian motion I'm just adding more 52:14 and more entropy to the thing and it's going all over 52:16 the place, but here what I want is that as I go back it should 52:20 go back to essentially 0. 52:21 It should be pinned down to a specific value at the n. 52:25 And that's actually called the Brownian bridge. 52:27 It's a Brownian motion that's conditioned 52:29 to come back to where it started essentially. 52:32 Now you don't need to understand Brownian bridges to understand 52:35 what I'm going to be telling you. 52:36 The only thing I want to communicate to you 52:39 is that this guy here, when I say a Brownian bridge, 52:42 I can go to any probabilist and they can tell you 52:45 all the probability properties of this stochastic process. 52:51 It can tell me the probability that it 52:52 takes any value at any point. 52:55 In particular, it can tell me-- 52:57 the supremum between 0 and 1 of this guy, 53:01 it could tell me what the cumulative distribution 53:03 function of this thing is, can tell me 53:04 what the density of this thing is, can tell me everything. 53:07 So it means that if I want to compute probabilities 53:09 on this object here, which is the maximum value that this guy 53:14 can take over a certain period of time, which is basically 53:17 this random variable. 53:18 So if I look at the value here, it's 53:20 a random variable that fluctuates. 53:22 It can tell me where it is with hyperability, can tell me 53:25 the quantiles of this thing, which is useful 53:28 because I can build a table and use it to compute my quantiles 53:31 and form tests from it. 53:34 So that's what actually is quite nice. 
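To make that picture concrete, here is a simulation sketch, purely illustrative, of the empirical process √n (F_n(t) − F(t)) for standard normal data: its pointwise standard deviation matches √(F(t)(1 − F(t))), the Bernoulli variance from a few minutes ago, so the fluctuations are pinned down near zero far out in the tails, which is exactly the Brownian bridge behavior being described.

```python
# Pointwise fluctuations of sqrt(n) (F_n(t) - F(t)) for N(0,1) data:
# the empirical standard deviation should match sqrt(F(t)(1 - F(t))),
# which vanishes as t goes to minus or plus infinity.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 200, 5_000
t_grid = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
F = norm.cdf(t_grid)

x = rng.standard_normal((reps, n))
F_n = (x[:, :, None] <= t_grid).mean(axis=1)   # F_n(t) for each replication, each t
process = np.sqrt(n) * (F_n - F)

print("empirical std dev :", np.round(process.std(axis=0), 3))
print("sqrt(F(t)(1-F(t))):", np.round(np.sqrt(F * (1 - F)), 3))
```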
53:36 It says that if I look at the square root of n Fn 53:38 hat minus sup over t, I get something 53:40 that looks like the sup of these Gaussians, 53:42 but it's not really sup of Gaussian, 53:44 it's sup of a Brownian motion. 53:46 Now there's something you should be very careful here. 53:48 I cheated a little bit. 53:49 I mean, I didn't cheat, I can do whatever I want. 53:51 But my notation might be a little confusing. 53:55 Everybody sees that this t here is not the same as this t here? 54:01 Can somebody see that? 54:03 Just because, first of all, this guy's between 0 and 1. 54:05 And this guy is in all of R. 54:09 What is this t here? 54:12 As a function of this t here? 54:14 54:21 This guy is F of this guy. 54:23 So really, if I want it to be completely transparent 54:27 and not save the keys of my keyboard, 54:32 I would read this as sup over t of Fn t minus F of t 54:42 goes to N distribution as n goes to infinity. 54:46 The supremum over t, again in R, so this guy is 54:50 for t in the entire real line, this guy 54:52 is for t in the entire real line. 54:54 But now I should write b of what? 54:58 F of t, exactly. 55:00 So really the t here is F of the original one. 55:04 And so that's a Brownian bridge, where 55:06 when t goes to infinity the Brownian bridge 55:09 goes from 0 to 1 and it looks like this. 55:11 A Brownian bridge at 0 is 0, at 1 it's 0. 55:16 And it does this. 55:18 But it doesn't stray too far because I condition 55:20 it to come back to this point. 55:22 That's what a Brownian bridge is. 55:26 OK. 55:28 So in particular, I can find a distribution for this guy. 55:33 And I can use this to build a test which is called 55:35 the Kolmogorov-Smirnov test. 55:37 55:39 The idea is the following. 55:40 It says, if I want to test some distribution 55:44 F0, some distribution that has a particular CDF F0, 55:49 and I plug it in under the null, then 55:52 this guy should have pretty much the same distribution 55:55 as the supremum of Brownian bridge. 55:58 And so if I see this to be much larger than it should 56:00 be when it's the supremum of a Brownian bridge, 56:02 I'm actually going to reject my hypothesis. 56:05 56:08 So here's the test. 56:09 I want to test whether H0, F is equal to F0, 56:17 and you will see that most of the goodness of fit tests 56:22 are formulated mathematically in terms 56:24 of the cumulative distribution function. 56:26 I could formulate them in terms of personality density 56:29 function, or just write x follows N 0, 1, 56:33 but that's the way we write it. 56:34 We formulate them in terms of cumulative distribution 56:37 function because that's what we have 56:39 a handle on through the empirical cumulative 56:42 distribution function. 56:44 And then it's versus H1, F is not equal to F0. 56:50 So now I have my empirical CDF. 56:52 And I hope that for all t's, Fn of t 56:54 should be close to F0 of t. 56:57 Let me write it like this. 57:00 I put it on the exponent because otherwise that 57:03 would be the empirical distribution function based 57:06 on zero observations. 57:07 57:11 Now I form the following test statistic. 57:14 57:21 So my test statistic is tn, which 57:24 is the supremum over t in the real line of square root 57:28 of n Fn of t minus F of t, sorry, F0 of t. 57:34 So I can compute everything. 57:35 I know this from the data, and this 57:37 is the one that comes from my null hypothesis. 57:39 As I can compute this thing. 57:41 And I know that if this is true, this 57:43 should actually be the supremum of a Brownian bridge. 
57:46 Pretty much. 57:48 And so the Kolmogorov-Smirnov test is simply, 58:01 reject if this guy, tn, in absolute value, 58:09 no actually not in absolute value. 58:10 This is just already absolute valued. 58:13 Then this guy should be what? 58:14 It should be larger than the q alpha over 2 distribution 58:20 that I have. 58:21 But now rather than putting N 0, 1, or Tn, 58:24 this is here whatever notation I have for supremum 58:30 of Brownian bridge. 58:31 58:40 Just like I did for any pivotal distribution. 58:43 That was the same recipe every single time. 58:45 I formed the test statistic such that 58:47 the asymptotic distribution did not depend on anything I know, 58:51 and then I would just reject when this pivotal distribution 58:54 was larger than something. 58:56 Yes? 58:56 AUDIENCE: I'm not really sure why Brownian bridge appears. 58:59 59:02 PROFESSOR: Do you know what a Brownian bridge is, or? 59:05 AUDIENCE: Only vaguely. 59:06 PROFESSOR: OK. 59:07 So this thing here, think of it as being a Gaussian. 59:14 So for all t you have a Gaussian distribution. 59:18 Now a Brownian motion, so if I had a Brownian motion 59:27 I need to tell you what the-- 59:28 so it's basically a Brownian motion 59:30 is something that looks like this. 59:31 It's some random variable that's indexed by t. 59:34 I want, say, the expectation of Xt could be equal to 0 59:38 for all t. 59:40 And what I want is that the increments 59:42 have a certain distribution. 59:44 So what I want is that the expectation of Xt minus Xs 59:53 follows some distribution which is N 0, t minus s. 59:58 So the increments are bigger as I go farther, 60:00 in terms of variability. 60:02 And I also want some covariance structure between the two. 60:05 So what I want is that the covariance between Xs and Xt 60:10 is actually equal to the minimum of s and t. 60:12 60:18 Yeah, maybe. 60:21 Yeah, that should be there. 60:23 So this is, you open a probability book, that's 60:26 what it's going to look like. 60:27 So in particular, you can see, if I put 0 here 60:31 and X0 is equal to 0, it has 0 variance. 60:34 So in particular, it means that Xt, 60:38 if I look only at the t-th one, it 60:39 has some normal distribution with variance t. 60:43 So this is something that just blows up. 60:46 So this guy here looks like it's going 60:49 to be a Brownian motion because when 60:50 I look at the left-hand side it has a normal distribution. 60:53 Now there's a bunch of other things you need to check. 60:55 It's the fact that you have this covariance, for example, 60:58 which I did not tell you. 61:00 But it sure look somewhat like that. 61:03 And in particular, when I look at the normal with mean 0 61:07 and variance here, then it's clear 61:10 that this guy does not have a variance that's 61:12 going to go to infinity just like the variance of this guy. 61:16 We know that the variance is forced to be back to 0. 61:21 And so in particular we have something 61:23 that has mean 0 always, whose variance has to be 0 at 0, 61:28 and variance-- sorry, at t equals negative infinity, 61:31 and variance 1 at t equals plus infinity. 61:34 So a variance 0 at t equals plus infinity, 61:36 and so I have to basically force it to be equal to 0 at each n. 61:40 So the Brownian motion here tends 61:42 to just go to infinity somewhere, 61:44 whereas this guy forces it to come back. 
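Here is a quick simulation sketch contrasting the two processes. The bridge is built as B(t) = W(t) − t·W(1), one standard construction of a Brownian motion conditioned to return to 0 at time 1; that particular construction is mine, not something stated in the lecture.

```python
# Standard Brownian motion W, whose variance t keeps growing, versus a
# Brownian bridge B(t) = W(t) - t*W(1), whose variance t(1 - t) is forced
# back to 0 at both endpoints.
import numpy as np

rng = np.random.default_rng(0)
reps, steps = 20_000, 100
dt = 1.0 / steps
t = np.linspace(dt, 1.0, steps)

# W(t): cumulative sums of independent N(0, dt) increments
W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(reps, steps)), axis=1)
B = W - t * W[:, -1][:, None]        # pin the path back to 0 at t = 1

for u in (0.1, 0.5, 0.9):
    i = int(u * steps) - 1
    print(f"t={u}:  Var W ~ {W[:, i].var():.3f} (exact {u:.2f}),"
          f"  Var B ~ {B[:, i].var():.3f} (exact {u*(1-u):.2f})")
```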
61:47 Now everything I described to you 61:48 is on the scale negative infinity to plus infinity, 61:52 but since everything depends on F of t, 61:56 I can actually just put that back 61:58 into a scale, which is 0 and 1 by a simple change of variable. 62:02 It's called change of time for the Brownian motion. 62:06 OK? 62:07 Yeah. 62:08 AUDIENCE: So does a Brownian bridge 62:09 have a variance at each point that's proportional? 62:13 Like it starts at 0 variance and then 62:15 goes to 1/4 variance in the middle 62:17 and then goes back to 0 variance? 62:21 Like in the same parabolic shape? 62:23 PROFESSOR: Yeah. 62:24 I mean, definitely. 62:26 I mean by symmetry you can probably infer all the things. 62:29 AUDIENCE: Well I can imagine Brownian bridge 62:31 with a variance that starts at 0 and stays, like, 62:34 the shape of the variance as you move along. 62:38 PROFESSOR: Yeah, so I don't know if-- there 62:40 is an explicit formula for this, and it's simple. 62:43 That's what I can tell you, but I don't know what the explicit, 62:45 off the top of my head what the explicit formula is. 62:47 AUDIENCE: But would it have to match this F 62:49 of t 1 minus F of t structure? 62:53 Or not? 62:53 PROFESSOR: Yeah. 62:54 62:56 AUDIENCE: Or does the fact that we're taking the supremum-- 62:58 PROFESSOR: No. 62:59 Well the Brownian bridge, this is the supremum-- you're right. 63:03 So this will be this form for the variance for sure, 63:06 because this is only marginal distributions that 63:08 don't take-- right, the process is not just 63:10 what is the distribution at each instant t. 63:13 It's also how do those distributions interact 63:15 with each other in terms of covariance. 63:17 For the marginal distributions at each instance t, 63:19 you're right, the variance is F of t 1 minus F of t. 63:22 We're not going to escape that. 63:25 But then the covariance structure between those guys 63:27 is a little more complicated. 63:29 But yes, you're right. 63:30 For marginal that's enough. 63:32 Yeah? 63:32 AUDIENCE: So the supremum of the Brownian bridge 63:34 is a number between 0 and 10, let's just say. 63:38 PROFESSOR: Yeah, it could be infinity. 63:40 AUDIENCE: So it's not symmetrical with respect to 0, 63:43 so why are we doing all over 2? 63:45 63:56 PROFESSOR: OK. 63:57 Did say raise it? 63:58 Yeah. 63:59 Because here I didn't say the supremum of the absolute value 64:01 of a Brownian bridge, I just said the supremum 64:03 of a Brownian bridge. 64:04 But you're right, let's just do this like that. 64:08 And then it's probably cleaner. 64:11 64:14 So yeah, actually well it should be q alpha. 64:17 So this is basically, you're right. 64:19 So think of it as being one-sided. 64:22 And there's actually no symmetry for the supremum. 64:25 I mean the supremum is not symmetric around 0, 64:29 so you're right. 64:29 I should not use alpha over 2, thank you. 64:33 Any other question? 64:35 This should be alpha. 64:36 Yeah. 64:37 I mean those slides were written with 1 minus alpha 64:39 and I have not replaced all instances of 1 minus alpha 64:42 by alpha. 64:43 I mean, except this guy, tilde. 64:45 Well, depends on how you want to call it. 64:47 But this is still, the probability that Z exceeds 64:50 this guy should be alpha. 64:53 OK? 64:54 And this can be found in tables. 64:55 And we can compute the p-value just like we did before. 
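A simulation sketch, again my own illustration, of where such tables and p-values come from: draw samples under H0, compute T_n = sup_t √n |F_n(t) − F0(t)| for each, and read off the 95% quantile. For large n this approaches the corresponding quantile of the supremum of the absolute Brownian bridge, which scipy exposes as scipy.stats.kstwobign; evaluating the supremum at the jump points anticipates the computational trick discussed just below.

```python
# Null distribution of T_n = sup_t sqrt(n) |F_n(t) - F0(t)| by simulation,
# for F0 the standard normal CDF, compared with the asymptotic quantile of
# the supremum of the absolute Brownian bridge (scipy.stats.kstwobign).
import numpy as np
from scipy.stats import norm, kstwobign

rng = np.random.default_rng(0)

def ks_null_quantile(n, reps=10_000, alpha=0.05):
    x = np.sort(rng.standard_normal((reps, n)), axis=1)
    F0 = norm.cdf(x)
    i = np.arange(1, n + 1)
    # the sup is attained just before or at one of the jumps of F_n
    D = np.maximum(i / n - F0, F0 - (i - 1) / n).max(axis=1)
    return np.quantile(np.sqrt(n) * D, 1 - alpha)

print("n = 9   :", ks_null_quantile(9))      # compare with the table row for n = 9
print("n = 500 :", ks_null_quantile(500))
print("limit   :", kstwobign.ppf(0.95))      # ~1.358, sup of |Brownian bridge|
```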
65:00 But we have to simulate it because it's not 65:02 going to depend on the cumulative distribution 65:04 function of a Gaussian, like it did for the usual Gaussian 65:06 test. 65:07 That's something that's more complicated, 65:09 and typically you don't even try. 65:11 You get the statistical software to do it for you. 65:14 So just let me skip a few lines. 65:17 This is what the table looks like for the Kolmogorov-Smirnov 65:20 test. 65:21 So it just tells you, what is your number of observations, n. 65:25 Then you want alpha to be equal to 5%, say. 65:28 Let's say you have nine observations. 65:30 So if the supremum over t of the absolute value of Fn of t minus F of t 65:34 exceeds this number, you reject. 65:36 65:46 Well, what's pretty clear about this test 65:47 is that it looks very nice, and I tell 65:49 you this is how you build it. 65:50 But if you think about it for one second, 65:52 it's actually really an annoying thing 65:54 to build because you have to take the supremum over t. 65:57 This depends on computing a supremum, which in practice 66:01 might be super cumbersome. 66:03 I don't want to have to compute this for all values t 66:05 and then take the maximum of those guys. 66:07 It turns out, quite nicely, that we 66:09 don't actually have to do this. 66:11 What does the empirical distribution function 66:14 look like? 66:15 Well, this thing, remember, Fn of t by definition was-- 66:23 so let me go to the slide that's relevant. 66:25 So Fn of t looks like this. 66:27 66:38 So what it means is that when t is between two observations, 66:41 this guy keeps the same value. 66:44 So let me put my observations on the real line here. 66:48 So let's say I have one observation here, 66:49 one observation here, one observation here, 66:51 one observation here, and one observation here, 66:53 for simplicity. 66:55 Then this guy is basically, up to this normalization, 66:57 counting how many observations I have that are less than or equal to t. 67:01 So since I normalize by n, I know that the smallest value 67:05 here is going to be 0, and the largest value here 67:10 is going to be 1. 67:13 So let's say this looks like this. 67:14 This is the value 1. 67:18 At each observation, since I take it less than or equal to, 67:21 when I'm at Xi, I'm actually counting it. 67:24 So the jump happens at Xi. 67:26 So that's the first observation, and then I jump. 67:29 By how much do I jump? 67:30 67:33 Yeah? 67:35 One over n, right? 67:38 And then this value belongs to the piece on the right. 67:41 And then I do it again. 67:42 67:50 I know it's not going to work out for me, but we'll see. 67:54 Oh no actually, I did pretty well. 67:55 68:00 This is what my cumulative distribution function looks like. 68:04 Now if you look on this slide, there 68:05 is this weird notation where I start putting 68:07 my indices in parentheses. 68:10 X parenthesis 1, X parenthesis 2, et cetera. 68:13 Those are called the order statistics. 68:15 It's just because, when my data is given 68:18 to me, I just call the first observation 68:20 the one that's on top of the table, 68:21 but it doesn't have to be the smallest value. 68:24 So it might be that this is X1 and that this is X2, 68:28 and then this is X3, X4, and X5. 68:31 These might be my observations. 68:33 So what I do is I relabel them in such a way 68:35 that I call this guy X parenthesis 1, 68:38 even though it's really just X3. 68:40 This is X parenthesis 2, then 3, 4, and 5.
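Here is a minimal sketch of the empirical CDF just described, in Python; the sample values and variable names are made up purely for illustration.

import numpy as np

def empirical_cdf(x, t):
    """Value of F_n at the point t: the fraction of observations less than or equal to t."""
    x = np.asarray(x)
    return np.mean(x <= t)

# five observations, as in the picture on the board (values invented for the example)
x = np.array([0.3, -1.2, 0.7, 2.1, -0.4])

# F_n is a step function: it is 0 to the left of the smallest observation,
# 1 to the right of the largest one, and it jumps by 1/n at each observation.
for t in [-2.0, -0.4, 0.0, 0.7, 3.0]:
    print(t, empirical_cdf(x, t))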
68:46 These are my reordered observations, 68:48 relabeled in such a way that the smallest one is indexed by 1 68:52 and the largest one is indexed by n. 68:54 68:58 So now this is actually quite nice, 69:01 because what I'm trying to do is to find the largest 69:04 deviation from this guy to the true cumulative distribution 69:07 function. 69:07 The true cumulative distribution function, 69:09 let's say it's Gaussian, looks like this. 69:11 69:15 It's something continuous; for a symmetric distribution 69:19 it crosses this axis at 1/2, and that's what it looks like. 69:22 And the Kolmogorov-Smirnov test is just 69:25 telling me, how far do those two curves get 69:31 in the worst possible case? 69:35 So in particular here, where are they the farthest? 69:37 Clearly it's this point. 69:40 And so up to rescaling, this is the value 69:42 I'm going to be interested in. 69:44 That's how far they get from each other. 69:49 Here, something just happened, right? 69:52 The farthest distance that I got was exactly 69:54 at one of those dots. 69:55 69:58 It turns out it is enough to look at those dots. 70:01 And the reason is, well, because after this dot 70:04 and until the next jump, this guy does not change, 70:08 but this guy increases. 70:11 And so the only point where they can be the farthest apart 70:15 is either to the left of a jump or to the right of a jump. 70:19 That's the only place where they can be far from each other. 70:22 And that means it has to be at one of the observations. 70:24 Everybody sees that? 70:26 The points at which those two curves are 70:29 the farthest from each other have 70:31 to be at one of the observations. 70:34 And so rather than looking at a sup over all possible t's, 70:37 really all I need to do is to look at a maximum 70:40 only at my observations. 70:43 70:46 I just need to check at each of those points 70:48 whether they're far. 70:51 Now here, notice that 70:53 this is not written Fn of Xi. 70:57 The reason is because I actually know what Fn of Xi is. 71:01 Fn of the i-th ordered observation is just 71:05 the number of jumps I've had up to this observation, divided by n. 71:08 So here, I know that the value of Fn is 1 over n, 71:11 here it's 2 over n, 3 over n, 4 over n, 5 over n. 71:15 So I know that the values of Fn at my observations, 71:19 and those are actually the only values that Fn can take, 71:22 are an integer divided by n. 71:25 And that's why you see i minus 1 over n, or i over n. 71:29 This is the difference just before the jump, 71:32 and this is the difference at the jump. 71:34 71:38 So here the key message is that this is no longer 71:42 a supremum over all t's, but just 71:44 a maximum over i from 1 to n. 71:46 So I really have only 2n values to compute. 71:49 This value and this value for each observation, that's 2n 71:51 total. 71:52 I look at the maximum and that's the value. 71:55 And it's actually equal to Tn. 71:58 It's not an approximation. 71:59 Those things are equal. 72:00 Those are just the only places where 72:02 those guys can attain the maximum. 72:03 72:09 Yes? 72:10 AUDIENCE: It seems like since the null hypothesis [INAUDIBLE] 72:15 the entire distribution of theta, 72:17 this is like strictly more powerful than just 72:19 doing it [INAUDIBLE]. 72:23 PROFESSOR: It's strictly less powerful. 72:24 AUDIENCE: Strictly less powerful. 72:27 But is there, is that like a big trade-off 72:30 that we're making when we do that?
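The reduction just described translates directly into code. A small sketch, assuming for illustration that F0 is the standard normal CDF from scipy; the data and the sample size are arbitrary.

import numpy as np
from scipy.stats import norm

def ks_statistic(x, cdf):
    """sup_t |F_n(t) - F_0(t)|, computed only at the order statistics.

    At the i-th ordered observation the two candidate gaps are
    |i/n - F_0(x_(i))| (just after the jump) and |(i-1)/n - F_0(x_(i))| (just before it).
    """
    x = np.sort(np.asarray(x))      # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    f0 = cdf(x)                     # F_0 evaluated at the order statistics
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - f0), np.abs((i - 1) / n - f0)))

rng = np.random.default_rng(0)
x = rng.normal(size=20)
d_n = ks_statistic(x, norm.cdf)            # H_0: the data are N(0, 1)
print(d_n, np.sqrt(len(x)) * d_n)          # unscaled and sqrt(n)-scaled versions

The maximum over these 2n numbers is exactly the supremum over all t; no grid over t is ever needed.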
72:32 Obviously we're not certain in the first place 72:33 that we want to assume normality. 72:35 Does it make sense to [INAUDIBLE], 72:37 the Gaussian [INAUDIBLE]. 72:39 72:48 PROFESSOR: So, sorry, I'm not sure what 72:50 question you're asking. 72:51 AUDIENCE: So when we're doing a normal test, 72:53 we're just asking questions about the mus, 72:55 the means of our distribution. 72:57 [INAUDIBLE] This one, it seems like it 73:00 would be both at the same time. 73:02 [INAUDIBLE] Is this decreasing power [INAUDIBLE]? 73:11 PROFESSOR: So remember, here in this test 73:13 we want to conclude to H0; in the other test we typically 73:16 want to conclude to H1. 73:17 So here we actually don't want power, in a way. 73:21 And you have to also keep in mind that doing a test on the mean 73:24 is probably not the only thing you're 73:26 going to end up doing on your data 73:27 after you actually establish that it's normally distributed. 73:31 Then you have the dataset, you've sort of 73:33 established it's normally distributed, 73:34 and then you can just run the whole arsenal of statistical tools. 73:38 And we're going to see regression 73:39 and all sorts of predictive things, which are not just 73:42 tests of whether the mean is equal to something. 73:44 Maybe you want to build a confidence interval 73:45 for the mean. 73:46 A confidence interval is not a test. 73:50 So you're going to have to first test if it's normal, 73:52 and then see if you can actually use 73:53 the quantiles of a Gaussian distribution or a t 73:55 distribution to build this confidence interval. 73:59 So in a way you should see this as, like, the flat fee 74:03 to enter the Gaussian world, and then you 74:05 can do whatever you want to do in the Gaussian world. 74:09 We'll see actually that your question goes back 74:11 to something that's a little important, which is that here 74:14 I said F0 is fully specified. 74:17 It's like an N(1, 5). 74:21 But I didn't say, is it normally distributed, 74:24 which is the question that everybody asks. 74:26 You're not asking, is it this particular normal distribution 74:29 with this particular mean and this particular variance. 74:31 So how would you do it in practice? 74:32 Well you would say, I'm just going 74:34 to replace the mean by the empirical mean and the variance 74:36 by the empirical variance. 74:38 But by doing that you're making a huge mistake, because you 74:41 are sort of depriving your test of the possibility 74:45 to reject the Gaussian hypothesis just 74:46 based on the fact that the mean is wrong or the variance 74:49 is wrong. 74:49 You've already stuck pretty close to your data. 74:52 And so you're sort of already 74:55 tilting the game in favor of H0 big time. 74:59 So there's actually a way to adjust for this. 75:01 75:03 OK, so this is about the pivotal statistic. 75:05 We've used this word many times. 75:06 75:09 And so that's how it goes. 75:12 I'm not going to go into this in detail. 75:13 It's really a recipe for how you would actually 75:16 build the table that I showed you, this table. 75:20 This is basically the recipe on how to build it. 75:23 There's another recipe to build it, which is just 75:25 to open a book at this page. 75:27 That's a little faster. 75:29 Or use software. 75:32 I just wanted to show you. 75:34 So let's just keep in mind-- anybody have a good memory? 75:36 Let's just keep in mind this number. 75:38 This is the threshold for the Kolmogorov-Smirnov statistic. 75:44 If I have 10 observations and I want to do it at 5%, 75:47 it's about 41%.
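The recipe for building that table can be carried out with a short simulation: under H0 with a continuous F0, the values F0(Xi) are uniform on [0, 1], so the null distribution of the statistic can be simulated once and for all from uniforms. A sketch, with the sample size and number of replications chosen arbitrarily:

import numpy as np

def ks_stat_uniform(u):
    """sup_t |F_n(t) - t| for a uniform sample, computed via the order statistics."""
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - u), np.abs((i - 1) / n - u)))

rng = np.random.default_rng(0)
n, reps = 10, 100_000
sims = np.array([ks_stat_uniform(rng.uniform(size=n)) for _ in range(reps)])
print(np.quantile(sims, 0.95))    # should land close to the 0.41 threshold quoted here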
75:50 So that's the number that it should be larger than. 75:52 So it turns out that if you want to test if it's normal, and not 75:56 just a specific normal, this number 75:59 is going to be different. 76:00 Do you think the number I'm going 76:01 to read in a table that's appropriate for this is 76:03 going to be larger or smaller? 76:05 Who says larger? 76:07 AUDIENCE: Sorry, what was the question? 76:09 PROFESSOR: So the question is, this 76:10 is the number I should use if my test was, is X, say, N(0, 5). 76:20 Right? 76:20 That's a specific distribution with a specific F0. 76:25 So that's the number. I would build 76:27 the Kolmogorov-Smirnov statistic from this F0, 76:29 I would perform the test and check if my Kolmogorov-Smirnov 76:32 statistic Tn is larger than this number or not. 76:34 If it's larger I'm going to reject. 76:36 Now I say, actually, I don't want to test if H0 is N(0, 5), 76:40 but just whether it's N(mu, sigma squared) for some mu and sigma squared. 76:47 And in particular I'm just going to plug in mu hat and sigma 76:50 hat into my F0, run the same statistic, 76:52 but compare it to a different number. 76:56 So the larger the number, the more or less 77:00 likely am I to reject? 77:03 The less likely I am to reject, right? 77:05 So if I just use that number, let's say 77:09 this is a large number, I would be more 77:12 tempted to say it's Gaussian. 77:14 And if you look at the table, you would 77:15 see that if you make the appropriate correction, 77:18 with the same number of observations, 10, 77:21 and the same level, you get 25% as opposed to 41%. 77:26 (A small simulation sketch of this gap appears at the end of this passage.) That means that you're actually much more likely, if you 77:28 use the appropriate test, to reject the hypothesis that it's 77:32 normal, which is bad news, because that means 77:34 you don't have access to the Gaussian arsenal, 77:36 and nobody wants to do this. 77:38 So this is actually a mistake that people make a lot. 77:40 They use the Kolmogorov-Smirnov test 77:42 to test for normality without adjusting for the fact 77:45 that they've plugged in the estimated mean 77:48 and the estimated variance. 77:50 This leads to rejecting less often, right? 77:53 I mean this is almost half of the number that we had. 77:58 And then they can be happy and walk home 78:00 and say, well, I did the test and it was normal. 78:03 So this is actually a mistake that I 78:04 believe genuinely at least a quarter of people 78:07 make on purpose. 78:09 They just say, well, I want it to be Gaussian so I'm just 78:11 going to make my life easier. 78:13 So this is the so-called Kolmogorov-Lilliefors test. 78:17 We'll talk about it, well, not today for sure. 78:20 There are other statistics that you can use. 78:24 And the idea is to say, well, we want 78:26 to know if the empirical distribution 78:28 function, the empirical CDF, is close to the true CDF. 78:31 The way we did it is by forming the difference 78:33 and looking at the worst possible distance they can be apart. 78:36 That's called a sup norm, or L infinity norm, 78:39 in functional analysis. 78:42 So here, this is what it looked like. 78:44 The distance between Fn and F that we measured was just 78:46 the supremum distance over all t's. 78:48 That's one way to measure distance between two functions. 78:51 But there are infinitely many ways 78:53 to measure the distance between functions. 78:54 One is something we're much more familiar with, 78:56 which is the squared L2-norm. 78:59 This is nice because it comes with an inner product; 79:02 it has some nice properties.
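Going back for a moment to the plug-in issue: the same simulation recipe, with the mean and variance re-estimated from each simulated sample, shows how much the threshold shrinks. This is only a sketch, and the exact critical value depends on the conventions of the table, but for n = 10 at 5% it should come out in the vicinity of the 0.25 quoted above rather than 0.41.

import numpy as np
from scipy.stats import norm

def ks_stat_plugin(x):
    """sup_t |F_n(t) - Phi((t - xbar)/s)|, with mean and standard deviation estimated from the data."""
    x = np.sort(x)
    n = len(x)
    f0 = norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - f0), np.abs((i - 1) / n - f0)))

rng = np.random.default_rng(0)
n, reps = 10, 100_000
sims = np.array([ks_stat_plugin(rng.normal(size=n)) for _ in range(reps)])
print(np.quantile(sims, 0.95))    # noticeably smaller than 0.41, around the 0.25-0.26 range

Comparing the plug-in statistic to the 0.41 threshold instead of this smaller one is exactly the mistake described above: it makes rejecting normality much harder than it should be.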
79:04 And rather than taking the sup, 79:06 you could just integrate the squared distance. 79:10 And this is what leads to the Cramér-von Mises test. 79:14 And then there's another one that 79:15 says, well, maybe I don't want to integrate without weights. 79:18 Maybe I want to put weights that account for the variance. 79:22 And this guy is called Anderson-Darling. 79:24 For each of these tests you can check 79:26 that the asymptotic distribution is going to be pivotal, 79:29 which means that there will be a table at the back of some book 79:32 that tells you what the quantiles 79:37 of square root of n times this guy 79:38 are, asymptotically, basically. 79:40 Yeah? 79:41 AUDIENCE: For the Kolmogorov-Smirnov test, 79:44 for the table that shows the value it has, 79:48 it has the value for different n. 79:51 But I thought we [INAUDIBLE]-- 79:53 PROFESSOR: Yeah. 79:54 So that's just to show you that asymptotically it's pivotal, 79:56 and I can point you to one specific thing. 79:59 But it turns out that this thing is actually pivotal for each n as well. 80:02 And that's why you have this recipe to construct the entire 80:05 table: the asymptotic distribution is not accurate for all possible n's, 80:08 and there's also the n that shows up here. 80:10 So actually, this is something 80:13 you should have in mind. 80:14 80:18 For any particular n, this distribution 80:20 will not depend on F0. 80:24 It's just not going to be a Brownian bridge 80:25 but a finite-sample approximation of a Brownian 80:28 bridge, and you can simulate that by just drawing samples 80:31 from it, building a histogram, and constructing 80:33 the quantiles for this guy. 80:35 AUDIENCE: No one has actually developed 80:36 a table for the Brownian-- 80:38 PROFESSOR: Oh, there is one. 80:39 That's the table, maybe. 80:42 Let's see if we see it at the bottom of the other table. 80:46 Yeah. 80:47 See? 80:47 Over 40, over 30. 80:48 So this is not the Kolmogorov-Smirnov, 80:50 but that's the Kolmogorov-Lilliefors. 80:52 Those numbers that you see here, they 80:54 are the numbers for the asymptotic thing, which is 80:57 some sort of Brownian bridge. 80:59 Yeah? 81:00 AUDIENCE: Two questions. 81:01 If I want to build the Kolmogorov-Smirnov test, 81:03 it says that F0 is required to be continuous. 81:08 PROFESSOR: Yeah. 81:10 AUDIENCE: [INAUDIBLE] If we have, like, probability 81:13 mass at a particular value. 81:15 Like some sort of data. 81:18 PROFESSOR: So then you won't have this nice picture, right? 81:20 This can happen at any point because you're 81:22 going to have discontinuities in F 81:24 and those things can happen anywhere. 81:26 And then-- 81:27 AUDIENCE: Would the supremum still work? 81:29 PROFESSOR: You mean the Brownian bridge? 81:30 AUDIENCE: Yeah. 81:32 The Kolmogorov test doesn't say that you 81:35 have to be able to easily calculate the supremum. 81:37 PROFESSOR: No, no, no, but you still need it. 81:39 You still need it for-- 81:40 so there are some finite-sample versions of it 81:42 that you can use that are slightly more conservative, 81:45 which is in a way good news because you're 81:47 going to conclude more often to H0. 81:50 And there is one, I forget the name, 81:52 it's Kiefer-Wolfowitz, no, the Dvoretzky-Kiefer-Wolfowitz 81:57 inequality, which is basically like Hoeffding's inequality.
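For reference, the usual written forms of the two statistics and of the inequality just named; the notation here is not taken from the slides.

\[
\text{Cram\'er-von Mises:} \quad n \int \bigl(F_n(t) - F_0(t)\bigr)^2 \, dF_0(t),
\qquad
\text{Anderson-Darling:} \quad n \int \frac{\bigl(F_n(t) - F_0(t)\bigr)^2}{F_0(t)\bigl(1 - F_0(t)\bigr)} \, dF_0(t),
\]
\[
\text{Dvoretzky-Kiefer-Wolfowitz:} \quad
\mathbb{P}\Bigl(\sup_t \bigl|F_n(t) - F(t)\bigr| > \varepsilon\Bigr) \le 2\, e^{-2 n \varepsilon^2}
\quad \text{for every } n \text{ and every } \varepsilon > 0 .
\]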
81:59 So it basically tells you, up to bad constants, 82:01 the same result as the Brownian bridge result, 82:04 and it is true for every n. 82:06 But for the exact asymptotic distribution, 82:08 you need continuity. 82:11 Yes. 82:12 AUDIENCE: So just a clarification. 82:13 So when we are testing with Kolmogorov, 82:15 we shouldn't test a particular mu and sigma squared? 82:19 PROFESSOR: Well, if you know what they are you can use 82:22 Kolmogorov-Smirnov, but if you don't know what they are 82:25 you're going to plug in-- 82:26 as soon as you estimate 82:27 the mean and the variance from the data, 82:29 you should use the one we'll see next time, which is 82:31 called Kolmogorov-Lilliefors. 82:33 You don't have to think about it too much. 82:34 We'll talk about it on Thursday. 82:38 Any other questions? 82:39 So we're out of time. 82:40 So I think we should stop here, and we'll resume on Thursday. 82:45