Transcript: https://www.youtube.com/watch?v=C_W1adH-NVE&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=2

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: --of our limiting distribution, which happened to be Gaussian. But if the central limit theorem told us that the limiting distribution of some average was something that looked like a Poisson or an [? exponential, ?] then we would just have, in the same way, taken the quantiles of the exponential distribution.

So let's go back to what we had. Generically, you have a set of observations X1 to Xn. Remember, for the kiss example they were denoted by R1 to Rn, because they indicated turning the head to the right, but let's just go back to X1 to Xn. In this case I'm going to assume they're IID, and I'm going to make them Bernoulli with [INAUDIBLE] p, where p is unknown.

So what did we do from here? Well, we said p is the expectation of Xi, and actually we didn't even think about it too much. We said, well, if I need to estimate the proportion of people who turn their head to the right when they kiss, I'm basically going to compute the average. So our p hat was just Xn bar, which was 1 over n times the sum from i equals 1 to n of the Xi. The average of the observations was our estimate. And then we wanted to build some confidence intervals around this. So what we wanted to understand is how much this p hat fluctuates. This is a random variable. It's an average of random variables.
It's a random variable, so we want to know what its distribution is. And if we know what the distribution is, then we actually know where it fluctuates: what the expectation is, around which value it tends to fluctuate, et cetera. And what the central limit theorem told us was: if I take square root of n times Xn bar minus p, which is its mean, and then divide by the standard deviation, then this thing converges as n goes to infinity -- and we will say a little bit more about what "in distribution" means -- to some standard normal random variable. So that was the central limit theorem.

So what it means is that when I think of this as a random variable, when n is large enough it's going to look like a standard Gaussian. And so I understand perfectly its fluctuations. I know the probability of being in any given zone. I know that its mean is 0. I know a bunch of things. And then, in particular, what I was interested in was the probability that the absolute value of a Gaussian random variable exceeds q alpha over 2. We said that this was equal to what? Anybody? What was that?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Alpha, right? So that's the probability. That's my random variable. By definition, q alpha over 2 is the number such that the area to the right of it is alpha over 2, and the point to its left, by symmetry, is negative q alpha over 2. And so the probability that it exceeds this value q alpha over 2 in absolute value is just the sum of the two gray areas. All right?
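As a sanity check on this use of the CLT, here is a minimal simulation sketch (Python standard library only; the values p = 0.3, n = 200, alpha = 0.05, and 20,000 repetitions are illustrative choices, not numbers from the lecture): the standardized average should exceed q alpha over 2 in absolute value with probability close to alpha.

```python
import math
import random
from statistics import NormalDist

def standardized_mean(p, n, rng):
    """sqrt(n) * (Xn bar - p) / sqrt(p(1-p)) for n iid Bernoulli(p) draws."""
    xbar = sum(rng.random() < p for _ in range(n)) / n
    return math.sqrt(n) * (xbar - p) / math.sqrt(p * (1 - p))

def tail_frequency(p=0.3, n=200, alpha=0.05, reps=20000, seed=0):
    """Empirical P(|pivot| > q_{alpha/2}); the CLT says this is close to alpha."""
    rng = random.Random(seed)
    q = NormalDist().inv_cdf(1 - alpha / 2)  # area alpha/2 to the right of q
    return sum(abs(standardized_mean(p, n, rng)) > q for _ in range(reps)) / reps
```

With these illustrative values the empirical frequency lands near 0.05, matching the two-gray-areas picture on the board.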
So then I said that this was approximately equal, due to the central limit theorem, to the probability that the absolute value of square root of n times Xn bar minus p, divided by square root of p times 1 minus p, is larger than q alpha over 2. Well, then this thing, by virtue of the central limit theorem, is approximately equal to alpha. And then we just said, well, let's solve for p. Has anyone attempted to solve the degree-two equation for p in the homework? Everybody has tried it?

So essentially, this is going to be an equation in p. Sometimes we don't want to solve it, and some of the p's we will replace by their worst possible value. For example, we said one of the tricks we had was that this value here, square root of p times 1 minus p, is always less than one half. That way we could actually get a confidence interval that was larger than all possible confidence intervals for all possible values of p. But we could also solve for p. Do we all agree on the principle of what we did? So that's how you build confidence intervals.

Now let's step back for a second and see what was important in the building of this confidence interval. The really key thing is that I didn't tell you why I formed this thing, right? We started from X bar, and then I took some weird function of X bar that depended on p and n. And the reason is that when I take this function, the central limit theorem tells me that it converges to something that I know. But the very important thing about the something that I know is that it does not depend on anything that I don't know. For example, if I forgot to divide by square root of p times 1 minus p, then this thing would have had a variance, which is p times 1 minus p.
If I didn't remove this p here, the mean would have been affected by p. And there's no table for a normal with mean p and variance 1. Yes?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, where the square root of n terms come from. So really you should view it like this. There's a sort of quiet rule in math that you don't write a divided by b over c; you write c times a divided by b, because it looks nicer. But the way you want to think about this is as Xn bar minus p, divided by the square root of p times 1 minus p over n. And the reason is that this is actually the standard deviation of this guy -- oh sorry, of Xn bar. This is actually the standard deviation of this guy, and the square root of n comes from the [INAUDIBLE] average.

So the key thing was that this limiting distribution did not depend on anything I don't know. And this is actually called a pivotal distribution. It's pivotal. I don't need to know anything, and I can read it in a table. Sometimes there are going to be complicated things, but now we have computers. The beauty about Gaussians is that people have studied them to death, and you can open any stats textbook and you will find a table that tells you, for each value of alpha you're interested in, what q alpha over 2 is. But there might be some crazy distributions, and as long as they don't depend on anything unknown, we might actually be able to simulate from them, and in particular compute what q alpha over 2 is for any possible value [INAUDIBLE]. And so that's what we're going to be trying to do: finding pivotal distributions.
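The ways of handling the unknown p discussed above -- bounding sqrt(p(1-p)) by 1/2, plugging in p hat, or actually solving the degree-two equation in p (whose exact solution is what is usually called the Wilson interval) -- can be sketched as follows. This is a hedged illustration in Python, not code from the course; the example values xbar = 0.6, n = 100 at the end are purely hypothetical.

```python
import math
from statistics import NormalDist

def bernoulli_cis(xbar, n, alpha=0.05):
    """Three 1 - alpha confidence intervals for a Bernoulli parameter p."""
    q = NormalDist().inv_cdf(1 - alpha / 2)
    # (a) conservative: bound sqrt(p(1-p)) by its maximum value 1/2
    cons = (xbar - q / (2 * math.sqrt(n)), xbar + q / (2 * math.sqrt(n)))
    # (b) plug-in: replace p by p-hat inside the standard deviation
    s = math.sqrt(xbar * (1 - xbar) / n)
    plug = (xbar - q * s, xbar + q * s)
    # (c) solve the degree-two equation in p exactly (Wilson interval)
    center = (xbar + q**2 / (2 * n)) / (1 + q**2 / n)
    half = q * math.sqrt(xbar * (1 - xbar) / n + q**2 / (4 * n**2)) / (1 + q**2 / n)
    solve = (center - half, center + half)
    return cons, plug, solve

cons, plug, solve = bernoulli_cis(0.6, 100)  # hypothetical data
```

For moderate n the three intervals are close; the conservative one is always the widest, which is exactly the price of the worst-case bound.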
How do we take this Xn bar, which is a good estimator, and turn it into something whose distribution, exactly or asymptotically, does not depend on any unknown parameter? So here is one way we can actually-- so that's what we did for the kiss example, right? And I mentioned, for example, the extreme case when n was equal to 3: we would get a different thing, because there the CLT would not be valid. And what that means is that my pivotal distribution is actually not the normal distribution; it might be something else. And I said we can make exact computations. Well, let's see what it is. If I have three observations, X1, X2, X3, I take the average of those guys. OK, so that's my estimator. How many values can this guy take?

It's a little bit of counting. Four values. How did you get to that number? OK, so each of these guys can take value 0 or 1, right? So to count the number of values the average can take -- it's a little annoying, because I have to sum them, right? So basically, I have to count the number of 1's. So how many 1's can I get? Let's look at that. We get 0, 0, 0. Then 0, 0, 1 -- and there are basically three outcomes that have just one 1 in there, right? So there's three of them. How many have exactly two 1's? Three, right? It's just this guy where I swap the 0's and the 1's. OK, so in terms of the sum: one outcome has sum 0, three have sum 1, and three have sum 2.
OK, so everybody sees what I'm missing here: just the outcome where I replace all the 0's by 1's, which has sum 3. So the number of values that this thing can take is 1, 2, 3, 4. So someone is counting much faster than me. And those counts, you've probably seen them before, right? 1, 3, 3, 1, remember? And so essentially this guy takes four possible values: 0, 1/3, 2/3, and 1 -- which is probably much easier to count like that. And so now all I have to tell you, if I want to describe the distribution of this random variable, is the probability that it takes each of these values: the probability that X bar 3 takes the value 0, the probability that X bar 3 takes the value 1/3, et cetera. If I give you each of these probabilities, then you will know exactly what the distribution is, and hopefully maybe be able to turn it into something you can compute.

Now the thing is that those probabilities will actually depend on the unknown p. What is the probability that X bar 3 is equal to 0, for example? I'm sorry?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, OK. So let's write it without making the computation. So 1/8 is probably not the right answer, right? For example, if p is equal to 0, what is this probability? 1. If p is 1, what is this probability? 0. So it will depend on p. The probability that this thing is equal to 0 is just the probability that all three of those guys are equal to 0: the probability that X1 is equal to 0, and X2 is equal to 0, and X3 is equal to 0.
Now my variables are independent, so I can do what I actually want to do, which is to say the probability of the intersection is the product of the probabilities, right? So it's just the probability that any one of them is equal to 0, to the power of 3. And the probability that one of them is equal to 0 is just 1 minus p. And then for the next value it's more complicated, because I have to decide which one it is. But those things are just the probabilities of some binomial random variable, right? If I look at X bar 3 and multiply it by 3, it's just the sum of independent Bernoulli's with parameter p. So this is actually a binomial with parameters 3 and p. And there are tables for binomials that tell you all this.

Now, the thing is I want to invert this guy somehow. This thing depends on p. I don't like it, so I'm going to have to find ways to deal with this dependence on p, and I could make all these nasty computations and spend hours doing it. But there are tricks to get around this. There are upper bounds. Just like we said: maybe I don't want to solve the second-degree equation in p, because it's just going to capture maybe smaller-order terms, right? Things that maybe won't make a huge difference numerically. You can check that in your problem set one. Does it make a huge difference numerically to solve the second-degree equation, or to just use the [INAUDIBLE] p times 1 minus p, or even to plug in p hat instead of p? Problem set one is there to make sure that you see what magnitude of change you get by switching from one method to the other.
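The counting argument above -- 1, 3, 3, 1 outcomes, with 3 times X bar 3 distributed as Binomial(3, p) -- can be written out directly. A short Python sketch (an illustration, not anything assigned in the course):

```python
from math import comb

def pmf_xbar3(p):
    """P(Xbar_3 = k/3), keyed by k = 0, 1, 2, 3, since 3 * Xbar_3 ~ Binomial(3, p)."""
    return {k: comb(3, k) * p**k * (1 - p)**(3 - k) for k in range(4)}
```

Here comb(3, k) gives exactly the counts 1, 3, 3, 1 from the board, and the edge cases discussed come out right: p = 0 gives probability 1 to the all-zeros outcome, and p = 1 gives it probability 0.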
So what I wanted to get to is something where we can use a tool that's just a little more brute force. So here is Hoeffding's inequality. We saw that; it's what we finished on last time. Hoeffding's inequality is actually one of the most useful inequalities. If any one of you is doing anything related to algorithms, you've seen this inequality before. It's extremely convenient: it tells you something about bounded random variables, and in algorithms things are typically bounded. And that's the case for Bernoulli random variables, right? They're bounded between 0 and 1. So when I apply Hoeffding's inequality, what it's telling me is: for any given epsilon, what is the probability that Xn bar goes away from its expectation by more than epsilon? And we saw that this probability decreases somewhat similarly to a Gaussian tail.

So essentially what Hoeffding's inequality is telling me is this picture: when I have a Gaussian with mean mu, I know what it looks like. Hoeffding's inequality says that if I take the average of some bounded random variables, then their probability density function -- or maybe mass function; this thing might not even have [INAUDIBLE] a density, but let's think of it as a density just for simplicity -- is going to be something that looks like this. Well, sometimes it will have to escape a little, just for the sake of having integral 1. But it's essentially telling me that those guys stay below those guys.
The probability that Xn bar exceeds mu is bounded by something that decays like the tail of a Gaussian. So really that's the picture you should have in mind. When I average bounded random variables, I actually get something that might be really rugged. It might not be smooth like a Gaussian, but I know that it's always bounded by a Gaussian. And what's nice is that when I start computing the probability of exceeding some number -- say with probability alpha over 2 -- then I can actually get a number. So this one here is the q alpha over 2 for the Gaussian, and that one is -- I don't know, the r alpha over 2 for this [? Bernoulli ?] random variable. Like a q prime, a different q.

So I can actually do this without taking any limits, right? This is valid for any n. I don't need to go to infinity. Now this seems a bit magical. We discussed last time that we wanted n to be larger than 30 for the central limit theorem to kick in, and this one seems to tell me I can do it for any n. But there is a price to pay: I pick up this 2 over b minus a squared in the exponent. That's like the variance of the Gaussian that I have, right? Sort of. It's telling me what the variance should be, and it's actually not as nice: I pick up a factor 4 compared to the Gaussian that I would get otherwise. So let's try to solve it for our case. So I just told you: try it. Did anybody try to do it? So we started from this last time, right?
And the reason was that we could say that the probability that this thing exceeds q alpha over 2 is alpha. So that was using the CLT; let's just keep it here, and see what we would do differently.

What Hoeffding tells me is about the probability that Xn bar minus -- well, what is mu in this case? It's p, right? It's just notation: mu was the mean, but we call it p in the case of Bernoulli's -- exceeds, let's just call it epsilon for a second. So Hoeffding tells me that this is bounded by 2 times exponential of minus 2n epsilon squared. The nice thing is that I pick up a factor n here. And what is b minus a squared for Bernoulli's? 1. So I don't have a denominator here. And I'm going to do exactly what I did before: I'm going to set this guy equal to alpha. If I get alpha here, then just solving for epsilon, I get some number which will play the role of q alpha over 2, and then I'll be able to say that p is between Xn bar minus epsilon and Xn bar plus epsilon.

OK, so let's do it. We have to solve the equation: 2 times exponential of minus 2n epsilon squared equals alpha. There's a 2 right here, so that means I get alpha over 2 here. Then I take logs on both sides, and I solve for epsilon. So epsilon is equal to the square root of log of 2 over alpha, divided by 2n. Yes?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Why is b minus a equal to 1? Well, let's just look. X lives in an interval of length b minus a. So I could take b to be 25, and a to be negative 42.
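The derivation just completed -- solving 2 exp(-2 n epsilon^2) = alpha to get epsilon = sqrt(log(2/alpha) / (2n)) -- can be sketched in code, with a coverage check by simulation. This is an illustrative Python sketch (standard library only); p = 0.5 and n = 20 are arbitrary choices, deliberately small since the point is that no CLT is needed.

```python
import math
import random

def hoeffding_ci(xbar, n, alpha=0.05):
    """Xn bar +/- sqrt(log(2/alpha)/(2n)): valid for every n, no limit required."""
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return xbar - eps, xbar + eps

def coverage(p=0.5, n=20, alpha=0.05, reps=5000, seed=1):
    """Fraction of simulated samples whose Hoeffding interval contains the true p."""
    rng = random.Random(seed)
    hit = 0
    for _ in range(reps):
        xbar = sum(rng.random() < p for _ in range(n)) / n
        lo, hi = hoeffding_ci(xbar, n, alpha)
        hit += lo <= p <= hi
    return hit / reps
```

Because the guarantee is non-asymptotic, the empirical coverage comes out at least 1 - alpha even at n = 20 -- in fact well above it, since the bound is conservative.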
But I'm going to try to be as sharp as I can. All right, so what is the tightest interval you can think of that a Bernoulli random variable is guaranteed to live in? What values does a Bernoulli random variable take? 0 and 1. So it takes values between 0 and 1, and it actually attains those values. In fact, Bernoulli is the worst possible case for Hoeffding's inequality.

So now I just get this one. So when I solve this guy over there, combining this thing and this thing implies that the probability that p lives between Xn bar minus square root of log 2 over alpha divided by 2n, and Xn bar plus square root of log 2 over alpha divided by 2n, is equal to -- I mean, is at least -- what is it at least equal to?

This here controls the probability of being outside of the interval, right? It tells me the probability that Xn bar is far from p by more than epsilon. So that's the probability of being outside of the interval that I just wrote. And the probability of being in the interval is 1 minus the probability of being outside. So this is at least 1 minus alpha. I just used the fact that the probability of the complement is 1 minus the probability of the set. And since I have an upper bound on the probability of the set, I have a lower bound on the probability of the complement.

So now it's a bit different. Before, we actually wrote something that was -- let me get it back. If we go back to the example where we took the [INAUDIBLE] over p, we got this guy: Xn bar plus or minus q alpha over 2, divided by 2 square root of n.
And so now we have something that replaces this q alpha over 2, and it's essentially the square root of 2 log 2 over alpha. Because if I replace q alpha over 2 by the square root of 2 log 2 over alpha, I get exactly this thing here. And so the question is, what would you guess? Is this margin, square root of log 2 over alpha divided by 2n, smaller or larger than this guy, q alpha over 2 divided by 2 square root of n? Yes? Larger. Everybody agrees with this? Just qualitatively? Right, because we just made a very conservative statement. We did not use any assumptions. This is true always, so it can only be less sharp. The reason in statistics why you use those assumptions -- that n is large enough, that you have this independence that you like so much, so that the central limit theorem can kick in -- all these things are there so that you have enough assumptions to make sharper and sharper decisions, more and more confident statements. And that's why there's all this junk science out there: people make too many assumptions for their own good. They say, well, let's assume that everything is the way I love it, so that for sure, for any minor change, I will be able to say that's because I made an important scientific discovery -- rather than, well, that was just [INAUDIBLE]. OK?

So now here's the fun moment. And actually, let me tell you why we look at this. Who has seen different types of convergence in a probability or statistics class? [INAUDIBLE] students. So there are different types of convergence. For real numbers it's very simple: there's one convergence, xn tends to x.
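Back to the comparison made a moment ago: the guess that the Hoeffding margin is the larger one amounts to comparing q alpha over 2 with sqrt(2 log(2/alpha)). A two-line numerical check, as an illustration (Python standard library):

```python
import math
from statistics import NormalDist

def clt_margin(n, alpha=0.05):
    """Conservative CLT margin: q_{alpha/2} / (2 sqrt(n))."""
    return NormalDist().inv_cdf(1 - alpha / 2) / (2 * math.sqrt(n))

def hoeffding_margin(n, alpha=0.05):
    """Hoeffding margin: sqrt(log(2/alpha) / (2n))."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))
```

For alpha = 0.05, sqrt(2 log(2/alpha)) is about 2.72 against q alpha over 2 of about 1.96, so the assumption-free interval is roughly 40% wider at every n.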
Once you start thinking about functions, well, maybe you have uniform convergence, you have pointwise convergence. So if you've done some real analysis, you know there are different types of convergence you can think of. And for convergence of random variables, there are also different types, but for different reasons. The question is: what do you do with the randomness? When you say that something converges to something, it probably means that you're willing to tolerate low-probability events on which it doesn't happen, and how you handle those creates the different types of convergence.

So, to be fair, in statistics the only convergence we care about is convergence in distribution. That's this one -- the one that comes from the central limit theorem. It's actually the weakest one you could ask for. Which is good, because that means it's going to happen more often. And why is this enough? Because the only thing we really need is to say that when I start computing probabilities on this random variable, they're going to look like probabilities on that random variable.

All right, so for example, think of the following two random variables: X and minus X. So this is the same random variable, and this one is its negative. When I look at those two random variables -- think of them as constant sequences -- these two constant sequences do not go to the same limit, right? One is X, the other one is minus X. So unless X is the random variable always equal to 0, those two things are different. However, when I compute probabilities on this guy, and when I compute probabilities on that guy, they're the same.
Because X and minus X have the same distribution, just by symmetry of the Gaussian random variable. And so you can see this is very weak. I'm not saying anything about the two random variables being close to each other each time I flip my coin, right? Maybe I query my computer and ask, what is X? Well, it's 1.2. Then negative X is negative 1.2. Those things are far apart, and it doesn't matter, because those two random variables assign the same probabilities to everything that can happen. And that's all we care about in statistics. You need to realize that this is what's important -- and you have it really good, if all you need is convergence in distribution rather than, say, convergence almost surely, which is probably the strongest you can think of.

So we're going to talk about different types of convergence, and not just to reflect on how good our life is. The problem is that convergence in distribution is so weak that I cannot do everything I want with it. In particular, I cannot say that if Xn converges in distribution, and Yn converges in distribution, then Xn plus Yn converges in distribution to the sum of their limits. I cannot do that; it's just too weak. Think of this example: this converges in distribution to some N(0, 1); this also converges in distribution to some N(0, 1); but their sum is 0, and that certainly doesn't look like the sum of two independent Gaussian random variables, right? And so what we need is stronger conditions here and there, so that we can actually put things together. And we're going to have more complicated formulas.
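The X versus minus X example can be simulated directly. A Python sketch (standard library only; the sample size and the threshold 1.0 are arbitrary illustrative choices):

```python
import random

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50000)]  # X ~ N(0, 1)
neg = [-x for x in xs]                        # -X: same distribution, by symmetry

# Probabilities computed on X and on -X agree, up to simulation error...
p_x = sum(x > 1.0 for x in xs) / len(xs)
p_neg = sum(x > 1.0 for x in neg) / len(neg)

# ...but the sum X + (-X) is identically 0, not N(0, 2):
max_sum = max(abs(a + b) for a, b in zip(xs, neg))
```

Here p_x and p_neg both land near P(Z > 1), about 0.159, while max_sum is exactly 0: convergence in distribution says nothing about joint behavior, which is why the limits cannot simply be added.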
One of the formulas, for example, is what we get if I replace p by p hat in this denominator. We mentioned doing this at some point. For that I would need p hat to go to p, but I need this to happen in a sense stronger than convergence in distribution.

So here are the first two, strongest senses in which random variables can converge. The first one is almost surely. Who has already seen this notation, little omega, when talking about random variables? All right, so very few. So what is a random variable? A random variable is something that you measure on something that's random. The example I like to think of is: take a ball of snow and put it in the sun for some time. You come back; it's going to have a random shape, a random blob of something. But there's still a bunch of things you can measure on it. You can measure its volume. You can measure its inner temperature. You can measure its surface area. All these things are random variables, but the ball itself is omega. That's the thing on which you make your measurements. And so a random variable is just a function of those omegas.

Now why do we make all these things fancy? Because you cannot take just any function. The function has to be what's called measurable, and there are entire courses on measure theory; not everything is measurable. And that's why you have to be a little careful: you need some sort of nice property, so that, for instance, the measure of the union of two things is less than the sum of the measures, things like that.
And so almost sure convergence is telling you that for most of the balls, for most of the omegas -- that's the right-hand side -- the probability of the set of omegas such that those things converge to each other is actually equal to 1. So it tells me that if I put together almost all the omegas, I get something that has probability 1. There might be other omegas where convergence fails, but they have probability 0. What it's telling you is that this thing happens for essentially all possible realizations of the underlying randomness. That's very strong. It essentially says randomness does not matter, because it's happening always.

Now, convergence in probability allows you to squeeze a little bit of probability under the rug. It tells you: I want the convergence to hold, but I'm willing to let go of some little epsilon. So I'm willing to allow Tn minus T to be larger than epsilon, as long as the probability of that goes to 0 as n goes to infinity. For each n this probability does not have to be 0, which is different from before, right? So it's a little weaker, a slightly different statement. I'm not going to ask you to show that one is weaker than the other, but just know that these are two different types. This one is actually much easier to check than that one.

Then there's something called convergence in Lp. It embodies the following fact: if I give you a sequence of random variables with mean 0, and I tell you that their variances go to 0, then they converge to 0. So think of Gaussian random variables with mean 0 and a variance that shrinks to 0.
Such a random variable converges to a spike at 0, so it converges to 0, right? And to get this convergence, all I had to tell you was that the variance goes to 0. That's really what convergence in L2 is telling you. It holds for any random variables Tn and T -- here what I described was for a deterministic limit -- Tn goes to T in L2 if the expectation of the squared distance between them goes to 0. But you don't have to limit yourself to the square. You can take a power of 3. You can take power 67.6, power 9 pi. You can take whatever power you want; it can even be fractional. It has to be at least 1, and that's convergence in Lp. But we mostly care about integer p.

And then here's our star, convergence in distribution, and that's just the one that tells you that when I start computing probabilities on the Tn, they're going to look very close to the probabilities on the T. So Tn was this guy, for example, and T was the standard Gaussian distribution. Now here, this is not any probability: it's just the probability of being less than or equal to x. But if you remember your probability class, if you can compute those probabilities, you can compute any probabilities, just by subtracting and building things together. And I need this for each x: you fix x, and then you make n go to infinity. And I only want this at the points x where the cumulative distribution function of T is continuous. There might be jumps, and at those I don't actually care.

All right, so here I mentioned it for random variables.
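The shrinking-variance Gaussian example lends itself to a quick numerical look at both senses of convergence just defined. An illustrative Python sketch, taking Tn to be N(0, 1/n) (my choice of example, consistent with the lecture's "variance shrinks to 0" picture):

```python
import random

def mean_square(n, reps=20000, seed=0):
    """Monte Carlo E[Tn^2] for Tn ~ N(0, 1/n): convergence in L2 means this -> 0."""
    rng = random.Random(seed)
    return sum(rng.gauss(0, n ** -0.5) ** 2 for _ in range(reps)) / reps

def prob_exceeds(n, eps=0.1, reps=20000, seed=0):
    """Monte Carlo P(|Tn| > eps): convergence in probability means this -> 0."""
    rng = random.Random(seed)
    return sum(abs(rng.gauss(0, n ** -0.5)) > eps for _ in range(reps)) / reps
```

Both quantities shrink as n grows, matching the "spike at 0" picture: the mean square tracks the variance 1/n, and the exceedance probability collapses once the standard deviation drops below epsilon.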
32:17 If you're interested, there's also random vectors. 32:19 A random vector is just a table of random variables. 32:23 You can talk about random matrices. 32:25 And you can talk about random whatever you want. 32:27 Every time you have an object that's 32:28 just collecting real numbers, you can just 32:31 plug random variables in there. 32:33 And so all these definitions extend. 32:37 So where you see an absolute value, 32:39 we'll see a norm. 32:40 Things like this. 32:43 So I'm sure this might look scary a little bit, 32:46 but really what we are going to use is only the last one, which 32:49 as you can see is just telling you 32:50 that the probabilities converge to the probabilities. 32:52 But I'm going to need the other ones every once in a while. 32:55 And the reason is, well, OK, so here I'm 32:59 actually going to the important characterizations 33:02 of the convergence in distribution, 33:05 which is our star convergence. 33:08 So Tn converges in distribution if and only 33:10 if for any function that's continuous and bounded, 33:14 when I look at the expectation of f of Tn, 33:16 this converges to the expectation of f of T. OK, 33:19 so those two things are actually equivalent. 33:25 Sometimes it's easier to check one, easier to check the other, 33:27 but in this class you won't have to prove that something 33:30 converges in distribution other than just combining 33:33 our existing convergence results. 33:37 And then the last one which is equivalent to the above two 33:40 is, anybody knows what the name of this quantity is? 33:42 This expectation here? 33:45 What is it called? 33:47 The characteristic function, right? 33:49 And so this i is the complex i, a complex number. 
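As a small numerical aside (my own sketch, with arbitrary parameters), the empirical characteristic function of a standardized average does approach e to the minus x squared over 2. It uses the shortcut that a mean of n Exp(1) variables has a Gamma(n, 1/n) distribution:

```python
import numpy as np

# Sketch (mine): the empirical characteristic function E[e^{ixZ_n}] of a
# standardized average approaches exp(-x^2/2).  Shortcut: a mean of n
# Exp(1) draws has a Gamma(n, 1/n) distribution, so we sample it directly.
rng = np.random.default_rng(2)
n, reps = 400, 100_000
z = np.sqrt(n) * (rng.gamma(n, 1.0 / n, reps) - 1.0)   # sigma = 1 here

diffs = []
for x in (0.5, 1.0):
    emp = np.mean(np.exp(1j * x * z))                  # empirical E[e^{ixZ}]
    diffs.append(abs(emp - np.exp(-x * x / 2)))
print(diffs)                                           # both small
```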
33:52 And so it's essentially telling me 33:54 that, well, rather than actually looking 33:56 at all bounded and continuous but real functions, 33:58 I can actually look at one specific family 34:03 of complex functions, which are the functions that map 34:08 T to e to the ixT for x in R. That's 34:12 a much smaller family of functions. 34:14 The set of all continuous and bounded functions 34:17 has many more elements than just this family. 34:21 And so now I can show that if I limit myself to this family, 34:24 it's actually sufficient. 34:25 34:28 So those three things are used all over the literature just 34:32 to show things. 34:33 In particular, if you're interested in digging 34:37 a little deeper mathematically, the central limit theorem 34:39 is going to be so important. 34:40 Maybe you want to read about how to prove it. 34:42 We're not going to prove it in this class. 34:43 There's probably at least five different ways of proving it, 34:49 but the most canonical one, the one that you find in textbooks, 34:52 is the one that actually uses the third point. 34:55 So you just look at the characteristic function 34:59 of square root of n times Xn bar minus say mu, 35:04 and you just expand the thing, and this is what you get. 35:07 And you will see that in the end, 35:09 you will get the characteristic function of a Gaussian. 35:13 Why a Gaussian? 35:14 Why does it kick in? 35:15 Well, because what is the characteristic function 35:17 of a Gaussian? 35:17 Does anybody remember the characteristic function 35:19 of a standard Gaussian? 35:20 AUDIENCE: [INAUDIBLE] 35:21 PHILIPPE RIGOLLET: Yeah, well, I mean 35:23 there's two pi's and stuff that goes away, right? 35:27 A Gaussian is a random variable. 35:29 A characteristic function is a function, 35:31 and so it's not really itself. 35:33 It looks like itself. 35:34 Anybody knows what the actual formula is? 35:37 Yeah. 35:37 AUDIENCE: [INAUDIBLE] 35:39 PHILIPPE RIGOLLET: E to the minus? 
35:41 AUDIENCE: E to the minus x squared over 2. 35:42 PHILIPPE RIGOLLET: Exactly. 35:43 E to the minus x squared over 2. 35:44 But this x squared over 2 is actually 35:46 just the second order expansion in the Taylor expansion. 35:49 And that's why the Gaussian is so important. 35:51 It's just the second order Taylor expansion. 35:54 And so you can check it out. 35:56 I think Terry Tao has some stuff on his blog, 35:58 and there's a bunch of different proofs. 36:00 But if you want to prove convergence in distribution, 36:02 you very likely are going to use one of these three right here. 36:07 So let's move on. 36:09 36:13 This is when I said that this convergence is 36:15 weaker than that convergence. 36:17 This is what I meant. 36:18 If you have convergence in one style, 36:20 it implies convergence in the other style. 36:23 So the first statement is that if Tn converges almost surely, 36:26 this a dot s dot means almost surely, 36:28 then it also converges in probability 36:31 and actually the two limits, which 36:32 are this random variable T, are equal almost surely. 36:37 Basically what it means is that whatever you measure on one 36:39 is going to be the same that you measure on the other one. 36:42 So that's very strong. 36:44 So that means that convergence almost surely 36:47 is stronger than convergence in probability. 36:50 If you converge in Lp then you also converge 36:53 in Lq for some q less than p. 36:56 So if you converge in L2, you'll also converge in L1. 36:59 If you converge in L67, you converge in L2. 37:03 If you converge in L infinity, 37:04 you converge in Lp for anything. 37:09 And so, again, limits are equal. 37:12 And then, 37:14 when you converge in probability, 37:15 you also converge in distribution. 37:18 OK, so almost surely implies probability. 37:22 Lp implies probability. 37:24 Probability implies distribution. 37:26 And here note that I did not write, 37:28 and the limits are equal almost surely. 
37:30 Why? 37:31 37:35 Because the convergence in distribution 37:37 is actually not telling you that your random variable 37:38 is converging to another random variable. 37:40 It's telling you that the distribution 37:42 of your random variable is converging to a distribution. 37:45 And think of this, guys. 37:47 X and minus X. 37:49 The central limit theorem tells me 37:50 that I'm converging to some standard Gaussian distribution, 37:53 but am I converging to X or am I converging to minus X? 37:57 It's not well identified. 37:58 It's any random variable that has this distribution. 38:01 So there's no way the limits are equal. 38:04 Their distributions are going to be the same, 38:06 but they're not the same limit. 38:07 Is that clear for everyone? 38:09 So in a way, convergence in distribution 38:12 is really not a convergence of a random variable 38:15 towards another random variable. 38:16 It's just telling you the limiting distribution 38:18 of your random variable, 38:20 which is enough for us. 38:22 And one thing that's actually really nice 38:24 is this continuous mapping theorem, which 38:28 essentially tells you that-- 38:30 so this is one of the theorems that we like, 38:32 because they tell us you can do what 38:33 you feel like you want to do. 38:35 So if I have Tn that goes to T, f of Tn goes to f of T, 38:39 and this is true for any of those convergences 38:42 except for Lp. 38:45 38:48 But you need f to be continuous, otherwise 38:51 weird stuff can happen. 38:54 So this is going to be convenient, because here I 38:58 don't have Xn bar minus p. 39:00 I have a continuous function. 39:01 It's basically a linear function of Xn bar minus p, 39:03 but I could think of like even crazier stuff to do, 39:05 and it would still be true. 39:07 If I took the square, it would converge to something that 39:10 looks like its distribution. 39:11 It's the same as the distribution 39:12 of a squared Gaussian. 
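The continuous mapping theorem can be illustrated with the square function just mentioned: if Zn goes to a standard Gaussian in distribution, Zn squared should behave like a chi-squared with 1 degree of freedom. A sketch of mine:

```python
import numpy as np

# Sketch (mine): continuous mapping with f(x) = x^2.  Z_n is a
# standardized Bernoulli average, so Z_n^2 should look chi-squared
# with 1 degree of freedom (mean 1, P(X <= 1) about 0.683).
rng = np.random.default_rng(3)
n, reps, p = 300, 40_000, 0.4
z = np.sqrt(n) * (rng.binomial(n, p, reps) / n - p) / np.sqrt(p * (1 - p))
sq = z ** 2                        # continuous function of Z_n

mean_sq = sq.mean()                # chi-squared_1 has mean 1
frac_below = np.mean(sq <= 1.0)    # chi-squared_1: P(X <= 1) ~ 0.683
print(mean_sq, frac_below)
```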
39:16 So this is a mouthful, these two slides-- 39:18 actually this particular slide is a mouthful. 39:20 What I have in my head since I was pretty much where you're 39:24 sitting, is this diagram. 39:27 So what it tells me-- so it's actually voluntarily cropped, 39:32 so you can start from any Lq you want, q large. 39:35 And then as you decrease the index, 39:38 you are actually implying, implying, 39:39 implying until you imply convergence in probability. 39:42 Convergence almost surely implies convergence 39:44 in probability, and everything goes to the sink, 39:49 that is convergence in distribution. 39:52 So everything implies convergence in distribution. 39:55 So that's basically rather than remembering those formulas, 39:57 this is really the diagram you want to remember. 39:59 40:02 All right, so why do we bother learning about those things? 40:06 That's because of this: limits and operations. 40:09 Operations and limits. 40:10 If I have a sequence of real numbers, 40:13 and I know that Xn converges to X and Yn converges to Y, 40:17 then I can start doing all my manipulations and things 40:20 are happy. 40:20 I can add stuff. 40:21 I can multiply stuff. 40:23 But it's not always true for convergence in distribution. 40:28 But, what's nice, it's actually 40:29 true for convergence almost surely. 40:32 With convergence almost surely, everything is true. 40:35 It's just impossible to make it fail. 40:38 But convergence in probability does not give you everything, 40:41 but at least you can actually add stuff and multiply stuff. 40:43 And it will still give you the sum of the limits, 40:46 and the product of the limits. 40:49 You can even take the ratio if V is not 0 of course. 40:55 If the limit is not 0, then actually 40:57 you need Vn to be not 0 as well. 40:58 41:01 You can actually prove this last statement, right? 41:05 Because it's a combination of the first statement, 41:08 the second one, and the continuous mapping theorem. 
41:11 Because the function that maps x to 1 41:14 over x on everything but 0, is continuous. 41:19 And so 1 over Vn converges to 1 over V, 41:24 and then I can multiply those two things. 41:26 So you actually knew that one. 41:28 But really this is not what matters, 41:30 because this is something that you will do whatever happens. 41:35 If I don't tell you you cannot do it, well, you will do it. 41:37 But in general those things don't 41:39 apply to convergence in distribution 41:40 unless the pair itself is known to converge in distribution. 41:44 Remember when I said that these things apply to vectors, 41:48 then you need to actually say that the vector converges 41:51 in distribution to the limiting vector. 41:53 Now this tells you in particular, 41:55 since the cumulative distribution function is not 41:57 defined for vectors, I would have 41:59 to actually use one of the other definitions, one 42:02 of the other criteria, which is convergence 42:04 of characteristic functions or convergence 42:07 of expectations of bounded continuous functions 42:11 of the random variable. 42:12 Point two or point three, but point one is not going to get you anywhere. 42:17 But this is something that's going 42:18 to be too hard for us to deal with, so we're actually 42:20 going to rely on the fact that we have 42:23 something that's even better. 42:24 There's something that is waiting for us 42:26 at the end of this lecture, which is called Slutsky's theorem, and it says 42:29 that if V, in this case, converges in probability 42:33 but U converges in distribution, I can actually still do that. 42:36 I actually don't need both of them 42:37 to converge in probability. 42:38 I actually need only one of them to converge in probability 42:41 to make this statement. 42:42 But to a constant. 42:45 So let's go to another example. 42:47 So I just want to make sure that we keep on doing statistics. 
42:49 And every time we're going to just do a little bit 42:51 too much probability, I'm going to release the pressure, 42:54 and start doing statistics again. 42:56 All right, so assume you observe 42:59 the inter-arrival times of the T at Kendall. 43:04 So this is not the arrival time. 43:06 It's not like 7:56, 8:15. 43:09 No, it's really the inter-arrival time, right? 43:12 So say the next T is arriving in six minutes. 43:17 So let's say [INAUDIBLE] bound. 43:20 And so you have this inter-arrival time. 43:23 So those are numbers say, 3, 4, 5, 4, 3, et cetera. 43:27 So I have this sequence of numbers. 43:29 So I'm going to observe this, and I'm 43:31 going to try to infer what is the rate of T's going out 43:36 of the station from this. 43:38 So I'm going to assume that these things are 43:40 mutually independent. 43:43 That's probably not completely true. 43:44 Again, what it would mean 43:46 is that two consecutive inter-arrival times are 43:49 independent. 43:50 I mean, you can make it independent if you want, 43:52 but again, this independence assumption 43:53 is for us to be happy and safe. 43:56 Unless someone comes with overwhelming proof 43:58 that it's not independent and far from being independent, 44:01 then yes, you have a problem. 44:03 But it might be the fact that it's actually-- if you 44:06 have a T that's one hour late. 44:09 If an inter-arrival time is one hour, then the other T, 44:12 either they fixed it, and it's going 44:14 to be just 30 seconds behind, or they haven't fixed it, 44:17 then it's going to be another hour behind. 44:18 So they're not exactly independent, 44:20 but they approximately are when things work well. 44:24 And so now I need to model a random variable that's 44:27 positive, maybe not upper bounded. 44:29 I mean, people complain enough that this thing 44:31 can be really large. 44:32 And so one thing that people like for inter-arrival times 44:34 is the exponential distribution. 
44:36 So that's a positive random variable. 44:38 Looks like an exponential on the right-hand side, 44:40 on the positive line. 44:41 And so it decays very fast towards 0. 44:43 The probability that you have very large 44:45 values is exponentially small, and there's a parameter lambda 44:49 that controls how the exponential is defined. 44:50 It's exponential minus lambda times something. 44:53 And so we're going to assume that they 44:56 have the same distribution, the same random variable. 44:58 So they're IID, because they are independent, 45:00 and they're identically distributed. 45:01 They all have this exponential with parameter lambda, 45:04 and I'm going to try to learn something about lambda. 45:06 What is the estimated value of lambda, 45:08 and can I build a confidence interval for lambda? 45:12 So we observe n arrival times. 45:16 So as I said, the mutual independence 45:20 is plausible, but not completely justified. 45:24 The fact that they're exponential 45:25 is actually something that people like in all this what's 45:27 called queuing theory. 45:29 So exponentials arise a lot when you 45:31 talk about inter-arrival times. 45:32 It's not about the bus, but where 45:34 it's very important is call centers, servers where 45:41 tasks come, and people want to know how long it's 45:45 going to take to serve a task. 45:47 So when I call at a center, nobody 45:50 knows how long I'm going to stay on the phone with this person. 45:52 But it turns out that empirically exponential 45:54 distributions have been very good at modeling this. 45:56 And what it means is that they're actually-- 45:58 you have this memoryless property. 46:01 It's kind of crazy if you think about it. 46:03 What does that thing say? 46:04 Let's parse it. 46:06 That's the probability. 46:08 So this is conditioned on the fact that T1 is larger than t. 46:12 So T1 is just, say, the first arrival time. 
46:14 That means that conditionally on the fact 46:16 that I've been waiting for the first T, well, 46:19 the first subway, 46:23 for more than t-- 46:27 so I've been there t minutes already. 46:30 Then the probability that I wait for s more minutes. 46:33 So that's the probability that T1 is larger than t, 46:35 the time that we've already waited, plus s. 46:38 Given that I've been waiting for t minutes, 46:40 the probability that I wait for s more minutes 46:42 is actually the probability that I wait for s minutes total. 46:46 It's completely memoryless. 46:47 It doesn't remember how long you have been waiting. 46:49 The probability does not change. 46:51 You can have waited for two hours, the probability 46:53 that it takes another 10 minutes is 46:55 going to be the same as if you had 46:56 been waiting for zero minutes. 46:59 And that's something that's actually 47:00 part of your problem set. 47:02 Very easy to compute. 47:03 This is just an analytical property. 47:05 And you just manipulate functions, 47:07 and you see that this thing just happens to be true, 47:09 and that's something that people like. 47:11 Because that's also something that's convenient. 47:15 And also what we like is that this thing is positive 47:17 almost surely, which is good when you model arrival times. 47:21 To be fair, we're not going to be that careful. 47:23 Because sometimes we are just going 47:24 to assume that something follows a normal distribution. 47:29 And in particular, I mean, I don't 47:30 know if we're going to go into the details, 47:32 but a good thing that you can model with a Gaussian 47:34 distribution is heights of students. 47:38 But technically with positive probability, 47:40 you can have a negative Gaussian random variable, right? 47:44 And that probability is probably 10 to the minus 25, 47:48 but it's positive. 47:49 But it's good enough for us for our modeling. 
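The memoryless property being parsed here is easy to check by simulation. A sketch of mine, with an arbitrary lambda = 0.5:

```python
import numpy as np

# Sketch (mine, arbitrary lambda): memorylessness of the exponential,
# P(T > t + s | T > t) = P(T > s) = e^{-lambda*s}.
rng = np.random.default_rng(5)
lam, t, s = 0.5, 2.0, 1.0
T = rng.exponential(scale=1.0 / lam, size=1_000_000)

cond = np.mean(T[T > t] > t + s)   # P(T > t+s | T > t)
uncond = np.mean(T > s)            # P(T > s)
exact = np.exp(-lam * s)
print(cond, uncond, exact)         # all three should agree
```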
47:51 So this thing is nice, but this is not going to be required. 47:54 When you're modeling positive random variables, 47:56 you don't always have to use 47:59 distributions that are supported on positive numbers. 48:01 You can use distributions like the Gaussian. 48:03 48:06 So now this exponential distribution of T1 to Tn-- 48:09 they have the same parameter, and that 48:11 means that on average they have the same inter-arrival time. 48:14 So this lambda is actually the expectation. 48:16 And what I'm just saying is that they're identically 48:19 distributed means that I'm in some sort 48:21 of a stationary regime, and it's not always true. 48:24 I have to look at a shorter period of time, 48:26 because at rush hour and 11:00 PM 48:28 clearly those average inter-arrival times 48:31 are going to be different. So it means that I am really 48:33 focusing maybe on rush hour. 48:35 48:38 Sorry, I said it's lambda. 48:39 It's actually 1 over lambda. 48:40 I always mix the two. 48:42 All right, so you have the density of T1. 48:44 So f of t is this. 48:46 So it's on the positive real line. 48:49 The fact that I have strictly positive or larger than 48:52 or equal to 0 doesn't make any difference. 48:54 So this is the density. 48:55 So it's lambda e to the minus lambda t. The lambda in front 48:58 just ensures that when I integrate 48:59 this function between 0 and infinity, I get 1. 49:03 And you can see, it decays like exponential minus lambda t. 49:06 So if I were to draw it, it would just look like this. 49:09 49:13 So at 0, what value does it take? 49:17 Lambda. 49:19 And then it decays like exponential minus lambda t. 49:23 So this is 0, and this is f of t. 49:30 So very small probability of being very large. 49:33 Of course, it depends on lambda. 49:35 Now the expectation, you can compute the expectation 49:37 of this thing, right? 49:38 So you integrate t times f of t. This 49:41 is part of the little sheet that I gave you last time. 
49:44 This is one of the things you should 49:45 be able to do blindfolded. 49:47 And then you get the expectation of T1 is 1 over lambda. 49:51 That's what comes out. 49:53 So as I actually tell many of my students, 99% of statistics 49:57 is replacing expectations by averages. 50:00 And so what you're tempted to do is say, well, if on average I'm 50:02 supposed to see 1 over lambda, I have 15 observations. 50:05 I'm just going to average those observations, 50:07 and I'm going to see something that should be close to 1 50:10 over lambda. 50:11 So statistics is about replacing 50:14 expectations with averages, and that's what we do. 50:17 So Tn bar here, which is the average of the Ti's, is 50:21 a pretty good estimator for 1 over lambda. 50:25 So if I want an estimate for lambda, 50:27 then I need to take 1 over Tn bar. 50:30 So here is one estimator. 50:32 I did it without much principle except that I just 50:36 want to replace expectations by averages, 50:38 and then I fixed the problem that I was actually estimating 50:41 1 over lambda rather than lambda. 50:43 But you could come up with other estimators, right? 50:45 But let's say this is my way of getting to that estimator. 50:49 Just like I didn't give you any principled way of getting p 50:52 hat, which is Xn bar in the kiss example. 50:54 But that's the natural way to do it. 50:57 Everybody is completely shocked by this approach? 51:01 All right, so let's do this. 51:03 So what can I say about the properties of this estimator 51:06 lambda hat? 51:08 Well, I know that Tn bar is going to 1 over lambda 51:12 by the law of large numbers. 51:14 It's an average. 51:14 It converges to the expectation both almost surely, 51:18 and in probability. 51:19 So the first one is the strong law of large numbers, 51:21 the second one is the weak law of large numbers. 51:23 I can apply the strong one. 51:24 I have enough conditions. 51:26 And hence, what do I apply so that 1 over Tn bar 51:31 actually goes to lambda? 
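The two-step convergence being set up here, Tn bar to 1 over lambda and then 1 over Tn bar to lambda, can be watched numerically. A sketch of mine, with an arbitrary lambda = 2:

```python
import numpy as np

# Sketch (mine, arbitrary lambda = 2): Tbar_n -> 1/lambda by the law of
# large numbers, so lambda_hat = 1/Tbar_n should approach lambda.
rng = np.random.default_rng(6)
lam = 2.0

for n in (100, 10_000, 1_000_000):
    lam_hat = 1.0 / rng.exponential(scale=1.0 / lam, size=n).mean()
    print(n, lam_hat)              # approaches 2.0 as n grows
```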
51:35 So I said hence. 51:36 What is hence? 51:37 What is it based on? 51:37 AUDIENCE: [INAUDIBLE] 51:43 PHILIPPE RIGOLLET: Yeah, continuous mapping theorem, 51:45 right? 51:45 So I have this function 1 over x. 51:47 I just apply this function. 51:49 So if it was 1 over lambda squared, 51:51 I would have the same thing that would 51:52 happen just because the function 1 over x 51:54 is continuous away from 0. 51:58 And now the central limit theorem 52:00 is also telling me something about lambda. 52:02 About Tn bar, right? 52:03 It's telling me that if I look at my average, 52:05 I remove the expectation here. 52:08 So if I do Tn bar minus my expectation, 52:11 rescale by this guy here, then this thing is going 52:15 to converge to some Gaussian random variable, 52:18 but here I have this lambda to the negative 1-- 52:21 to the negative 2 here, and that's 52:23 because I did not tell you that if you 52:25 compute the variance-- 52:28 so from this, you can probably extract. 52:30 52:34 So if I have X that follows some exponential distribution 52:39 with parameter lambda. 52:40 Well, let's call it T. 52:42 So we know that the expectation 52:46 of T is 1 over lambda. 52:48 What is the variance of T? 52:49 52:56 You should be able to read it from the thing here. 53:00 53:09 1 over lambda squared. 53:10 That's what you actually read in the variance, 53:12 because the central limit theorem is really telling you 53:16 the distribution goes through this n. 53:19 But this number and this number you can read, right? 53:23 If you look at the expectation of this guy-- 53:26 it comes out. 53:26 This is 1 over lambda minus 1 over lambda. 53:28 That's why you read the 0. 53:30 And if you look at the variance of the dot, 53:32 you get n times the variance of this average. 53:36 Variance of the average is picking up a factor 1 over n. 53:39 So the n cancels. 
53:40 And then I'm left with only one of the variances, which 53:42 is 1 over lambda squared. 53:45 OK, so we're not going to do that in detail, 53:48 because, again, this is just a pure calculus exercise. 53:50 But this is if you compute the integral of lambda e 53:54 to the minus lambda t times, 53:58 actually, t minus 1 over lambda, squared, 54:01 dt between 0 and infinity. 54:05 You will see that this thing is 1 over lambda squared. 54:07 54:14 How would I do this? 54:15 54:20 Integration by parts, or you know it. 54:24 All right. 54:26 So this is what the central limit theorem tells me. 54:29 So this gives me, if I solve this, 54:31 and I plug in, so I can multiply by lambda and solve, 54:35 it would give me somewhat of a confidence interval for 1 54:40 over lambda. 54:42 If we just think of 1 over lambda 54:44 as being the p that I had before, 54:46 this would give me a central limit theorem for-- 54:48 54:51 sorry, a confidence interval for 1 over lambda. 54:54 So I'm hiding a little bit under the rug 54:56 the fact that I have to still define it. 54:58 Let's just actually go through this. 55:00 I see some of you are uncomfortable with this, 55:02 so let's just do it. 55:04 So what we've just proved by the central limit 55:06 theorem is that the probability that 55:09 square root of n times Tn bar minus 1 over lambda exceeds q alpha over 2 55:21 is approximately equal to alpha, right? 55:24 That's just the statement of the central limit theorem, 55:27 and by approximately equal I mean as n goes to infinity. 55:30 55:34 Sorry, I did not write it correctly. 55:36 I still have to divide by square root of 1 55:39 over lambda squared, which is the standard deviation, right? 55:43 And we said that this is a bit ugly. 55:44 So let's just do it the way it should be. 55:46 So multiply all these things by lambda. 
55:50 So that means now that, 55:56 with probability 1 minus alpha asymptotically, 55:59 I have that square root of n times lambda Tn bar minus 1, 56:07 in absolute value, is less than or equal to q alpha over 2. 56:11 56:14 So what it means is that, oh, I have negative q alpha over 2 56:20 less than square root of n-- 56:22 let me divide by square root of n here-- 56:25 lambda Tn bar minus 1, less than q alpha over 2. 56:34 And so now what I have is that I get that lambda is between-- 56:41 that's Tn bar-- is between 1 plus q alpha over 2 56:50 divided by root n, 56:53 and the whole thing is divided by Tn bar, 56:57 and same thing on the other side except I have 1 minus q alpha 57:04 over 2 divided by root n, divided by Tn bar. 57:08 57:12 So it's kind of a weird shape, but it's still 57:16 of the form 1 over Tn bar plus or minus something. 57:20 But this something depends on Tn bar itself. 57:23 And that's actually normal, because Tn bar is not only 57:26 giving me information about the mean, 57:29 but it's also giving me information about the variance. 57:31 So it should definitely come in the size of my error bars. 57:37 And that's the way it comes in this fairly natural way. 57:41 Everybody agrees? 57:43 So now I have actually built a confidence interval. 57:46 But what I want to show you with this example is, 57:50 can I translate this into a central limit 57:52 theorem for something that converges to lambda, right? 57:57 I know that Tn bar converges to 1 over lambda, 58:00 but I also know that 1 over Tn bar converges to lambda. 58:05 So do I have a central limit theorem for 1 over Tn bar? 58:09 Technically no, right? 58:11 Central limit theorems are about averages, and 1 over an average 58:14 is not an average. 58:16 But there's something that statisticians like a lot, 58:20 and it's called the Delta method. 
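The interval just derived can be sanity-checked by simulation: it should contain the true lambda in roughly 95% of repetitions when alpha = 5%. A sketch of mine, with arbitrary lambda and n:

```python
import numpy as np

# Sketch (mine, arbitrary lambda and n): the interval
# [(1 - q/sqrt(n))/Tbar, (1 + q/sqrt(n))/Tbar] with q = 1.96 should
# contain lambda in roughly 95% of repetitions.
rng = np.random.default_rng(7)
lam, n, reps, q = 2.0, 400, 10_000, 1.96

t_bar = rng.exponential(1.0 / lam, (reps, n)).mean(axis=1)
lo = (1 - q / np.sqrt(n)) / t_bar
hi = (1 + q / np.sqrt(n)) / t_bar
coverage = np.mean((lo <= lam) & (lam <= hi))
print(coverage)                    # close to 0.95
```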
58:23 The Delta method is really something 58:24 that's telling you that you can actually 58:27 take a function of an average, and let 58:30 it go to the function of the limit, 58:32 and you still have a central limit theorem. 58:34 And the factor, or the price to pay for this, 58:37 is something which depends on the derivative of the function. 58:44 And so let's just go through this, 58:46 and it's, again, just like the proof of the central limit 58:48 theorem. 58:49 And actually in many of those asymptotic statistics results, 58:53 this is actually just a Taylor expansion, 58:55 and here it's not even the second order, 58:57 it's actually the first order, all right? 58:59 So I'm just going to do a linear approximation of this function. 59:02 59:04 So let's do it. 59:05 So I have that g of Tn bar-- 59:12 actually let's use the notation of this slide, 59:15 which is Zn and theta. 59:17 So what I know is that Zn minus theta times square root of n 59:24 goes to some Gaussian, this standard Gaussian. 59:29 No, not standard. 59:32 OK, so that's the assumption. 59:34 And what I want to show is some convergence of g of Zn 59:40 to g of theta. 59:43 So I'm not going to multiply by root n just yet. 59:46 So I'm going to do a first order Taylor expansion. 59:49 So what it is telling me is that this is equal to Zn minus theta 59:57 times g prime of, let's call it theta bar, 60:01 where theta bar is somewhere between, say, 60:06 Zn and theta. 60:11 60:13 OK, so if theta is less than Zn you just permute those two. 60:17 So that's what the first order Taylor 60:21 expansion tells me. 60:21 There exists a theta bar that's between the two 60:23 values at which I'm expanding so that those two things are 60:26 equal. 60:29 Is everybody shocked? 60:31 No? 60:31 So that's a standard Taylor expansion. 60:36 Now I'm going to multiply by root n. 60:38 60:44 And so that's going to be what? 60:45 That's going to be root n Zn minus theta. 60:50 Ah-ha, that's something I like. 
60:51 Times g prime of theta bar. 60:57 60:59 Now the central limit theorem tells me 61:01 that this goes to what? 61:02 61:06 Well, this goes to some N(0, sigma squared), right? 61:12 That was the first line over there. 61:15 This guy here, well, it's not clear, right? 61:20 Actually it is. 61:21 Let's start with this guy. 61:24 What does theta bar go to? 61:28 Well, I know that Zn is going to theta. 61:30 61:33 Just because, well, that's my law of large numbers. 61:37 Zn is going to theta, which means 61:41 that theta bar is sandwiched between two values that 61:44 converge to theta. 61:46 So that means that theta bar converges to theta itself 61:49 as n goes to infinity. 61:51 That's just the law of large numbers. 61:54 Everybody agrees? 61:57 Just because it's sandwiched, right? 61:58 So I have Zn. 62:01 I have theta, and theta bar is somewhere here. 62:05 The picture might be reversed. 62:06 It might be that Zn is larger than theta. 62:08 But the law of large numbers tells me 62:10 that this guy is not moving, but this guy is moving that way. 62:14 So you know, when n is large, 62:16 there's very little wiggle room for theta bar, 62:18 and it can only get to theta. 62:19 62:23 And I call it the sandwich theorem, 62:25 or just find your favorite food in there. 62:29 So this guy goes to theta, and now I 62:31 need to make an extra assumption, which 62:33 is that g prime is continuous. 62:38 And if g prime is continuous, then g prime of theta bar 62:42 goes to g prime of theta. 62:44 So this thing goes to g prime of theta. 62:49 62:52 But I have an issue here. 62:54 It's that now I have something that 62:56 converges in distribution and something 62:57 that converges in, say-- 63:01 I mean, this converges almost surely or, say, in probability 63:04 just to be safe. 63:06 And this one converges in distribution. 63:09 And I want to combine them. 
63:11 But I don't have a slide that tells me 63:12 I'm allowed to take the product of something that converges 63:15 in distribution, and something that converges in probability. 63:18 This does not exist. 63:19 Actually, if anything it told me, 63:21 do not do anything with things that converge in distribution. 63:25 And so that gets us to our-- 63:32 OK, so I'll come back to this in a second. 63:36 And that gets us to something called Slutsky's theorem. 63:39 And Slutsky's theorem tells us that in very specific cases, 63:42 you can do just that. 63:44 So you have two sequences of random variables, 63:49 Xn that converges to X, and Yn that converges to Y, 63:53 but Y is not anything. 63:55 Y is not any random variable. 63:57 So Xn converges-- 63:59 sorry, I forgot to mention, this is very important. 64:01 Xn converges in distribution, Yn converges in probability. 64:04 And we know that in full generality we cannot combine those two 64:07 things, but Slutsky tells us that if the limit of Yn is 64:11 a constant, meaning it's not a random variable, 64:13 but it's a deterministic number, 2, 64:16 just a fixed number that's not a random variable, 64:18 then you can combine them. 64:21 Then you can sum them, and then you can multiply them. 64:24 64:28 I mean, actually you can do whatever combination you want, 64:31 because it actually implies that the vector Xn, Yn 64:34 converges to the vector X, c. 64:39 OK, so here I just took two combinations. 64:41 They are very convenient for us, the sum and the product, 64:44 so I could do other stuff like the ratio 64:45 if c is not 0, things like that. 64:47 64:51 So that's what Slutsky does for us. 64:53 So what you're going to have to write a lot in your homework, 64:56 in your mid-terms, is "by Slutsky." 64:58 I know some people are very generous with their by Slutsky. 65:03 They just do numerical applications, 65:05 mu is equal to 6, and therefore by Slutsky 65:08 mu squared is equal to 36. 
65:10 All right, so don't do that. 65:11 Just write "by Slutsky" when you're actually using Slutsky. 65:15 But this is something that's very important for us, 65:17 and it turns out that you're going 65:18 to feel like you can write "by Slutsky" all the time, 65:20 because that's going to work for us all the time. 65:23 Everything we're going to see is actually 65:25 going to be where we're going to have to combine stuff. 65:27 Since we only rely on convergence in distribution 65:30 arising from the central limit theorem, 65:32 we're actually going to have to rely on something that 65:34 allows us to combine them, and the only thing we know 65:36 is Slutsky. 65:37 So we better hope that this thing works. 65:40 So why does Slutsky work for us? 65:41 Can somebody tell me why Slutsky works 65:43 to combine those two guys? 65:46 So this one is converging in distribution. 65:48 This one is converging in probability, 65:51 but to a deterministic number. 65:54 g prime of theta is a deterministic number. 65:57 I don't know what theta is, but it's certainly deterministic. 66:02 All right, so I can combine them, multiply them. 66:04 So that's just the second line of that in particular. 66:08 All right, everybody is with me? 66:12 So now I'm allowed to do this. 66:13 You can actually-- you will see some 66:15 counterexample questions in your problem 66:16 set just so that you can convince yourself. 66:18 It's always a good thing. 66:19 I don't like to give them, because I 66:21 think it's much better for you to actually come 66:23 to the counterexample yourself. 66:24 Like, what can go wrong if 66:35 c is not a constant, 66:38 but a random variable? 66:42 You can figure that out. 66:45 All right, so let's go back. 66:46 So we have now this Delta method that tells us 66:49 that now I have a central limit theorem 66:51 for functions of averages, and not just for averages.
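For reference, the delta method argument sketched on the board can be written out compactly; this is only a restatement of the derivation above in the same notation, nothing new.

```latex
% CLT for the average Z_n of the sample, with mean \theta:
\sqrt{n}\,(Z_n - \theta) \xrightarrow{(d)} \mathcal{N}(0,\sigma^2)
% Mean value theorem, with \bar\theta between Z_n and \theta:
g(Z_n) - g(\theta) = g'(\bar\theta)\,(Z_n - \theta)
% \bar\theta \to \theta (sandwich + law of large numbers), g' continuous,
% so g'(\bar\theta) \to g'(\theta) in probability; then, by Slutsky,
\sqrt{n}\,\bigl(g(Z_n) - g(\theta)\bigr)
  = g'(\bar\theta)\,\sqrt{n}\,(Z_n - \theta)
  \xrightarrow{(d)} \mathcal{N}\bigl(0,\; g'(\theta)^2\,\sigma^2\bigr)
```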
66:55 So the only price to pay is this derivative there. 66:57 67:00 So, for example, if g is just a linear function, 67:05 then I'm going to have a constant multiplication. 67:07 If g is a quadratic function, then I'm 67:10 going to have theta squared that shows up there. 67:13 Things like that. 67:14 So just think of what kind of applications 67:16 you could have for this. 67:17 Here the function that we're interested in 67:19 is x maps to 1 over x. 67:21 What is the derivative of this guy? 67:23 67:25 What is the derivative of 1 over x? 67:29 Negative 1 over x squared, right? 67:31 That's the thing we're going to have to put in there. 67:33 And so this is what we get. 67:37 So now when I'm actually going to write this, 67:44 so if I want to show square root of n lambda hat minus lambda. 67:51 That's my application, right? 67:52 This is actually 1 over Tn, and this is 1 over 1 over lambda. 67:59 So the function g of x is 1 over x in this case. 68:05 So now I have this thing. 68:06 So I know that by the Delta method-- 68:08 oh, and I knew that, remember, 68:11 square root of n times Tn minus 1 over lambda 68:16 was going to some normal with mean 0 68:19 and variance 1 over lambda squared, right? 68:21 So the sigma squared over there is 1 over lambda squared. 68:26 So now this thing goes to what? 68:27 Some normal. 68:28 What is going to be the mean? 68:32 0. 68:32 68:35 And what is the variance? 68:37 So the variance is going-- 68:38 I'm going to pick up this guy, 1 over lambda 68:40 squared, and then I'm going to have to take g prime of what? 68:46 Of 1 over lambda, right? 68:48 That's my theta. 68:49 68:52 So I have g of theta, which is 1 over theta. 68:55 So I'm going to have g prime of 1 over lambda. 68:58 And what is g prime of 1 over lambda? 69:00 69:05 So we said that g prime of x is negative 1 over x squared. 69:09 So it's negative 1 over, 1 over lambda, 69:13 69:17 squared. 69:18 69:21 Which gets squared, which is nice, because g can be decreasing.
69:24 So that would be annoying, to have a negative variance. 69:26 And so g prime of 1 over lambda is negative lambda squared, and so 69:29 what I get eventually is lambda squared up here, 69:33 but then I square it again. 69:36 So this whole thing here becomes what? 69:39 Can somebody tell me what the final result is? 69:41 69:44 Lambda squared, right? 69:45 So it's lambda to the fourth divided by lambda squared. 69:47 69:55 So that's what's written there. 69:59 And now I can just do my good old computation for a-- 70:04 70:10 I can do a good computation for a confidence interval. 70:14 All right, so let's just go from the second line. 70:17 So we know that the absolute value of lambda hat minus lambda 70:21 is less than-- we've done that several times already-- 70:23 q alpha over 2-- 70:25 sorry, I should put alpha over 2 on this thing, right? 70:28 So that's really the quantile of order alpha over 2, times 70:31 lambda divided by square root of n. 70:34 All right, and so that means that my confidence interval 70:39 should be this: 70:42 lambda belongs to lambda hat plus or minus q alpha 70:47 over 2 times lambda divided by root n, right? 70:51 So that's my confidence interval. 70:53 But again, it's not very usable, because-- 70:56 there's still this lambda in there, 70:59 and we don't know how to compute it. 71:02 So now I'm going to request from the audience 71:04 some remedies for this. 71:06 What do you suggest we do? 71:07 71:12 What is the laziest thing I can do? 71:14 71:18 Anybody? 71:19 Yeah. 71:19 AUDIENCE: [INAUDIBLE] 71:21 PHILIPPE RIGOLLET: Replace lambda by lambda hat. 71:23 What justifies me doing this? 71:25 AUDIENCE: [INAUDIBLE] 71:27 PHILIPPE RIGOLLET: Yeah, and Slutsky 71:29 tells me I can actually do it, because Slutsky tells me-- 71:32 where does this lambda come from, right? 71:35 This lambda comes from here. 71:37 That's the one that's here. 71:39 So actually I could rewrite this entire thing 71:41 as square root of n lambda hat minus lambda divided by lambda 71:47 converges to some N(0, 1).
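The convergence just stated, square root of n times (lambda hat minus lambda) over lambda tending to a standard Gaussian, can be checked numerically. A minimal sketch, not part of the lecture; the true lambda, the sample size, and the seed are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 3.0     # true rate lambda (hypothetical value chosen for the sketch)
n = 2_000     # number of observations per experiment
reps = 5_000  # number of repeated experiments

# X_i ~ Exp(lambda), so Tn = Xn bar -> 1/lambda and lambda hat = 1/Tn
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# Delta method claim: sqrt(n) * (lambda hat - lambda) / lambda ~ N(0, 1)
z = np.sqrt(n) * (lam_hat - lam) / lam
print(round(z.mean(), 3), round(z.std(), 3))  # both close to 0 and 1
```

Equivalently, without dividing by lambda, the histogram of sqrt(n) * (lambda hat - lambda) matches N(0, lambda squared), which is the variance computed on the board.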
71:51 Now if I replace this by lambda hat, what I have is 71:55 that this is actually really the original one times 72:01 lambda divided by lambda hat. 72:04 And this converges to N(0, 1), right? 72:07 And now what you're telling me is, well, this guy, 72:10 I know it converges to N(0, 1), and this guy is converging to 1 72:15 by the law of large numbers. 72:16 But this one is converging to 1, which happens to be a constant. 72:19 It converges in probability, so by Slutsky I can actually 72:22 take the product and still maintain my convergence 72:25 in distribution to a standard Gaussian. 72:29 So you can always do this. 72:30 Every time you replace some p by p hat, 72:34 as long as their ratio goes to 1, 72:35 which is going to be guaranteed by the law of large numbers, 72:38 you're actually going to be fine. 72:40 And that's where we're going to use Slutsky a lot. 72:42 When we do plug-in, Slutsky is going to be our friend. 72:46 OK, so we can do this. 72:47 72:51 And that's one way. 72:52 And the other way is to just solve 72:53 for lambda like we did before. 72:56 So the first one we got is actually-- 72:58 I don't know if I still have it somewhere. 73:00 Yeah, that was the one, right? 73:03 So we had 1 over Tn times q, and that's exactly the same 73:08 as what we have here. 73:09 So your solution is actually giving us exactly this guy when 73:12 we actually solve for lambda. 73:14 73:17 So this is what we get. 73:20 Lambda hat. 73:21 We replace lambda by lambda hat, and we 73:24 have our asymptotic confidence interval. 73:27 And that's exactly what we did, and Slutsky's theorem 73:30 at this point is just telling us 73:32 that we can actually do this. 73:36 Are there any questions about what we did here? 73:39 So this derivation right here is exactly what I 73:42 did on the board, as I showed you. 73:44 So let me just show you with a little more space 73:46 just so that we all understand, right?
73:49 So we know that square root of n lambda hat minus lambda divided 73:58 by lambda-- the true lambda-- 74:00 converges to some N(0, 1). 74:04 So that was CLT plus Delta method. 74:07 74:11 Applying those two, we got to here. 74:13 And we know that lambda hat converges 74:17 to lambda in probability and almost surely, and that's what? 74:21 That was the law of large numbers plus the continuous mapping theorem, 74:24 right? 74:25 Because we only knew that 1 over lambda hat 74:27 converges to 1 over lambda. 74:29 So we had to flip those things around. 74:31 And now what I said is that I apply Slutsky, 74:33 so I write square root of n lambda hat minus lambda divided 74:38 by lambda hat, which is the suggestion that was made to me. 74:42 They said, I want this, but I would 74:44 want to show that it converges to some N(0, 74:45 1) so I can legitimately use q alpha over 2 with this one, 74:49 though. 74:50 And the way we said it is like, well, this thing is actually 74:53 really square root of n lambda hat minus lambda divided by lambda, times lambda divided by lambda hat. 75:00 So this thing that was proposed to me, 75:02 I can decompose it into the product 75:03 of those two random variables. 75:05 The first one here converges to the Gaussian 75:09 from the central limit theorem. 75:10 And the second one converges to 1 from this guy, 75:14 but in probability this time. 75:17 75:20 That was the ratio of two things converging in probability, 75:23 so we can actually get it. 75:25 And so now I apply Slutsky. 75:26 75:31 And Slutsky tells me that I can actually do that. 75:34 But when I take the product of this thing that converges 75:36 to some standard Gaussian, and this thing that converges 75:40 in probability to 1, then their product actually 75:43 converges to still this standard Gaussian [INAUDIBLE] 75:48 75:55 Well, that's exactly what's done here, 75:58 and I think I'm getting there. 76:02 So in our case, OK, so just a remark for Slutsky's theorem. 76:07 So that's the last line.
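The plug-in interval that this argument justifies, lambda hat plus or minus q alpha over 2 times lambda hat over root n, can be sanity-checked by simulation: at level alpha = 5%, it should contain the true lambda roughly 95% of the time. A minimal sketch, not part of the lecture; the true lambda, sample size, and repetition count are arbitrary choices, and 1.96 is the familiar Gaussian quantile of order 2.5%.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0     # true rate (unknown in practice; fixed here to measure coverage)
n = 1_000     # observations per experiment
reps = 5_000  # repeated experiments
q = 1.96      # q alpha over 2 for alpha = 5%

# X_i ~ Exp(lambda); lambda hat = 1 / Tn with Tn the sample mean
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# Plug-in interval justified by Slutsky: lambda hat +/- q * lambda hat / sqrt(n)
half = q * lam_hat / np.sqrt(n)
covered = (lam_hat - half <= lam) & (lam <= lam_hat + half)
print(round(covered.mean(), 3))  # empirical coverage, close to 0.95
```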
76:09 So in the first example we used the problem-dependent trick, 76:11 which was to say, well, it turns out 76:13 that we knew that p is between 0 and 1. 76:16 So we had this p times 1 minus p that was annoying to us. 76:18 We just said, let's just bound it by 1/4, 76:21 because that's going to be true for any value of p. 76:23 But here, lambda takes any value between 0 and infinity, 76:26 so we didn't have such a trick. 76:27 Unless we could see that lambda was less 76:29 than something. 76:30 Maybe we know it, in which case we could use that. 76:34 But then in this case, we could actually also 76:36 have used Slutsky's theorem by doing plug-in, right? 76:39 So here this is my p times 1 minus p that's replaced by p hat times 1 76:41 minus p hat. 76:43 And Slutsky justifies it-- we did that 76:45 without really thinking last time. 76:46 But Slutsky actually justifies the fact 76:48 that this is valid, and still allows me to use 76:51 this q alpha over 2 here. 76:52 76:56 All right, so that's the end of this lecture. 76:58 Tonight I will post the next set of slides, chapter two. 77:01 And, well, hopefully the video. 77:04 I'm not sure when it's going to come out. 77:06