00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:20 PHILIPPE RIGOLLET: --124. 00:22 If I were to repeat this 1,000 times, 00:24 so every one of those 1,000 times 00:26 they collect 124 data points and then 00:29 I'd do it again and do it again and again, 00:31 then on average, the number I should get 00:34 should be close to the true parameter that I'm looking for. 00:37 The fluctuations that are due to the fact 00:38 that I get different samples every time 00:40 should somewhat vanish. 00:42 And so what I want is to have a small bias, hopefully a 0 bias. 00:46 If this thing is 0, then we say that the estimator is unbiased. 00:50 01:06 So this is definitely a property that we 01:08 are going to be looking for in an estimator, 01:10 trying to find them to be unbiased. 01:11 But we'll see that it's actually maybe not enough. 01:14 So unbiasedness should not be something 01:16 you lose your sleep over. 01:18 Something that's slightly better is the risk, really 01:21 the quadratic risk, which is expectation of-- 01:33 so if I have an estimator, theta hat, 01:35 I'm going to look at the expectation of theta hat n 01:38 minus theta squared. 01:41 And what we showed last time is that we can actually-- 01:44 by inserting in there, adding and removing 01:46 the expectation of theta hat, we actually 01:49 get something where this thing can 01:50 be decomposed as the square of the bias plus the variance, 01:59 which is just the expectation of theta hat minus its expectation 02:04 squared. 02:06 That came from the fact that when 02:08 I added and removed the expectation of theta hat 02:10 in there, the cross-terms cancel. 02:13 All right.
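The decomposition just described, quadratic risk = bias squared + variance, can be checked numerically. A minimal Python sketch (not part of the lecture; all names and numbers are illustrative), using a deliberately biased estimator, Xn bar over 2, so that both terms are nonzero:

```python
import random

# Monte Carlo check of E[(theta_hat - theta)^2] = bias^2 + variance,
# for the (deliberately biased) estimator theta_hat = mean(X) / 2
# of a Bernoulli parameter theta. Illustrative values only.
random.seed(0)
theta, n, reps = 0.6, 20, 100_000

estimates = []
for _ in range(reps):
    sample = [1 if random.random() < theta else 0 for _ in range(n)]
    estimates.append(sum(sample) / n / 2)   # biased on purpose

mean_est = sum(estimates) / reps
risk = sum((e - theta) ** 2 for e in estimates) / reps
bias2 = (mean_est - theta) ** 2
var = sum((e - mean_est) ** 2 for e in estimates) / reps

# The cross-term cancels, so the identity holds exactly
# (up to floating-point error) for the empirical moments too.
assert abs(risk - (bias2 + var)) < 1e-9
assert abs(mean_est - theta / 2) < 0.005    # bias is theta/2 - theta here
```

The identity holds exactly for the empirical moments, which mirrors the add-and-subtract argument from the lecture: the cross-term vanishes.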
02:14 So that was the bias squared, and this is the variance. 02:19 02:25 And so for example, if the quadratic risk goes to 0, 02:29 then that means that theta hat converges 02:31 to theta in the L2 sense. 02:34 And here we know that if we want this to go to 0, 02:38 since it's the sum of two positive terms, 02:40 we need to have both the bias that goes to 0 02:42 and the variance that goes to 0, so we 02:44 need to control both of those things. 02:46 And so there is usually an inherent trade-off 02:49 between getting a small bias and getting a small variance. 02:53 If you reduce one too much, then the variance of the other one 02:56 is going to-- 02:57 then the other one is going to increase, or the opposite. 02:59 That happens a lot, but not so much, actually, in this class. 03:03 So let's just look at a couple of examples. 03:07 So am I planning-- 03:10 yeah. 03:11 So examples. 03:19 So if I do, for example, X1, ..., Xn, they are iid Bernoulli. 03:26 And I'm going to write it as theta so 03:27 that we keep the same notation. 03:29 Then theta hat, what is the theta hat 03:32 that we proposed many times? 03:33 It's just X bar, Xn bar, the average of the Xi's. 03:38 So what is the bias of this guy? 03:40 Well, to know the bias, I just have to remove theta 03:44 from the expectation. 03:46 What is the expectation of Xn bar? 03:49 Well, by linearity of the expectation, 03:51 it's just the average of the expectations. 03:53 03:57 But since all my Xi's are Bernoulli with the same theta, 04:00 then each of these guys is actually equal to theta. 04:03 So this thing is actually theta, which means 04:06 that this is unbiased, right? 04:07 04:14 Now, what is the variance of this guy? 04:16 04:22 So if you forgot the properties of the variance 04:27 for a sum of independent random variables, 04:29 now it's time to wake up. 04:30 So we have the variance of something 04:34 that looks like 1 over n, the sum from i equal 1 to n of Xi.
04:38 04:41 So it's of the form variance of a constant times 04:45 a random variable. 04:46 So the first thing I'm going to do is pull out the constant. 04:49 But we know that the variance lives on the square scale, 04:52 so when I pull out a constant outside of the variance, 04:54 it comes out with a square. 04:56 The variance of a times X is a-squared 04:59 times the variance of X, so this is equal to 1 05:02 over n squared times the variance of the sum. 05:06 05:10 So now we do what we always want to do. 05:13 So we have the variance of the sum. 05:16 We would like somehow to say that this 05:17 is the sum of the variances. 05:19 And in general, we are not allowed to say that, 05:22 but we are because my Xi's are actually independent. 05:26 So this is actually equal to 1 over n squared sum from i equal 05:30 1 to n of the variance of each of the Xi's. 05:36 And that's by independence, so this is basic probability. 05:42 And now, what is the variance of the Xi's where again they 05:45 all have the same distribution, so the variance of Xi 05:47 is the same as the variance of X1. 05:49 And so each of those guys has variance what? 05:51 What is the variance of a Bernoulli? 05:53 We've said it once. 05:54 It's theta times 1 minus theta. 05:55 06:00 And so now I'm going to have the sum of n times a constant, 06:03 so I get n times the constant divided by n squared, 06:05 so one of the n's is going to cancel. 06:07 And so the whole thing here is actually 06:10 equal to theta(1 minus theta) divided by n. 06:15 06:18 So if I'm interested in the quadratic risk-- 06:20 06:27 and again, I should just say risk, 06:28 because this is the only risk we're going 06:30 to be actually looking at. 06:32 Yeah. 06:32 This parenthesis should really stop here. 06:34 06:38 I really wanted to put quadratic in parentheses. 06:41 So the risk of this guy is what? 06:43 Well, it's the expectation of Xn bar minus theta squared.
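The expectation and variance just derived for Xn bar can be sanity-checked by simulation. A minimal Python sketch (not part of the lecture; the parameter values are illustrative):

```python
import random

# Simulation check for iid Bernoulli(theta):
#   E[Xn bar] = theta   (unbiased)
#   Var(Xn bar) = theta * (1 - theta) / n
random.seed(1)
theta, n, reps = 0.3, 25, 100_000

means = []
for _ in range(reps):
    means.append(sum(1 if random.random() < theta else 0 for _ in range(n)) / n)

m = sum(means) / reps
var_hat = sum((x - m) ** 2 for x in means) / reps

assert abs(m - theta) < 0.005                          # unbiasedness
assert abs(var_hat - theta * (1 - theta) / n) < 5e-4   # variance formula
```

Here theta(1 - theta)/n = 0.21/25 = 0.0084, and the empirical variance of the sample means lands on it up to Monte Carlo error.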
06:50 And we know it's the square of the bias plus the variance, 06:54 so it's the square of the bias, which 06:56 we know is 0, so it's 0 squared plus the variance, which 07:00 is theta(1 minus theta) 07:03 divided by n. 07:07 So it's just theta(1 minus theta) divided by n. 07:14 So this is just summarizing the performance of an estimator, 07:17 which is a random variable. 07:18 I mean, it's complicated. 07:19 If I really wanted to describe it, 07:22 I would just tell you the entire distribution 07:25 of this random variable. 07:27 But now what I'm doing is I'm saying, well, 07:29 let's just take this random variable, remove theta from it, 07:32 and see how small the fluctuations around theta-- 07:36 the squared fluctuations around theta are in expectation. 07:41 So that's what the quadratic risk is doing. 07:43 And in a way, this decomposition, 07:44 as the sum of the bias squared and the variance, 07:46 is really telling you that-- 07:47 it is really accounting for the bias, which is, well, 07:50 even if I had an infinite amount of observations, 07:52 is this thing doing the right thing? 07:54 And the other thing is actually the variance, 07:56 so for a finite number of observations, 07:57 what are the fluctuations? 07:59 All right. 08:00 Then you can see that those things, bias and variance, 08:02 are actually very different. 08:05 So I don't have any colors here, so you're 08:08 going to have to really follow the speed-- 08:12 the order in which I draw those curves. 08:14 All right. 08:14 So let's find-- 08:15 I'm going to give you three candidate estimators, so-- 08:19 08:29 estimators for theta. 08:31 08:35 So the first one is definitely Xn bar. 08:38 That will be a good candidate estimator. 08:40 The second one is going to be 0.5, because after all, 08:45 why should I bother if it's actually going to be-- 08:47 right?
08:47 So for example, if I ask you to predict 08:51 the score of some candidate in some election, 08:54 then since you know it's going to be very close to 0.5, 08:57 you might as well just throw out 0.5 and you're not going 08:59 to be very far from reality. 09:00 And it's actually going to cost you 0 time and $0 09:02 to come up with that. 09:03 So sometimes maybe just a good old guess 09:06 is actually doing the job for you. 09:08 Of course, for presidential elections 09:10 or something like this, it's not very helpful 09:12 if your prediction is telling you this. 09:14 But if it was something different, 09:17 that would be a good way to generate something close to 1/2. 09:21 For a coin, for example, if I give you a coin, 09:23 you never know. 09:24 Maybe it's slightly biased. 09:25 But a good guess, just looking at it, inspecting it, 09:27 maybe there's something crazy happening 09:29 with the structure of it, you're going 09:31 to guess that it's 0.5 without trying to collect information. 09:34 And let's find another one, which is, well, you know, 09:36 I have a lot of observations. 09:38 But I'm recording couples kissing, but I'm on a budget. 09:43 I don't have time to travel all around the world 09:46 and collect some people. 09:47 So really, I'm just going to look at the first couple 09:49 and go home. 09:49 So my other estimator is just going to be X1. 09:53 I just take the first observation, 0 or 1, 09:55 and that's it. 09:57 So now I'm going-- 09:57 I want to actually understand what the behavior of those guys 10:01 is. 10:01 All right. 10:02 So we know-- and so we know that for this guy, the bias is 0 10:09 and the variance is equal to theta(1 minus theta) 10:14 divided by n. 10:19 What is the bias of this guy, 0.5? 10:22 10:28 AUDIENCE: 0.5. 10:29 AUDIENCE: 0.5 minus theta? 10:31 PHILIPPE RIGOLLET: 0.5 minus theta, right. 10:32 10:35 So the bias is 0.5 minus theta. 10:39 What is the variance of this guy? 10:40 10:44 What is the variance of 0.5?
10:46 AUDIENCE: It's 0. 10:47 PHILIPPE RIGOLLET: 0. 10:48 Right. 10:49 It's just a deterministic number, 10:50 so there's no fluctuation for this guy. 10:53 What is the bias? 10:54 Well, X1 is actually-- 10:56 just for simplicity, I can think of it 10:58 as being X1 bar, the average of itself, 11:00 so that wherever I saw an n for this guy, I can replace it by 1 11:03 and that will give me my formula. 11:05 So the bias is still going to be 0. 11:07 And the variance is going to be equal to theta(1 minus theta). 11:10 11:13 So now I have those three estimators. 11:16 Well, if I compare X1 and Xn bar, then 11:19 clearly I have 0 bias in both cases. 11:22 That's good. 11:23 And I have the variance that's actually n times smaller when I 11:27 use my n observations than when I don't. 11:29 So those two guys, on these two fronts, 11:31 you can actually look at the two numbers 11:32 and say, well, the first number is the same. 11:35 The second number is better for the other guy, 11:37 so I will definitely go for this guy compared to this guy. 11:40 So this guy is gone. 11:42 But not this guy. 11:43 Well, if I look at the variance, it's 0. 11:47 It's always beating the variance of this guy. 11:49 And if I look at the bias, it's actually really not that bad. 11:52 It's 0.5 minus theta. 11:53 In particular, if theta is 0.5, then this guy 11:55 is strictly better. 11:57 And so you can actually now look at what 12:00 the quadratic risk looks like. 12:05 So here, what I'm going to do is I'm 12:06 going to take my true theta-- so it's 12:08 going to range between 0 and 1. 12:09 And we know that those two things are functions of theta, 12:12 so I can only understand them if I plot them 12:13 as functions of theta. 12:16 And so now I'm going to actually plot-- 12:18 the y-axis is going to be the risk. 12:20 12:23 So what is the risk of the estimator 0.5? 12:26 This one is easy. 12:27 Well, it's 0 plus the square of 0.5 minus theta.
12:33 So we know that at theta equals 0.5, it's actually going to be 0. 12:37 And then it's going to be a square. 12:39 So at 0, it's going to be 0.25. 12:44 And at 1, it's going to be 0.25 as well. 12:49 So it looks like this. 12:49 Well, actually, sorry. 12:50 Let me put the 0.5 where it should be. 12:52 12:56 OK. 12:57 So this here is the risk of 0.5. 13:03 And we'll write it like this. 13:06 So when theta is very close to 0.5, I'm very happy. 13:09 When theta gets farther, it's a little bit annoying. 13:13 And then here, I want to plot the risk of this guy. 13:16 So now the thing with the risk of this guy 13:18 is that it will depend on n. 13:20 So I will just pick some n that I'm happy with just 13:24 so that I can actually draw a curve. 13:26 Otherwise, I'm going to have to plot one curve per value of n. 13:29 So let's just say, for example, that n is equal to 10. 13:31 And so now I need to plot the function theta(1 minus 13:35 theta) divided by 10. 13:37 We know that theta(1 minus theta) 13:39 is a curve that goes like this. 13:40 At 1/2, 13:42 it takes the value 1/4. 13:43 That's the maximum. 13:44 And then it's 0 at the ends. 13:46 So really, if n is equal to 1, this 13:52 is what the variance looks like. 13:53 The bias doesn't count in the risk. 13:56 Yeah. 13:57 AUDIENCE: [INAUDIBLE] 14:00 PHILIPPE RIGOLLET: Sure. 14:01 Can you move? 14:03 All right. 14:04 Are you guys good? 14:05 14:08 All right. 14:08 So now I have this picture. 14:10 And I know I'm going up to 0.25. 14:12 And there's a place where those curves cross. 14:15 So if you're sure-- 14:16 let's say you're talking about a presidential election, 14:18 you know that those things are going to be really close. 14:20 Maybe you're actually better off predicting 0.5 14:23 if you know it's not going to go too far. 14:25 But that's for one observation, so that's the risk of X1. 14:32 But if I look at the risk of Xn bar, all I'm doing 14:34 is just crushing this curve down to 0.
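The curves being drawn here can also be written down exactly. A short Python sketch (not part of the lecture) encoding the three risks as functions of theta for n = 10, including where the constant 0.5 beats Xn bar; solving (0.5 - theta)^2 < theta(1 - theta)/10 gives the window |theta - 0.5| < 1/(2 sqrt(11)):

```python
import math

# Exact quadratic risks of the three candidate estimators, n = 10.
n = 10

def risk_xbar(theta):   # bias 0, variance theta(1-theta)/n
    return theta * (1 - theta) / n

def risk_half(theta):   # bias 0.5 - theta, variance 0
    return (0.5 - theta) ** 2

def risk_x1(theta):     # bias 0, variance theta(1-theta)
    return theta * (1 - theta)

# Xn bar always (weakly) beats X1:
assert all(risk_xbar(t / 100) <= risk_x1(t / 100) for t in range(101))

# The constant 0.5 wins only on a window around theta = 0.5:
# (0.5 - theta)^2 < theta(1-theta)/10  iff  |theta - 0.5| < 1/(2*sqrt(11)).
half_width = 1 / (2 * math.sqrt(11))
inside = 0.5 + 0.9 * half_width
outside = 0.5 + 1.1 * half_width
assert risk_half(inside) < risk_xbar(inside)
assert risk_half(outside) > risk_xbar(outside)
```

As n grows, half_width shrinks like 1/sqrt(n + 1), which is the "crushing the curve down to 0" picture: the window where the naive guess 0.5 wins gets narrower and narrower.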
14:38 So as n increases, it's going to look more and more like this. 14:42 It's the same curve divided by n. 14:44 14:48 And so now I can just start to understand 14:50 that for different values of thetas, 14:52 now I'm going to have to be very close to theta is equal to 1/2 14:56 if I want to start saying that Xn bar is worse 14:58 than the naive estimator 0.5. 15:03 Yeah. 15:04 AUDIENCE: Sorry. 15:04 I know you explained a little bit before, but can you just-- 15:08 what is an intuitive definition of risk? 15:11 What is it actually describing? 15:13 PHILIPPE RIGOLLET: So either you can-- 15:16 well, when you have an unbiased estimator, it's simple. 15:18 It's just telling you it's the variance, 15:20 because the theta that you have over there is really-- so 15:23 in the definition of the risk, the theta 15:26 that you have here if you're unbiased 15:28 is really the expectation of theta hat. 15:31 So that's really just the variance. 15:33 So the risk is really telling you 15:35 how much fluctuations I have around my expectation 15:39 if unbiased. 15:39 But actually here, it's telling you how much fluctuations 15:42 I have in average around theta. 15:44 So if you understand the notion of variance as being-- 15:47 AUDIENCE: [INAUDIBLE] 15:47 PHILIPPE RIGOLLET: What? 15:48 AUDIENCE: Like variance on average. 15:49 PHILIPPE RIGOLLET: No. 15:49 AUDIENCE: No. 15:50 PHILIPPE RIGOLLET: It's just like variance. 15:51 AUDIENCE: Oh, OK. 15:52 PHILIPPE RIGOLLET: So when you-- 15:53 I mean, if you claim you understand what variance is, 15:56 it's telling you what is the expected 15:58 squared fluctuation around the expectation 16:00 of my random variable. 16:01 It's just telling you on average how far I'm going to be. 16:04 And you take the square because you want to cancel the signs. 16:06 Otherwise, you're going to get 0. 16:07 AUDIENCE: Oh, OK. 16:07 PHILIPPE RIGOLLET: And here it's saying, well, 16:08 I really don't care what the expectation of theta hat is. 
16:11 What I want to get to is theta, so I'm 16:13 looking at the expectation of the squared fluctuations 16:15 around theta itself. 16:16 If I'm unbiased, it coincides with the variance. 16:19 But if I'm biased, then I have to account for the fact 16:21 that I'm really not computing the-- 16:23 AUDIENCE: OK. 16:23 OK. 16:24 Thanks. 16:25 PHILIPPE RIGOLLET: OK? 16:27 All right. 16:28 Are there any questions? 16:29 So here, what I really want to illustrate 16:31 is that the risk itself is a function 16:33 of theta most of the time. 16:34 And so for different thetas, some estimators 16:35 are going to be better than others. 16:37 But there's also the entire range 16:38 of estimators, those that are really biased, 16:41 but the bias can completely vanish. 16:44 And so here, you see you have no bias, 16:47 but the variance can be large. 16:48 Or you have 0 bias-- 16:50 you have a bias, but the variance is 0. 16:52 So you can actually have this trade-off 16:54 and you can find things that are in the entire range in general. 16:58 17:01 So those things are actually-- those trade-offs 17:05 between bias and variance are usually much better illustrated 17:10 if we're talking about multivariate parameters. 17:12 If I actually look at a parameter which 17:14 is the mean of some multivariate Gaussian, so an entire vector, 17:19 then the bias is going to-- 17:20 I can make the bias bigger by, for example, 17:23 forcing all the coordinates of my estimator to be the same. 17:26 So here, I'm going to get some bias, 17:27 but the variance is actually going 17:29 to be much better, because I get to average all 17:31 the coordinates for this guy. 17:32 And so really, the bias/variance trade-off 17:35 is when you have multiple parameters to estimate, 17:38 so you have a vector of parameters, 17:40 a multivariate parameter, the bias 17:42 increases when you're trying to pool more information 17:45 across the different components to actually have 17:49 a lower variance.
17:50 So the more you average, the lower the variance. 17:53 That's exactly what we've illustrated. 17:54 As n increases, the variance decreases, 17:56 like 1 over n, or theta(1 minus theta) over n. 17:59 And so this is how it happens in general. 18:01 In this class, it's mostly one-dimensional parameter 18:03 estimation, so it's going to be a little harder to illustrate 18:06 that. 18:06 But if you do, for example, non-parametric estimation, 18:09 that's all you do. 18:10 There's just bias/variance trade-offs all the time. 18:14 And in between, when you have high-dimensional parametric 18:16 estimation, that happens a lot as well. 18:20 OK. 18:21 So I'm just going to go quickly through those two remaining 18:25 slides, because we've actually seen them. 18:26 But I just wanted you to have somewhere a formal definition 18:29 of what a confidence interval is. 18:32 And so we fix a statistical model for n observations, X1 18:37 to Xn. 18:38 The parameter theta here is one-dimensional. 18:42 Theta is a subset of the real line, 18:44 and that's why I talk about intervals. 18:47 An interval is a subset of the line. 18:48 If I had a subset of R2, for example, 18:51 that would no longer be called an interval, but a region, 18:54 just because-- well, we could just say a set, 18:57 a confidence set. 18:59 But people like to say confidence region. 19:01 So an interval is just a one-dimensional confidence 19:04 region. 19:04 And it has to be an interval as well. 19:07 So a confidence interval of level 1 minus alpha-- 19:11 so the quality of a confidence interval 19:16 is referred to as its level. 19:18 It takes value 1 minus alpha for some positive alpha. 19:21 And so the confidence level-- 19:23 the level of the confidence interval is between 0 and 1. 19:26 The closer to 1 it is, the better the confidence interval. 19:29 The closer to 0, the worse it is. 19:32 And so for any random interval-- so 19:34 a confidence interval is a random interval.
19:37 The bounds of this interval depend on random data. 19:41 Just like we had X bar plus/minus 19:44 1 over square root of n, for example, or 2 19:46 over square root of n, this X bar 19:49 was the random thing that would make those guys fluctuate. 19:53 And so now I have an interval. 19:54 And now I have its boundaries, but now the boundaries 19:56 are not allowed to depend on my unknown parameter. 19:58 Otherwise, it's not a confidence interval, 20:00 just like an estimator that depends 20:02 on the unknown parameter is not an estimator. 20:04 The confidence interval has to be something 20:06 that I can compute once I collect data. 20:10 And so what I want is that-- so there's this weird notation. 20:14 The fact that I write theta-- 20:17 that's the probability that I contains theta. 20:19 You're used to seeing theta belongs to I. 20:23 But here, I really want to emphasize 20:24 that the randomness is in I. And so the way 20:26 you actually say it when you read 20:28 this formula is the probability that I contains theta 20:32 is at least 1 minus alpha. 20:36 So it better be close to 1. 20:39 You want 1 minus alpha to be very close to 1, 20:41 because it's really telling you that whatever 20:43 random variable I'm giving you, my error bars are actually 20:46 covering the right theta. 20:49 And I want this to be true. 20:50 But I want this-- since I don't know 20:52 what my confidence-- my parameter theta 20:54 is, I want this to hold true for all possible values 20:58 of the parameter that nature may have come up with. 21:02 So I want this-- so there's theta that changes here, 21:05 so the distribution of the interval 21:06 is actually changing with theta, hopefully. 21:08 And theta is changing with this guy. 21:11 So regardless of the value of theta that I'm getting, 21:13 I want that the probability that it contains the theta 21:17 is actually larger than 1 minus alpha. 21:20 So I'll come back to it in a second.
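The coverage statement "the probability that I contains theta is at least 1 minus alpha" can be estimated by simulation: draw many samples, build the interval each time, and count how often it contains the true theta. A Python sketch (not from the lecture), using the conservative 95% interval for a Bernoulli mean with illustrative numbers:

```python
import math
import random

# Empirical coverage of the conservative level-95% interval
#   Xn bar +/- 1.96 * sqrt(1/4) / sqrt(n)
# for Bernoulli(theta) data. theta, n, reps are illustrative.
random.seed(2)
theta, n, reps = 0.1, 100, 20_000
half = 1.96 * 0.5 / math.sqrt(n)   # half-width, using p(1-p) <= 1/4

covered = 0
for _ in range(reps):
    xbar = sum(1 if random.random() < theta else 0 for _ in range(n)) / n
    if xbar - half <= theta <= xbar + half:   # did I contain theta?
        covered += 1

coverage = covered / reps
# Conservative interval: for theta far from 1/2, empirical coverage
# sits well above the nominal 95%.
assert coverage >= 0.95
```

Note the probability is over the random interval, not over theta: each repetition gives a different realized interval, and "coverage" is the fraction of those intervals that caught the fixed true theta.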
21:22 I just want to say that here, we can 21:23 talk about asymptotic level. 21:25 And that's typically when you use the central limit 21:27 theorem to compute this guy. 21:29 Then you're not guaranteed that the value is 21:32 at least 1 minus alpha for every n, 21:35 but it's actually in the limit larger than 1 minus alpha. 21:40 So maybe for each fixed n it's going to be not true. 21:43 But as n goes to infinity, it's 21:44 actually going to become true. 21:46 If you want this to hold for every n, 21:49 you actually need to use things such as Hoeffding's inequality 21:51 that we described at some point, which holds for every n. 21:55 So as a rule of thumb, if you use the central limit theorem, 22:00 you're dealing with a confidence interval 22:01 with asymptotic level 1 minus alpha. 22:04 And the reason is because you actually 22:05 want to get the quantiles of the normal-- the Gaussian 22:10 distribution that comes from the central limit theorem. 22:13 And if you want to use Hoeffding's, for example, 22:15 you might actually get away with a confidence interval that's 22:18 actually true even non-asymptotically. 22:20 It's just the regular confidence interval. 22:22 22:24 So this is the formal definition. 22:26 It's a bit of a mouthful. 22:28 But we actually-- the best way to understand them 22:30 is to build them. 22:31 Now, at some point I said-- 22:33 and I think it was part of the homework-- 22:35 22:38 so here, I really say the probability 22:39 the true parameter belongs to the confidence interval 22:42 is actually 1 minus alpha. 22:44 And so that's because here, this confidence interval 22:47 is still a random variable. 22:48 Now, if I start plugging in numbers instead 22:50 of the random variables X1 to Xn, 22:52 I start putting 1, 0, 0, 1, 0, 0, 1, 22:55 like I did for the kiss example, then in this case, 22:58 the random interval is actually going to be 0.42, 0.65. 23:03 And this guy, the probability that theta belongs to it 23:05 is not 1 minus alpha.
23:07 It's either 0 if it's not in there 23:10 or it's 1 if it's in there. 23:11 23:16 So here is the example that we had. 23:19 So let's just look back into our favorite example, which 23:24 is the average of Bernoulli random variables, 23:26 so we've studied that maybe that's the third time already. 23:30 So the sample average, Xn bar, is a strongly consistent 23:34 estimator of p. 23:35 That was one of the properties that we wanted. 23:37 Strongly consistent means that as n goes to infinity, 23:40 it converges almost surely to the true parameter. 23:42 That's the strong law of large numbers. 23:44 It is consistent also, because it's strongly consistent, 23:47 so it also converges in probability, 23:49 which makes it consistent. 23:52 It's unbiased. 23:53 We've seen that. 23:53 We've actually computed its quadratic risk. 23:57 And now what I have is that if I look at-- 24:00 thanks to the central limit theorem, we actually did this. 24:02 We built a confidence interval at level 1 minus alpha-- 24:08 asymptotic level, sorry, asymptotic level 1 minus alpha. 24:12 And so here, this is how we did it. 24:15 Let me just go through it again. 24:17 So we know from the central limit theorem-- 24:19 24:28 so the central limit theorem tells us 24:31 that square root of n times Xn bar minus p, divided 24:38 by square root of p(1 minus p), converges in distribution as n 24:41 goes to infinity to some standard normal distribution. 24:47 So what it means is that if I look at the probability 24:49 under the true p that square root of n times Xn bar 24:53 minus p, divided by square root of p(1 minus p), 25:03 is less than q alpha over 2 in absolute value, where this is 25:06 the definition of the quantile. 25:07 Then this guy-- and I'm actually going to use the same notation, 25:11 limit as n goes to infinity, this is the same thing. 25:17 So this is actually going to be equal to 1 minus alpha. 25:22 That's exactly what I did last time.
25:25 This is by definition of the quantile of a standard Gaussian 25:28 and of a limit in distribution. 25:32 So the probability computed on this guy in the limit converges 25:36 to the probability computed on this guy. 25:38 And we know that this is just the probability 25:40 that the absolute value of some N(0, 1) 25:42 is less than q alpha over 2. 25:44 25:47 And so in particular, if it's equal, 25:50 then I can put some larger than or equal to, 25:54 which guarantees my asymptotic confidence level. 25:57 And I just solve for p. 25:59 So this is equivalent to the limit 26:03 as n goes to infinity of the probability 26:07 that p is between Xn bar minus q 26:15 alpha over 2 26:21 times square root of p(1 minus p) divided by square root of n, and Xn 26:26 bar plus q alpha over 2 times square root of p(1 minus p) 26:33 divided by square root of n, being larger than or equal 26:37 to 1 minus alpha. 26:39 And so there you go. 26:39 I have my confidence interval. 26:43 Except that it's not, right? 26:45 We just said that the bounds of a confidence interval 26:48 may not depend on the unknown parameter. 26:50 And here, they do. 26:52 And so we actually came up with two ways 26:54 of getting rid of this. 26:55 Since we only need this thing-- so this thing, as we said, 26:58 is really equal. 26:59 Every time I make this guy smaller 27:01 and this guy larger, I'm only going 27:03 to increase the probability. 27:05 And so what we do is we actually just take 27:06 the largest possible value for p(1 minus 27:08 p), which makes the interval as large as possible. 27:13 And so now I have this. 27:15 I just do one of the two tricks. 27:17 I replace p(1 minus p) by its upper bound, which is 1/4. 27:22 27:25 As we said, p(1 minus p), the function looks like this. 27:28 So I just take the value here at 1/2. 27:31 Or, I can use Slutsky and say that if I replace p by Xn bar, 27:37 that's the same as just replacing p by Xn bar here.
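The two tricks just described can be written down directly. A Python sketch (not from the lecture; the sample-mean value below is illustrative, not data from the course):

```python
import math

# Two ways to remove the unknown p from the interval
#   Xn bar +/- q_{alpha/2} * sqrt(p(1-p)/n):
# (a) bound p(1-p) by its maximum 1/4 (conservative),
# (b) plug in Xn bar for p (justified by Slutsky).
# q = 1.96 is the standard Gaussian quantile for alpha = 0.05.

def conservative_ci(xbar, n, q=1.96):
    half = q * math.sqrt(0.25 / n)
    return xbar - half, xbar + half

def plugin_ci(xbar, n, q=1.96):
    half = q * math.sqrt(xbar * (1 - xbar) / n)
    return xbar - half, xbar + half

# Illustrative numbers: n = 124 observations, sample mean 0.645.
lo_c, hi_c = conservative_ci(0.645, 124)
lo_p, hi_p = plugin_ci(0.645, 124)

# The plug-in interval is never wider, since xbar(1 - xbar) <= 1/4.
assert hi_p - lo_p <= hi_c - lo_c
```

Both intervals are computable from the data alone, which is exactly the requirement that the bounds not depend on the unknown parameter.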
27:40 27:45 And by Slutsky, we know that this is actually converging 27:48 also to some standard Gaussian. 27:50 27:59 We've seen that when we saw Slutsky as an example. 28:04 And so those two things-- actually, 28:05 just because I'm taking the limit 28:07 and I'm only caring about the asymptotic confidence level, 28:10 I can actually just plug in consistent quantities in there, 28:13 such as Xn bar where I don't have a p. 28:15 And that gives me another confidence interval. 28:18 All right. 28:19 So this by now, hopefully after doing it three times, 28:24 you should really, really be comfortable with just creating 28:28 this confidence interval. 28:29 We did it three times in class. 28:31 I think you probably did it another couple times 28:33 in your homework. 28:34 So just make sure you're comfortable with this. 28:36 All right. 28:37 That's one of the basic things you would want to know. 28:39 Are there any questions? 28:41 Yes. 28:42 AUDIENCE: So Slutsky holds for any single response set p. 28:46 But Xn converges [INAUDIBLE]. 28:48 28:52 PHILIPPE RIGOLLET: So that's not Slutsky, right? 28:55 AUDIENCE: That's [INAUDIBLE]. 28:58 PHILIPPE RIGOLLET: So Slutsky tells you that if you-- 29:04 Slutsky's about combining two types of convergence. 29:06 So Slutsky tells you that if you actually 29:08 have one Xn that converges to X in distribution and Yn 29:13 that converges to Y in probability, then 29:16 you can actually multiply Xn and Yn 29:18 and get that the limit in distribution 29:20 is the product of X and Y, where X is now a constant. 29:28 And here we have the constant, which is 1. 29:32 But I did that already, right? 29:35 Using Slutsky to replace it for the-- 29:37 to replace P by Xn bar, we've done 29:40 that last time, maybe a couple of times ago, actually. 29:44 Yeah. 29:45 AUDIENCE: So I guess these statements are [INAUDIBLE].. 29:49 PHILIPPE RIGOLLET: That's correct. 
29:51 AUDIENCE: So could we like figure out [INAUDIBLE] 29:53 can we set a finite [INAUDIBLE]. 29:58 PHILIPPE RIGOLLET: So of course, the short answer is no. 30:00 30:04 So here's how you would go about thinking 30:06 about which method is better. 30:08 So there's always the more conservative method. 30:10 With the first one, the only thing you're losing 30:13 is the rate of convergence of the central limit theorem. 30:16 So if n is large enough so that the central limit theorem 30:19 approximation is very good, then that's all you're 30:22 going to be losing. 30:24 Of course, the price you pay is that your confidence interval 30:27 is wider than it would be if you were 30:28 to use Slutsky for this particular problem, 30:31 typically wider. 30:32 Actually, it is always wider, because Xn bar times 30:37 (1 minus Xn bar) is always less than 1/4 as well. 30:41 And so that's the first thing you-- 30:45 so with Slutsky, basically, you're relying on the central limit-- 30:51 you're relying on the asymptotics again. 30:53 Now of course, you don't want to be conservative, 30:56 because you actually want to squeeze as much from your data 30:59 as you can. 30:59 So it depends on how comfortable you are and how critical it is for you 31:04 to put valid error bars. 31:06 If they're valid in the asymptotics, 31:07 then maybe you're actually going to go with Slutsky 31:09 so it actually gives you slightly narrower confidence 31:11 intervals and so you feel like you're a little more-- 31:16 you have a more precise answer. 31:17 Now, if you really need to be super-conservative, 31:19 then you're actually going to go with the p(1 minus p) bound. 31:23 Actually, if you need to be even more conservative, 31:25 you are going to go with Hoeffding's so you don't even 31:28 have to rely on the asymptotic level at all. 31:31 But then your confidence interval 31:32 becomes twice as wide, and it becomes wider and wider as you go.
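The "wider and wider" ordering can be made concrete by comparing half-widths at level 95%. A Python sketch (not from the lecture; n and xbar are illustrative), using the standard Hoeffding half-width sqrt(log(2/alpha)/(2n)):

```python
import math

# Half-widths of the three 95% intervals for a Bernoulli mean:
#   plug-in (Slutsky):  1.96 * sqrt(xbar(1-xbar)/n)   (asymptotic)
#   conservative CLT:   1.96 * sqrt(1/(4n))           (asymptotic)
#   Hoeffding:          sqrt(log(2/alpha) / (2n))     (valid for every n)
n, alpha, xbar = 100, 0.05, 0.3

w_plugin = 1.96 * math.sqrt(xbar * (1 - xbar) / n)
w_conserv = 1.96 * math.sqrt(0.25 / n)
w_hoeffding = math.sqrt(math.log(2 / alpha) / (2 * n))

# Each step of extra safety costs width.
assert w_plugin <= w_conserv <= w_hoeffding
```

With these numbers the three half-widths come out near 0.090, 0.098, and 0.136: the more guarantees you insist on, and the less you lean on the asymptotics, the wider the interval you have to report.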
31:37 So it depends on-- 31:39 I mean, a lot of the art in statistics 31:41 is gauging how critical it is for you to output 31:46 valid error bounds or if they're really just here 31:48 to be indicative of the precision of the estimator you 31:51 gave from a more qualitative perspective. 31:55 AUDIENCE: So the error there is [INAUDIBLE]?? 31:57 PHILIPPE RIGOLLET: Yeah. 31:58 So here, there's basically a bunch of errors. 32:01 There's one that's-- so there's a theorem called Berry-Esseen 32:04 that quantifies how far this probability is from 1 minus 32:09 alpha, but the constants are terrible. 32:12 So it's not very helpful, but it tells you 32:14 as n grows how this thing 32:17 becomes smaller. 32:18 And then for Slutsky, again you're 32:20 multiplying something that converges by something that 32:22 fluctuates around 1, so you need to understand 32:24 how this thing fluctuates. 32:25 Now, there's something that shows up. 32:28 Basically, what is the slope of the function 1 32:31 over square root of x(1 minus x) around the value 32:36 you're interested in? 32:37 And so if this function is super-sharp, 32:39 then small fluctuations of Xn bar around its expectation 32:43 are going to lead to really high fluctuations 32:45 of the function itself. 32:47 So if you're looking at-- 32:49 if you have f of Xn bar and f around, say, the true p, 32:55 if f is really sharp like that, then 32:58 if you move a little bit here, then you're 33:00 going to move really a lot on the y-axis. 33:03 So that's what the function here-- the function 33:05 you're interested in is 1 over square root of x(1 minus x). 33:09 So what does this function look like around the point where you 33:11 think p, the true parameter, is? 33:14 33:17 Its derivative really is what matters. 33:19 OK? 33:21 Any other questions? 33:22 33:24 OK. 33:25 So it's important, because now we're 33:26 going to switch to the real let's-do-some-hardcore 33:29 computation type of things. 33:31 All right.
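The slope point made just above can be checked directly: f(x) = 1/sqrt(x(1 - x)) is flat at x = 1/2 and steep near the edges, so the same small fluctuation in Xn bar moves the plug-in quantity far more when p is close to 0 or 1. A quick Python sketch (not from the lecture), comparing numerical slopes:

```python
import math

# Numerical slope of f(x) = 1 / sqrt(x(1-x)) at two points:
# near x = 1/2 (flat, by symmetry) and near x = 0.1 (steep).
def f(x):
    return 1 / math.sqrt(x * (1 - x))

eps = 0.01
slope_mid = abs(f(0.5 + eps) - f(0.5 - eps)) / (2 * eps)
slope_edge = abs(f(0.1 + eps) - f(0.1 - eps)) / (2 * eps)

# Fluctuations of Xn bar are amplified by f much more near the edge.
assert slope_mid < 0.1
assert slope_edge > 10
```

This is the delta-method intuition behind the remark: the derivative of the function you plug Xn bar into controls how its fluctuations propagate.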
33:32 33:36 So in this chapter, we're going to talk about maximum 33:39 likelihood estimation. 33:40 33:44 Who has already seen maximum likelihood estimation? 33:49 OK. 33:50 And who knows what a convex function is? 33:55 OK. 33:56 So we'll do a little bit of reminders on those things. 34:00 So here's the thing: when we do maximum likelihood estimation, 34:04 the likelihood is a function, so we need to maximize a function. 34:07 That's basically what we need to do. 34:09 And if I give you a function, you 34:10 need to know how to maximize this function. 34:12 Sometimes, you have closed-form solutions. 34:14 You can take the derivative and set it equal to 0 and solve it. 34:18 But sometimes, you actually need to resort to algorithms 34:21 to do that. 34:21 And there's an entire industry doing that. 34:25 And we'll briefly touch upon it, but this is definitely 34:27 not the focus of this class. 34:30 OK. 34:31 So before diving directly into the definition 34:34 of the likelihood and what is the definition 34:36 of the maximum likelihood estimator, what 34:38 I'm going to try to do is to give you 34:41 an insight for what we're actually doing when we do 34:45 maximum likelihood estimation. 34:48 So remember, we have a model on a sample space E 34:53 and some candidate distributions P theta. 34:57 And really, your goal is to estimate a true theta 35:00 star, the one that generated some data, X1 to Xn, 35:04 in an iid fashion. 35:06 But this theta star is really a proxy for us 35:08 to know that we actually understand 35:10 the distribution itself. 35:12 The goal of knowing theta star is so that you can actually 35:15 know what P theta star is. 35:17 Otherwise, it has-- well, sometimes we 35:19 said it has some meaning itself, but really you 35:21 want to know what the distribution is. 35:23 And so your goal is to actually come up with a distribution-- 35:27 hopefully one that comes from the family P theta-- 35:30 that's close to P theta star. 
35:33 So in a way, what does it mean to have two distributions that 35:38 are close? 35:39 It means that when you compute probabilities 35:41 on one distribution, you should have 35:43 the same probability on the other distribution pretty much. 35:46 So what we can do is say, well, now I 35:49 have two candidate distributions. 35:51 35:59 So if theta hat leads to a candidate distribution P theta 36:03 hat, and this is the true theta star, 36:06 it leads to the true distribution P theta star 36:08 according to which my data was drawn. 36:11 That's my candidate. 36:12 36:16 As a statistician, I'm supposed to come up 36:18 with a good candidate, and this is the truth. 36:20 36:23 And what I want is that if you actually give me 36:26 the distribution, then I want, when 36:30 I'm computing probabilities for this guy, 36:31 to know what the probabilities for the other guy are. 36:34 And so really what I want is that if I compute a probability 36:40 under theta hat of some interval a, b, 36:44 it should be pretty close to the probability 36:46 under theta star of a, b. 36:51 And more generally, if I want to take 36:53 the union of two intervals, I want this to be true. 36:55 If I take just half-lines, I want this to be true from 0 36:58 to infinity, for example, things like this. 37:00 I want this to be true for all of them at once. 37:03 And so what I do is that I write A for a probability event. 37:07 And I want that P hat of A is close to P star of A 37:11 for any event A in the sample space. 37:15 Does that sound like a reasonable goal 37:17 for a statistician? 37:18 So in particular, if I want those to be close, 37:20 I want the absolute value of their difference 37:22 to be close to 0. 37:23 37:26 And this turns out to be-- 37:28 if I want this to hold for all possible A's, I 37:31 have all possible events, so I'm going to actually maximize over 37:35 these events. 
37:36 And I'm going to look at the worst 37:37 possible event on which theta hat can depart from theta star. 37:41 And so rather than defining it specifically 37:43 for theta hat and theta star, I'm 37:44 just going to say, well, if you give me two probability 37:47 measures, P theta and P theta prime, 37:51 I want to know how close they are. 37:53 Well, if I want to measure how close they 37:55 are by how they can differ when I measure 37:58 the probability of some event, I'm 38:01 just looking at the absolute value of the difference 38:04 of the probabilities and I'm just 38:06 maximizing over the worst possible event that might 38:09 actually make them differ. 38:11 Agreed? 38:13 That's a pretty strong notion. 38:14 So if the total variation between theta and theta prime 38:17 is small, it means that for all possible A's that you give me, 38:22 then P theta of A is going to be close to P 38:25 theta prime of A, because if-- 38:30 let's say I just found the bound on the total variation 38:33 distance, which is 0.01. 38:41 All right. 38:42 So that means that this is going to be larger 38:46 than the max over A of P theta minus P theta prime of A, 39:00 which means that for any A-- 39:04 actually, let me write P theta hat and P theta star, 39:06 like we said, theta hat and theta star. 39:10 And so if I have a bound, say, on the total variation, 39:12 which is 0.01, that means that P theta hat-- 39:19 every time I compute a probability on P theta hat, 39:23 it's basically in the interval P theta star of A, 39:29 the one that I really wanted to compute, plus or minus 0.01. 39:34 This has nothing to do with confidence interval. 39:36 This is just telling me how far I 39:38 am from the value of actually trying to compute. 39:41 And that's true for all A. And that's key. 39:44 That's where this max comes into play. 39:47 It just says, I want this bound to hold 39:49 for all possible A's at once. 
39:50 39:55 So this is actually a very well-known distance 39:58 between probability measures. 39:59 It's the total variation distance. 40:00 It's extremely central to probabilistic analysis. 40:04 And it essentially tells you that every time-- 40:07 if two probability distributions are close, 40:09 then it means that every time I compute a probability 40:11 under P theta but I really actually 40:15 have data from P theta prime, then 40:17 the error is no larger than the total variation. 40:21 OK. 40:23 So this is maybe not the most convenient way 40:29 of finding a distance. 40:30 I mean, how are you going-- 40:32 in reality, how are you to compute this maximum 40:34 over all possible events? 40:35 I mean, it's just crazy, right? 40:36 There's an infinite number of them. 40:38 It's much larger than the number of intervals, for example, 40:41 so it's a bit annoying. 40:43 And so there's actually a way to compress it 40:46 by just looking at the basically function distance or vector 40:50 distance between probability mass functions or probability 40:53 density functions. 40:55 So I'm going to start with the discrete version 40:58 of the total variation. 40:59 So throughout this chapter, I will 41:03 make the difference between discrete random variables 41:05 and continuous random variables. 41:07 It really doesn't matter. 41:08 All it means is that when I talk about discrete, 41:10 I will talk about probability mass functions. 41:12 And when I talk about continuous, 41:13 I will talk about probability density functions. 41:16 When I talk about probability mass functions, 41:20 I talk about sums. 41:21 When I talk about probability density functions, 41:24 I talk about integrals. 41:26 But they're all the same thing, really. 41:30 So let's start with the probability mass function. 41:32 Everybody remembers what the probability mass 41:34 function of a discrete random variable is. 
41:37 This is the function that tells me for each possible value 41:42 that it can take, the probability 41:43 that it takes this value. 41:46 So the Probability Mass Function, PMF, 41:53 is just the function that, for all x in the sample space, 41:57 tells me the probability that my random variable is 42:01 equal to this little value. 42:03 And I will denote it by P sub theta of X. 42:09 So what I want is, of course, that the sum 42:10 of the probabilities is 1. 42:12 42:17 And I want them to be non-negative. 42:20 Actually, typically we will assume that they are positive. 42:23 Otherwise, we can just remove this x from the sample space. 42:27 And so then for the total variation distance, I mean, 42:31 it's supposed to be the maximum over 42:35 all subsets A of E of the probability 42:39 under theta of A minus the probability under theta prime of A-- 42:43 it's complicated, but really there's 42:44 this beautiful formula that tells me 42:46 that if I look at the total variation between P theta 42:50 and P theta prime, it's actually equal to just 1/2 42:54 of the sum for all x in E of the absolute difference between P 43:04 theta of x and P theta prime of x. 43:12 So that's something you can compute. 43:13 If I give you two probability mass functions, 43:16 you can compute this immediately. 43:19 But if I give you just the densities 43:24 and the original definition, 43:26 where you have to max over all possible events, 43:28 it's not clear you're going to be 43:29 able to do that very quickly. 43:31 So this is really the one you can work with. 43:35 But the other one is really telling you 43:36 what it is doing for you. 43:37 It's controlling the difference of probabilities 43:39 you can compute on any event. 43:41 But here, it's just telling you, well, 43:42 just do it for each simple event, little x. 43:46 It's actually simple. 
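The 1/2-sum formula on the slide translates directly into code. A minimal sketch, using two hypothetical Bernoulli PMFs (for two Bernoullis the formula collapses to the absolute difference of the parameters):

```python
def tv_discrete(p, q):
    """Total variation between two PMFs given as dicts {x: prob}:
    TV(P, Q) = (1/2) * sum over x of |p(x) - q(x)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Two Bernoulli PMFs with illustrative parameters.
theta, theta_prime = 0.3, 0.5
p = {0: 1 - theta, 1: theta}
q = {0: 1 - theta_prime, 1: theta_prime}
print(round(tv_discrete(p, q), 10))  # 0.2, i.e. |theta - theta_prime|
```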
43:49 Now, if we have continuous random variables-- so 43:53 by the way, I didn't mention, but discrete means Bernoulli. 43:56 Binomial, but not only those that have finite support, 43:59 like Bernoulli has support of size 2, 44:02 binomial NP has support of size n-- 44:05 there's n possible values it can take-- but also Poisson. 44:08 Poisson distribution can take an infinite number 44:10 of values, all the positive integers, 44:13 non-negative integers. 44:16 And so now we have also the continuous ones, 44:18 such as Gaussian, exponential. 44:19 And what characterizes those guys is that they 44:21 have a probability density. 44:24 So the density, remember the way I 44:26 use my density is when I want to compute 44:28 the probability of belonging to some event A. 44:31 The probability of X falling to some subset of the real line A 44:37 is simply the integral of the density on this set. 44:40 That's the famous area under the curve thing. 44:43 So since for each possible value, the probability at X-- 44:49 so I hope you remember that stuff. 44:51 That's just probably something that you 44:57 must remember from probability. 44:59 But essentially, we know that the probability that X is equal 45:02 to little x is 0 for a continuous random variable, 45:04 for all possible X's. 45:06 There's just none of them that actually gets weight. 45:09 So what we have to do is to describe the fact that it's 45:11 in some little region. 45:12 So the probability that it's in some interval, say, a, b, this 45:18 is the integral between A and B of f theta of X, dx. 45:25 So I have this density, such as the Gaussian one. 45:28 And the probability that I belong to the interval a, 45:30 b is just the area under the curve between A and B. 45:36 If you don't remember that, please take immediate remedy. 45:43 So this function f, just like P, is non-negative. 45:48 And rather than summing to 1, it integrates to 1 45:51 when I integrate it over the entire sample space E. 
45:55 And now the total variation, well, it 45:56 takes basically the same form. 45:58 I said that you essentially replace sums 46:00 by integrals when you're dealing with densities. 46:03 And here, it's just saying, rather than having 46:05 1/2 of the sum of the absolute values, 46:07 you have 1/2 of the integral of the absolute value 46:09 of the difference. 46:11 Again, if I give you two densities 46:15 and if you're not too bad at calculus, which you will often 46:18 be, because there's lots of them you can actually not compute. 46:21 But if I gave you, for example, two Gaussian densities, 46:24 exponential minus x squared, blah, blah, blah, and I say, 46:27 just compute the total variation distance, 46:29 you could actually write it as an integral. 46:30 Now, whether you can actually reduce this integral 46:33 to some particular number is another story. 46:35 But you could technically do it. 46:38 So now, you have actually a handle on this thing 46:41 and you could technically ask Mathematica, 46:43 whereas asking Mathematica to take 46:45 the max over all possible events is going to be difficult. 46:48 All right. 46:48 So the total variation has some properties. 46:55 So let's keep on the board the definition that 46:59 involves, say, the densities. 47:05 So think Gaussian in your mind. 47:06 And you have two Gaussians, one with mean theta 47:09 and one with mean theta prime. 47:10 And I'm looking at the total variation between those two 47:13 guys. 47:14 So if I look at P theta minus-- 47:20 sorry. 47:20 TV between P theta and P theta prime, this 47:25 is equal to 1/2 of the integral between f theta, f theta prime. 47:31 And when I don't write it-- 47:32 so I don't write the X, dx but it's there. 47:34 And then I integrate over E. 47:38 So what is this thing doing for me? 47:39 It's just saying, well, if I have-- so 47:41 think of two Gaussians. 47:42 For example, I have one that's here and one that's here. 
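"Asking Mathematica" here just means numerical integration. A sketch for two unit-variance Gaussians; the integration grid is an arbitrary choice, and the closed form used as a sanity check, TV = 2*Phi(|mu1 - mu2|/2) - 1 with Phi the standard normal CDF, comes from integrating f - g over the set where f > g:

```python
import math

def gauss_pdf(x, mu):
    # Unit-variance Gaussian density with mean mu.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def tv_gaussians(mu1, mu2, lo=-10.0, hi=10.0, steps=100_000):
    # Midpoint-rule approximation of (1/2) * integral of |f - g|.
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += abs(gauss_pdf(x, mu1) - gauss_pdf(x, mu2)) * dx
    return 0.5 * total

mu1, mu2 = 0.0, 1.0
# Closed form for this special case: 2 * Phi(|mu1 - mu2| / 2) - 1.
phi = 0.5 * (1 + math.erf((abs(mu1 - mu2) / 2) / math.sqrt(2)))
print(tv_gaussians(mu1, mu2), 2 * phi - 1)  # both ≈ 0.3829
```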
47:44 47:47 So this is let's say f theta, f theta prime. 47:51 This guy is doing what? 47:52 It's computing the absolute value of the difference 47:55 between f and f theta prime. 47:57 You can check for yourself that graphically, this I 48:01 can represent as an area not under the curve, 48:05 but between the curves. 48:10 So this is this guy. 48:11 48:16 Now, this guy is really the integral of the absolute value. 48:20 So this thing here, this area, this 48:22 is 2 times the total variation. 48:25 48:28 The scaling 1/2 really doesn't matter. 48:29 It's just if I want to have an actual correspondence 48:32 between the maximum and the other guy, I have to do this. 48:36 48:39 So this is what it looks like. 48:41 So we have this definition. 48:42 And so we have a couple of properties that come into this. 48:48 The first one is that it's symmetric. 48:49 TV of P theta and P theta prime is 48:51 the same as the TV between P theta prime and P theta. 48:55 Well, that's pretty obvious from this definition. 48:59 I just flip those two, I get the same number. 49:02 It's actually also true if I take the maximum. 49:05 Those things are completely symmetric in theta and theta 49:07 prime. 49:08 You can just flip them. 49:10 It's non-negative. 49:11 Is that clear to everyone that this thing is non-negative? 49:15 I integrate an absolute value, so this thing 49:20 is going to give me some non-negative number. 49:22 And so if I integrate this non-negative number, 49:24 it's going to be a non-negative number. 49:26 The fact also that it's an area tells me 49:29 that it's going to be non-negative. 49:32 The nice thing is that if TV is equal to zero, then 49:36 the two distributions, the two probabilities are the same. 49:42 That means that for every A, P theta of A 49:46 is equal to P theta prime of A. Now, 49:49 there's two ways to see that. 
49:50 The first one is to say that if this integral is 49:53 equal to 0, that means that for almost all X, 49:56 f theta is equal to f theta prime. 49:58 The only way I can integrate a non-negative and get 0 50:01 is that it's 0 pretty much everywhere. 50:05 And so what it means is that the two densities 50:07 have to be the same pretty much everywhere, 50:09 which means that the distributions are the same. 50:11 But this is not really the way you want to do this, 50:13 because you have to understand what 50:15 pretty much everywhere means-- 50:16 which I should really say almost everywhere. 50:18 That's the formal way of saying it. 50:20 But let's go to this definition-- 50:22 50:24 which is gone. 50:26 Yeah. 50:26 That's the one here. 50:28 The max of those two guys, if this maximum is equal to 0-- 50:35 I have a maximum of non-negative numbers, their absolute values. 50:39 Their maximum is equal to 0, well, 50:42 they better be all equal to 0, because if one is not 50:44 equal to 0, then the maximum is not equal to 0. 50:47 So those two guys, for those two things 50:50 to be-- for the maximum to be equal to 0, 50:52 then each of the individual absolute values 50:54 have to be equal to 0, which means that the probability here 50:57 is equal to this probability here for every event A. 51:03 So those two things-- 51:04 this is nice, right? 51:06 That's called definiteness. 51:08 The total variation equal to 0 implies that P theta 51:10 is equal to P theta prime. 51:12 So that's really some notion of distance, right? 51:14 That's what we want. 51:16 If this thing being small implied 51:17 that P theta could be all over the place compared 51:20 to P theta prime, that would not help very much. 51:24 Now, there's also the triangle inequality 51:26 that follows immediately from the triangle 51:28 inequality inside this guy. 
51:32 If I squeeze in some f theta prime prime in there, 51:35 I'm going to use the triangle inequality 51:37 and get the triangle inequality for the whole thing. 51:39 51:42 Yeah? 51:42 AUDIENCE: The fact that you need two definitions 51:45 of the [INAUDIBLE], is it something 51:48 obvious or is it complicated? 51:50 PHILIPPE RIGOLLET: I'll do it for you now. 51:52 So let's just prove that those two things are actually 51:56 giving me the same definition. 51:58 52:00 So what I'm going to do is I'm actually going 52:02 to start with the second one. 52:04 And I'm going to write-- 52:05 I'm going to start with the density version. 52:07 But as an exercise, you can do it for the PMF version 52:10 if you prefer. 52:11 So I'm going to start with the fact that f-- 52:13 52:20 so I'm going to write f and g so I don't have to write f theta and f theta prime. 52:23 So think of this as being f sub theta, and think of this guy 52:27 as being f sub theta prime. 52:29 I just don't want to have to write indices all the time. 52:32 So I'm going to start with this thing, the integral of the absolute value of f 52:34 of X minus g of X, dx. 52:38 The first thing I'm going to do is this is an absolute value, 52:41 so either the number in the absolute value is positive 52:45 and I actually kept it like that, or it's negative 52:47 and I flipped its sign. 52:48 So let's just split between those two cases. 52:51 So this thing is equal to 1/2 the integral of-- 52:55 so let me actually write the set A star as 53:00 being the set of X's such that f of X is larger than g of X. 53:09 So that's the set on which the difference is 53:11 going to be positive; elsewhere the difference is 53:13 going to be negative. 53:14 So this, again, is equivalent to f 53:17 of X minus g of X is positive. 53:23 OK. 53:23 Everybody agrees? 53:24 So this is the set I'm interested in. 53:26 53:29 So now I'm going to split my integral into two parts, 53:31 on A star and its complement. So on A star, f is larger than g, 53:38 so the absolute value is just the difference itself. 
53:40 53:45 So here I put parentheses rather than absolute values. 53:48 And then I have plus 1/2 of the integral on the complement. 53:54 What are you guys used to, to write the complement-- the C 53:57 or the bar? 54:01 The C? 54:01 54:05 And so here on the complement, then f is less than g, 54:08 so this is actually really g of X minus f of X, dx. 54:17 Everybody's with me here? 54:19 So I just said-- 54:20 I mean, those are just rewriting what the definition 54:23 of the absolute value is. 54:24 54:33 OK. 54:33 So now there's nice things that I know about f and g. 54:38 And the two nice things are that the integral of f is equal to 1 54:40 and the integral of g is equal to 1. 54:42 54:46 This implies that the integral of f minus g is equal to what? 54:53 AUDIENCE: 0. 54:54 PHILIPPE RIGOLLET: 0. 54:56 And so now that means that if I want 54:59 to just go from the integral here on A complement 55:04 to the integral on A-- 55:05 or on A star complement to the integral on A star, 55:08 I just have to flip the sign. 55:11 So that implies that the integral on A star 55:14 complement of g of X minus f of X, 55:21 dx, this is simply equal to the integral on A star 55:25 of f of X minus g of X, dx. 55:30 55:40 All right. 55:41 So now this guy becomes this guy over there. 55:46 So I have 1/2 of this plus 1/2 of the same guy, 55:50 so that means that 1/2 of the integral of the absolute value of f 55:55 minus g-- 55:57 so that was my original definition, 55:59 this thing is actually equal to the integral on A star 56:03 of f of X minus g of X, dx. 56:10 56:14 And this is simply equal to P of A star-- 56:21 so say Pf of A star minus Pg of A star. 56:26 56:34 Which one is larger than the other one? 56:36 56:41 AUDIENCE: [INAUDIBLE] 56:43 PHILIPPE RIGOLLET: It is. 56:44 Just look at this board. 56:45 AUDIENCE: [INAUDIBLE] 56:47 PHILIPPE RIGOLLET: What? 
56:48 AUDIENCE: [INAUDIBLE] 56:49 PHILIPPE RIGOLLET: The first one has 56:50 to be larger, because this thing is actually 56:51 equal to a non-negative number. 56:53 56:59 So now I have this absolute value of two things, 57:01 and so I'm closer to the actual definition. 57:04 But I still need to show you that this thing is 57:06 the maximum value. 57:09 So this is definitely at most the maximum over A of Pf 57:17 of A minus Pg of A. 57:21 That's certainly true. 57:24 Right? 57:24 We agree with this? 57:27 Because this is just for one specific A, 57:30 and I'm bounding it by the maximum over all possible A. 57:34 So that's clearly true. 57:36 So now I have to go the other way around. 57:38 I have to show you that the max is actually this guy, A star. 57:44 So why would that be true? 57:45 Well, let's just inspect this thing over there. 57:49 So we want to show that if I take 57:50 any other A in this integral than this guy A star, 57:53 it's actually got to decrease its value. 57:56 So we have this function. 57:57 I'm going to call this function delta. 57:59 58:02 And what we have is-- so let's say 58:03 this function looks like this. 58:04 Now it's the difference between two densities. 58:06 It doesn't have to integrate-- it doesn't 58:09 have to be non-negative. 58:10 But it certainly has to integrate to 0. 58:12 58:15 And so now I take this thing. 58:18 And the A star, what is the set A star here? 58:22 The set A star is the set over which the function 58:25 delta is non-negative. 58:27 58:36 So that's just the definition. 58:37 A star was the set over which f minus g was positive, 58:41 and f minus g was just called delta. 58:44 So what it means is that what I'm really integrating 58:47 is delta on this set. 58:50 So it's this area under the curve, 58:53 just on the positive things. 58:55 Agreed? 58:57 So now let's just make some tiny variations around this guy. 59:03 If I take A to be larger than A star-- 59:08 so let me add, for example, this part here. 
59:10 59:12 That means that when I compute my integral, 59:15 I'm removing this area under the curve. 59:18 It's negative. 59:18 The integral here is negative. 59:20 So if I start adding something to A, the value goes lower. 59:25 If I start removing something from A, like say this guy, 59:29 I'm actually removing this value from the integral. 59:32 So there's no way. 59:33 I'm actually stuck. 59:34 This A star is the one that actually maximizes 59:37 the integral of this function. 59:39 So we used the fact that for any function, 59:49 say delta, the integral over A of delta 59:59 is less than the integral over the set of X's 60:02 such that delta of X is non-negative of delta of X, dx. 60:07 60:10 And that's an obvious fact, just by picture, say. 60:13 60:18 And that's true for all A. Yeah? 60:24 AUDIENCE: [INAUDIBLE] could you use 60:28 like a portion under the axis as like less than 60:33 or equal to the portion above the axis? 60:34 PHILIPPE RIGOLLET: It's actually equal. 60:36 We know that the integral of f minus g-- 60:39 the integral of delta is 0. 60:41 So there's actually exactly the same area above and below. 60:47 But yeah, you're right. 60:49 You could go to the extreme cases. 60:51 You're right. 60:51 60:57 No. 60:57 It would actually still be true, even if there was-- 61:00 if this was a constant, that would still be true. 61:02 Here, I never use the fact that the integral is equal to 0. 61:05 61:11 I could shift this function by 1 so that the integral of delta 61:15 is equal to 1, and it would still 61:18 be true that it's maximized when I take A to be 61:21 the set where it's positive. 61:24 I just need to make sure that there is someplace where it is positive, 61:27 but that's about it. 61:28 61:33 Of course, we used this before, when we made this claim. 61:36 But just the last argument, this last fact 61:38 does not require that. 61:39 61:43 All right. 61:44 So now we have this notion of-- 61:47 I need the-- 61:48 61:52 OK. 
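The maximization argument above can be checked by brute force on a small discrete example: enumerate every event A, and confirm the best one is A* = {x : p(x) > q(x)}, with value matching the 1/2-sum formula. The PMFs below are made up for illustration:

```python
from itertools import combinations

# Two hypothetical PMFs on the support {0, 1, 2, 3}.
p = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}
q = {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.1}
support = list(p)

def prob(dist, event):
    return sum(dist[x] for x in event)

# Brute force: P(A) - Q(A) over ALL 2^4 possible events A.
best = max(
    prob(p, A) - prob(q, A)
    for r in range(len(support) + 1)
    for A in combinations(support, r)
)
a_star = [x for x in support if p[x] > q[x]]          # the claimed maximizer
tv_formula = 0.5 * sum(abs(p[x] - q[x]) for x in support)
print(round(best, 10), round(prob(p, a_star) - prob(q, a_star), 10),
      round(tv_formula, 10))  # all three agree
```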
61:53 So we have this notion of distance 61:57 between probability measures. 61:58 I mean, these things are exactly what-- 62:00 if I were to be in a formal math class and I said, 62:03 here are the axioms that a distance should satisfy, 62:06 those are exactly those things. 62:08 If it's not satisfying these things, 62:10 it's called a pseudo-distance or quasi-distance or just metric 62:13 or nothing at all, honestly. 62:15 So it's a distance. 62:16 It's symmetric, non-negative, equal to 0 62:18 if and only if the two arguments are equal, and 62:21 it satisfies the triangle inequality. 62:25 And so that means that we have this actual total variation 62:28 distance between probability distributions. 62:31 And here is now a statistical strategy to implement our goal. 62:36 Remember, our goal was to spit out 62:38 a theta hat such that P theta 62:41 hat was close to P theta star. 62:45 So hopefully, we were trying to minimize the total variation 62:48 distance between P theta hat and P theta star. 62:51 Now, we cannot do that, because just by this fact, this slide, 62:55 if we wanted to do that directly, we would just take-- 62:57 well, let's take theta hat equals theta star and that will 62:59 give me the value 0. 63:00 And that's the minimum possible value we can take. 63:03 The problem is that we cannot compute 63:04 the total variation to something that we don't know. 63:07 We know how to compute total variations if I give you 63:09 the two arguments. 63:10 But here, one of the arguments is not known. 63:12 P theta star is not known to us, so we need to estimate it. 63:16 And so here is the strategy. 63:18 Just build an estimator of the total variation 63:21 distance between P theta and P theta star 63:24 for all candidate theta, all possible theta 63:27 in capital theta. 63:30 Now, if this is a good estimate, then when I minimize it, 63:33 I should get something that's close to P theta star. 63:37 So here's the strategy. 
63:38 This is my function that maps theta 63:40 to the total variation between P theta and P theta star. 63:44 I know it's minimized at theta star. 63:47 That's definitely TV of P-- and the value here, the y-axis 63:51 should say 0. 63:53 And so I don't know this guy, so I'm 63:54 going to estimate it by some estimator that 63:56 comes from my data. 63:57 Hopefully, the more data I have, the better this estimator is. 64:00 And I'm going to try to minimize this estimator now. 64:03 And if the two things are close, then the minima 64:05 should be close. 64:07 That's a pretty good estimation strategy. 64:09 The problem is that it's very unclear 64:11 how you would build this estimator of TV, 64:13 of the Total Variation. 64:18 So building estimators, as I said, 64:21 typically consists in replacing expectations by averages. 64:25 But there's no simple way of expressing the total variation 64:29 distance as the expectations with respect 64:31 to theta star of anything. 64:33 So what we're going to do is we're 64:36 going to move from total variation distance 64:38 to another notion of distance that sort of has 64:41 the same properties and the same feeling 64:43 and the same motivations as the total variation distance. 64:47 But for this guy, we will be able to build 64:49 an estimate for it, because it's actually 64:51 going to be of the form expectation of something. 64:53 And we're going to be able to replace 64:55 the expectation by an average and then minimize this average. 65:00 So this surrogate for total variation distance 65:04 is actually called the Kullback-Leibler divergence. 65:07 And why we call it divergence is because it's actually 65:09 not a distance. 65:11 It's not going to be symmetric to start with. 65:14 So this Kullback-Leibler or even KL divergence-- 65:17 I will just refer to it as KL-- 65:20 is actually just more convenient. 65:22 But it has some roots coming from information theory, which 65:27 I will not delve into. 
65:29 But if any of you is actually a Course 6 student, 65:31 I'm sure you've seen that in some-- 65:32 I don't know-- course that has any content on information 65:37 theory. 65:39 All right. 65:39 So the KL divergence between two probability measures, P theta 65:42 and P theta prime-- 65:43 and here, as I said, it's not going to be symmetric, 65:47 so it's very important for you to specify 65:49 which order you say it in, between P theta and P theta 65:51 prime. 65:52 It's different from saying between P theta prime and P 65:55 theta. 65:56 And so we denote it by KL. 65:58 And so remember, before we had either the sum or the integral 66:04 of 1/2 of the absolute value of the distance 66:07 between the PMFs, or 1/2 of the absolute values 66:10 of the distances between the probability density functions. 66:17 And then we replace this absolute value 66:19 of the distance divided by 2 by this weird function. 66:24 This function is P theta, log P theta, 66:28 divided by P theta prime. 66:30 That's the function. 66:31 That's a weird function. 66:34 OK. 66:35 So this was what we had. 66:38 66:40 That's the TV. 66:41 66:44 And the KL, if I use the same notation, f and g, 66:48 is the integral of f of X, log of f of X over g of X, dx. 66:57 67:01 It's a bit different. 67:04 And I go from discrete to continuous using an integral. 67:09 Everybody can read this. 67:10 Everybody's fine with this. 67:11 Is there any uncertainty about the actual definition here? 67:15 So here I go straight to the definition, 67:17 which is just plugging the functions 67:19 into some integral and computing. 67:22 So I don't bother with maxima or anything. 67:24 I mean, there is something like that, 67:26 but it's certainly not as natural as the total variation. 67:29 Yes? 67:30 AUDIENCE: The total variation, [INAUDIBLE]. 67:33 67:38 PHILIPPE RIGOLLET: Yes, just because it's 67:40 hard to build anything from total variation, 67:42 because I don't know it. 67:43 So it's very difficult. 
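The definition just written down, coded for the discrete case with two hypothetical Bernoulli distributions; swapping the arguments gives a different number, which is the asymmetry just mentioned:

```python
import math

def kl_discrete(p, q):
    """KL(P || Q) = sum over x of p(x) * log(p(x) / q(x)),
    for PMFs given as dicts {x: prob}."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

theta, theta_prime = 0.3, 0.5  # illustrative parameters
p = {0: 1 - theta, 1: theta}
q = {0: 1 - theta_prime, 1: theta_prime}
print(kl_discrete(p, q))  # ≈ 0.0823
print(kl_discrete(q, p))  # ≈ 0.0872 -- different order, different value
```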
But if you can actually-- 67:45 and even computing it between two Gaussians, 67:47 just try it for yourself. 67:49 And please stop doing it after at most six minutes, 67:52 because you won't be able to do it. 67:54 And so it's just very hard to manipulate, 67:56 like this integral of absolute values of differences 67:59 between probability density function, at least 68:01 for the probability density functions 68:02 we're used to manipulate is actually a nightmare. 68:04 And so people prefer KL, because for the Gaussian, 68:08 this is going to be theta minus theta prime squared. 68:10 And then we're going to be happy. 68:12 And so those things are much easier to manipulate. 68:15 But it's really-- the total variation 68:18 is telling you how far in the worst case 68:20 the two probabilities can be. 68:21 This is really the intrinsic notion 68:23 of closeness between probabilities. 68:25 So that's really the one-- if we could, 68:27 that's the one we would go after. 68:30 Sometimes people will compute them numerically, 68:32 so that they can say, oh, here's the total variation distance I 68:34 have between those two things. 68:36 And then you actually know that that 68:38 means they are close, because the absolute value-- if I tell 68:41 you total variation is 0.01, like we did here, 68:44 it has a very specific meaning. 68:46 If I tell you the KL divergence is 0.01, 68:49 it's not clear what it means. 68:50 68:55 OK. 68:55 So what are the properties? 68:58 The KL divergence between P theta and P theta prime 69:00 is different from the KL divergence between P theta 69:03 prime and P theta in general. 69:05 Of course, in general, because if theta 69:07 is equal to theta prime, then this certainly is true. 69:11 So there's cases when it's not true. 69:14 The KL divergence is non-negative. 69:17 Who knows the Jensen's inequality here? 69:19 That should be a subset of the people who 69:21 raised their hand when I asked what a convex function is. 69:25 All right. 
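The Gaussian fact the professor quotes-- that for unit-variance Gaussians the KL is "theta minus theta prime squared," up to a factor 1/2-- can be checked numerically. The integration grid below is an arbitrary choice:

```python
import math

def gauss_pdf(x, mu):
    # Unit-variance Gaussian density with mean mu.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def kl_gauss_numeric(mu1, mu2, lo=-12.0, hi=12.0, steps=100_000):
    # Midpoint-rule approximation of the integral of f * log(f / g).
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        f = gauss_pdf(x, mu1)
        total += f * math.log(f / gauss_pdf(x, mu2)) * dx
    return total

mu1, mu2 = 0.3, 1.5  # illustrative means
# Closed form for this case: (mu1 - mu2)^2 / 2.
print(kl_gauss_numeric(mu1, mu2), (mu1 - mu2) ** 2 / 2)  # both ≈ 0.72
```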
69:26 So you know what Jensen's inequality is. 69:27 This is Jensen's-- the proof is just a one-step 69:30 Jensen's inequality, which we will not go into in detail. 69:33 But that's basically an inequality 69:35 involving the expectation of a convex function 69:38 of a random variable compared to the convex function 69:40 of the expectation of a random variable. 69:42 69:45 If you know Jensen, have fun and prove it. 69:48 What's really nice is that if the KL is equal to 0, 69:51 then the two distributions are the same. 69:55 And that's something we're looking for. 69:57 Everything else we're happy to throw out. 69:59 And actually, if you pay attention, 70:00 we're actually really throwing out everything else. 70:03 So they're not symmetric. 70:05 It does not satisfy the triangle inequality in general. 70:08 But it's non-negative and it's 0 if and only if the two 70:12 distributions are the same. 70:13 And that's all we care about. 70:15 And that's what we call a divergence rather than 70:17 a distance, and a divergence will be enough for our purposes. 70:21 And actually, this asymmetry, the fact 70:24 that it's not flipping-- the first time I saw it, 70:26 I was just annoyed. 70:27 I was like, can we just, I don't 70:29 know, take the average of the KL between P theta 70:31 and P theta prime and the KL between P theta prime and P theta? 70:34 You would think maybe you could do this. 70:36 You just symmetrize it by taking the average of the two 70:39 possible values it can take. 70:41 The problem is that this will still not satisfy the triangle 70:44 inequality. 70:45 And there's no way basically to turn it into something 70:48 that is a distance. 70:49 But the divergence is doing a pretty good thing for us. 70:52 And this is what will allow us to estimate it and basically 70:55 overcome what we could not do with the total variation. 
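[The remark about symmetrizing can be checked numerically. A sketch, assuming Bernoulli distributions; the three parameter values 0.1, 0.5, 0.9 are just an illustration I picked, and they make the symmetrized KL violate the triangle inequality:]

```python
import math

def kl_bern(a, b):
    # KL divergence between Bernoulli(a) and Bernoulli(b).
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def sym_kl(a, b):
    # Symmetrized KL: the average of the two possible orderings.
    return 0.5 * (kl_bern(a, b) + kl_bern(b, a))

# A distance would require sym_kl(0.1, 0.9) <= sym_kl(0.1, 0.5) + sym_kl(0.5, 0.9).
direct = sym_kl(0.1, 0.9)
via_middle = sym_kl(0.1, 0.5) + sym_kl(0.5, 0.9)
```

[Here `direct` comes out larger than `via_middle`, so averaging the two orderings still does not give a metric, exactly as claimed.]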
71:03 So the first thing that you want to notice 71:06 is the total variation distance-- 71:08 the KL divergence, sorry, is actually 71:10 an expectation of something. 71:12 Look at what it is here. 71:15 It's the integral of some function against a density. 71:20 That's exactly the definition of an expectation, right? 71:25 So this is the expectation of this particular function 71:29 with respect to this density f. 71:31 So in particular, if I call this density f-- if I say, 71:35 I want the true distribution to be the first argument, 71:38 this is an expectation with respect 71:39 to the true distribution from which my data is actually 71:42 drawn of the log of this ratio. 71:45 So ha ha. 71:46 I'm a statistician. 71:47 Now I have an expectation. 71:49 I can replace it by an average, because I have data 71:51 from this distribution. 71:52 And I could actually replace the expectation by an average 71:54 and try to minimize here. 71:56 The problem is that-- 71:57 actually the star here should be in front of the theta, 72:00 not of the P, right? 72:01 That's P theta star, not P star theta. 72:04 But here, I still cannot compute it, 72:05 because I have this P theta star that shows up. 72:08 I don't know what it is. 72:10 And that's now where the log plays a role. 72:13 If you actually pay attention, I said 72:15 you can use Jensen to prove all this stuff. 72:16 You could actually replace the log by any concave function. 72:21 That would be a different divergence. 72:22 That's called an f-divergence. 72:24 But the log itself has a very, very specific property, 72:26 which allows us to say that the log of the ratio 72:29 is the difference of the logs. 72:33 Now, this thing here does not depend on theta. 72:38 If I think of this KL divergence as a function of theta, 72:43 then the first part is actually a constant. 72:45 If I change theta, this thing is never going to change. 72:47 It depends only on theta star. 
72:49 So if I look at this function KL-- 72:51 73:03 so if I look at the function, theta maps 73:05 to KL P theta star, P theta, it's 73:11 of the form expectation with respect to theta star 73:15 of log of P theta star of X. And then I 73:23 have minus expectation with respect to theta star of log 73:29 of P theta of X. 73:33 Now as I said, this thing here, this second expectation 73:38 is a function of theta. 73:39 When theta changes, this thing is going to change. 73:42 And that's a good thing. 73:43 We want something that reflects how close theta and theta 73:45 star are. 73:46 But this thing is not going to change. 73:48 This is a fixed value. 73:49 Actually, it's the negative entropy of P theta star. 73:53 And if you've heard of KL, you've 73:54 probably heard of entropy. 73:55 And that's what-- it's basically minus the entropy. 73:58 And that's a quantity that just depends on theta star. 74:01 But it's just a number. 74:03 I could compute this number if I told 74:05 you this is N(theta star, 1). 74:07 You could compute this. 74:09 So now I'm going to try to minimize 74:11 the estimate of this function. 74:14 And minimizing a function or a function plus a constant 74:16 is the same thing. 74:18 I'm just shifting the function here or here, 74:20 but it's the same minimizer. 74:23 OK. 74:24 So the function that maps theta to KL of P theta star 74:28 to P theta is of the form constant minus this expectation 74:32 of a log of P theta. 74:35 Everybody agrees? 74:38 Are there any questions about this? 74:40 Are there any remarks, including I 74:42 have no idea what's happening right now? 74:46 OK. 74:46 We're good? 74:47 Yeah. 74:48 AUDIENCE: So when you're actually employing this method, 74:50 how do you know which theta to use as theta star and which 74:52 isn't? 74:53 PHILIPPE RIGOLLET: So this is not a method just yet, right? 74:55 I'm just describing to you what the KL divergence 74:57 between two distributions is. 
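[The decomposition can be verified on the Bernoulli model from earlier in the lecture. A sketch with my own helper names: the KL splits into a constant (the negative entropy of P theta star) minus the expected log-likelihood, and only the second piece moves with theta:]

```python
import math

def kl_bern(theta_star, theta):
    # KL between Bernoulli(theta_star) and Bernoulli(theta), computed directly.
    return (theta_star * math.log(theta_star / theta)
            + (1 - theta_star) * math.log((1 - theta_star) / (1 - theta)))

def neg_entropy(theta_star):
    # E_{theta*}[log p_{theta*}(X)]: the constant term, depending only on theta_star.
    return (theta_star * math.log(theta_star)
            + (1 - theta_star) * math.log(1 - theta_star))

def expected_log_lik(theta_star, theta):
    # E_{theta*}[log p_theta(X)]: the only part that varies with theta.
    return theta_star * math.log(theta) + (1 - theta_star) * math.log(1 - theta)
```

[Since KL is non-negative and zero only at theta equal to theta star, `expected_log_lik` is maximized at theta star itself, which is the whole point of the method being set up.]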
74:58 If you really wanted to compute it, 75:00 you would need to know what P theta star is 75:01 and what P theta is. 75:02 AUDIENCE: Right. 75:03 PHILIPPE RIGOLLET: And so here, I'm just saying at some point, 75:06 we still-- so here, you see-- 75:07 so now let's move on one step. 75:09 I don't know the expectation with respect to theta star. 75:12 But I have data that comes from the distribution P theta star. 75:15 So the expectation, by the law of large numbers, 75:17 should be close to the average. 75:19 And so what I'm doing is I'm replacing any-- 75:23 I can actually-- this is a very standard estimation method. 75:27 You write something as an expectation with respect 75:30 to the data-generating process of some function. 75:34 And then you replace this by the average of this function. 75:37 And the law of large numbers tells me 75:38 that those two quantities should actually be close. 75:41 Now, it doesn't mean that's going to be the end of the day, 75:43 right. 75:44 When we did Xn bar, that was the end of the day. 75:46 We had an expectation. 75:47 We replaced it by an average. 75:49 And then we were done. 75:51 But here, we still have to do something, 75:53 because this is not telling me what theta is. 75:55 Now I still have to minimize this average. 75:58 So this is now my candidate estimator for KL, KL hat. 76:04 And that's the one where I said, well, it's 76:06 going to be of the form of a constant. 76:07 And this constant, I don't know. 76:09 You're right. 76:09 I have no idea what this constant is. 76:11 It depends on P theta star. 76:13 But then I have minus something that I can completely compute. 76:16 If you give me data and theta, I can compute this entire thing. 76:20 And now what I claim is that the minimizer of f or f plus-- 76:25 f of x or f of x plus 4 are the same thing, 76:28 or say 4 plus f of x. I'm just shifting 76:32 the plot of my function up and down, 76:34 but the minimizer stays exactly where it is. 
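[The replace-expectation-by-average step can be sketched end to end on Bernoulli data. Everything below (the seed, the sample size, the grid resolution) is an arbitrary choice of mine; the point is only that minimizing the empirical average lands near theta star:]

```python
import math
import random

def avg_neg_log_lik(theta, xs):
    # Empirical average of -log p_theta(X_i) for Bernoulli data.
    # The unknown constant (negative entropy of P_theta*) is simply dropped:
    # it shifts the whole curve up or down but not the minimizer.
    n = len(xs)
    return -sum(x * math.log(theta) + (1 - x) * math.log(1 - theta)
                for x in xs) / n

random.seed(0)                     # fixed seed, for reproducibility
theta_star = 0.3                   # the "true" parameter, unknown in practice
data = [1 if random.random() < theta_star else 0 for _ in range(2000)]

# Minimize the empirical average over a grid of candidate thetas.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = min(grid, key=lambda t: avg_neg_log_lik(t, data))
```

[For Bernoulli, the minimizer is (up to the grid resolution) the sample mean, so the circle closes: minimizing the estimated KL recovers Xn bar, the estimator from the beginning of the lecture.]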
76:36 76:39 If I have a function-- 76:41 76:43 so now I have a function of theta. 76:45 76:51 This is KL hat of P theta star, P theta. 76:56 And it's of the form-- it's a function like this. 76:58 I don't know where this function is. 77:00 It might very well be this function or this function. 77:06 Every time it's a translation on the y-axis of all these guys. 77:10 And the value that I translated by depends on theta star. 77:14 I don't know what it is. 77:15 But what I claim is that the minimizer is always this guy, 77:19 regardless of what the value is. 77:22 OK? 77:25 So when I say constant, it's a constant with respect to theta. 77:28 It's an unknown constant. 77:29 But it's with respect to theta, so without loss of generality, 77:32 I can assume that this constant is 0 for my purposes, 77:36 or 25 if you prefer. 77:38 77:41 All right. 77:41 So we'll just keep going on this property next time. 77:46 And we'll see how from here we can move on to-- 77:49 the likelihood is actually going to come out of this formula. 77:51 Thanks. 77:53
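[The picture being drawn, that a vertical shift never moves the minimizer, is a one-liner to check. A throwaway sketch with an arbitrary convex function and an arbitrary constant standing in for the unknown entropy term:]

```python
def argmin_on_grid(f, grid):
    # Return the grid point where f is smallest.
    return min(grid, key=f)

grid = [i / 100 for i in range(101)]
f = lambda t: (t - 0.4) ** 2          # some function of theta
g = lambda t: (t - 0.4) ** 2 + 25.0   # the same function shifted by an unknown constant
```

[Whatever the shift, `argmin_on_grid` returns the same point, which is why the constant can be taken to be 0 (or 25) without loss of generality.]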