https://www.youtube.com/watch?v=JTbZP0yt9qc&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=6 字幕記錄 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:21 PHILIPPE RIGOLLET: So welcome back. 00:23 We're going to finish this chapter on maximum likelihood 00:27 estimation. 00:28 And last time, I briefly mentioned something that 00:30 was called Fisher information. 00:33 So Fisher information, in general, 00:35 is actually a matrix when you have a multivariate parameter 00:40 theta. 00:41 So if theta, for example, is of dimension d, 00:44 then the Fisher information matrix 00:46 is going to be a d by d matrix. 00:48 You can see that, because it's the outer product. 00:51 So it's of the form gradient gradient transpose. 00:55 So if it's gradient gradient transpose, 00:57 the gradient is the d dimensional. 00:59 And so gradient times gradient transpose is a d by d matrix. 01:03 And so this matrix actually contains-- 01:07 well, tells you it's called Fisher information matrix. 01:09 So it's basically telling you how much 01:11 information about the theta is in your model. 01:14 So for example, if your model is very well-parameterized, 01:17 then you will have a lot of information. 01:20 You will have a higher-- 01:21 so let's think of it as being a scalar number, 01:23 just one number now-- so you're going 01:25 to have a larger information about your parameter 01:27 in the same probability distribution. 01:30 But if start having a weird way to parameterize your model, 01:35 then the Fisher information is actually going to drop. 01:38 So as a concrete example think of, for example, 01:40 a parameter of interest in a Gaussian model, 01:44 where the mean is known to be zero. 01:45 But what you're interested in is the variance, sigma squared. 01:48 If I'm interested in sigma square, 01:50 I could parameterize my model by sigma, sigma squared, sigma 01:53 to the fourth, sigma to 24th. 01:55 I could parameterize it by whatever I want, 01:57 then I would have a simple transformation. 01:59 Then you could say that some of them 02:01 are actually more or less informative, 02:02 and you're going to have different values 02:04 for your Fisher information. 02:06 So let's just review a few well-known computations. 02:10 So I will focus primarily on the one dimensional case as usual. 02:17 And I claim that there's two definitions. 02:19 So if theta is a real valued parameter, 02:24 then there's basically two definitions 02:26 that you can think of for your Fisher information. 02:28 One involves the first derivative 02:30 of your log likelihood. 02:31 And the second one involves the second derivative. 02:34 So the log likelihood here, we're 02:36 actually going to define it as l of theta. 02:39 And what is it? 02:40 Well, it's simply the likelihood function for one observation. 02:43 So it's l-- and I'm going to write 1 just to make sure that 02:46 we all know what we're talking about one observation-- 02:49 of-- which is the order again, I think it's X and theta. 02:52 02:55 So that's the log likelihood, remember? 02:57 03:05 So for example, if I have a density, 03:07 what is it going to be? 03:08 It's going to be log of f sub theta of X. 
03:12 So this guy is a random variable, 03:15 because it's a function of a random variable. 03:17 And that's what you see expectations of this thing. 03:19 It's a random function of theta. 03:21 If I view this as a function of theta, 03:23 the function becomes random, because it 03:25 depends on this random X. 03:27 And so I of theta is actually defined as the variance 03:35 of l prime of theta-- 03:40 so the variance of the derivative of this function. 03:43 And I also claim that it's equal to negative the expectation 03:50 of the second derivative of theta. 03:54 03:57 And here, the expectation and the variance 04:00 are computed, because this function, remember, is random. 04:03 So I need to tell you what is the distribution 04:05 of the X with respect to which I'm 04:06 computing the expectation and the variance. 04:08 And it's the theta itself. 04:09 04:13 So typically, the theta we're going 04:15 to be interested in-- so there's a Fisher 04:18 information for all values of the parameter, 04:20 but the one typically we're interested in 04:22 is the true parameter, theta star. 04:25 But view this as a function of theta right now. 04:29 So now, I need to prove to you-- and this is not 04:31 a trivial statement-- the variance of the derivative 04:34 is equal to negative the expectation 04:36 of the second derivative. 04:37 I mean, there's really quite a bit that comes into this right. 04:40 And it comes from the fact that this is a log not of anything. 04:44 It's the log of a density. 04:45 So let's just prove that without having 04:48 to bother too much ourselves with 04:51 some technical assumptions. 04:54 And the technical assumptions are the assumptions 04:56 that allow me to permute derivative and integral. 04:59 Because when I compute the variances and expectations, 05:01 I'm actually integrating against the density. 05:04 And what I want to do is to make sure that I can always do that. 05:09 So my technical assumptions are I can always permute 05:13 integral and derivatives. 05:15 So let's just prove this. 05:19 So what I'm going to do is I'm going to assume 05:21 that X has density f theta. 05:32 05:35 And I'm actually just going to write-- well, let 05:37 me write it f theta right now. 05:39 Let me try to not be lazy about writing. 05:42 And so the thing I'm going to use 05:43 is the fact that the integral of this density is equal to what? 05:50 1. 05:51 And this is where I'm going to start doing weird things. 05:54 That means that if I take the derivative of this guy, 05:56 it's equal to 0. 05:59 So that means that if I look at the derivative with respect 06:03 to theta of integral f theta of X dX, this is equal to 0. 06:11 And this is where I'm actually making this switch, 06:14 is that I'm going to say that this is actually 06:16 equal to the integral of the derivative. 06:19 06:27 So that's going to be the first thing I'm going to use. 06:30 And of course, if it's true for the first derivative, 06:32 it's going to be true for the second derivative. 06:34 So I'm going to actually do it a second time. 06:36 And the second thing I'm going to use 06:38 is the fact the integral of the second derivative 06:46 is equal to 0. 06:47 06:50 So let's start from here. 06:51 06:59 And let me start from, say, the expectation 07:01 of the second derivative of l prime theta. 07:06 So what is l prime prime theta? 07:08 Well, it's the second derivative of log of f theta of X. 
07:21 And we know that the derivative of the log-- 07:24 sorry-- yeah, so the derivative of the log is 1 over-- 07:30 well, it's the derivative of f divided by f itself. 07:34 07:49 Everybody's with me? 07:50 07:53 Just log of f prime is f prime over f. 07:58 Here, it's just that f, I view this as a function of theta 08:01 and not as a function of X. 08:04 So now, I need to take another derivative of this thing. 08:08 So that's going to be equal to-- 08:09 well, so we all know the formula for the derivative 08:13 of the ratio. 08:14 So I pick up the second derivative times f theta 08:22 minus the first derivative squared 08:30 divided by f theta squared-- 08:33 08:38 basic calculus. 08:40 And now, I need to check that negative the expectation 08:43 of this guy is giving me back what I want. 08:47 Well what is negative the expectation of l prime prime 08:52 of theta? 08:54 Well, what we need to do is to do negative integral 08:56 of this guy against f theta. 08:59 So it's minus the integral of-- 09:01 09:28 That's just the definition of the expectation. 09:31 I take an integral against f theta. 09:34 But here, I have something nice. 09:36 What's happening is that those guys are canceling. 09:38 09:41 And now that those guys are canceling, 09:43 those guys are canceling too. 09:44 09:51 So what I have is that the first term-- 09:53 I'm going to break this difference here. 09:56 So I'm going to say that integral of this difference 09:58 is the difference of the integrals. 10:00 So the first term is going to be the integral 10:03 of d over d theta squared of f theta. 10:09 10:13 And the second one, the negative signs are going to cancel, 10:17 and I'm going to be left with this. 10:32 10:37 Everybody's following? 10:39 Anybody found the mistake? 10:40 10:44 How about the other mistake? 10:46 I don't know if there's a mistake. 10:48 I'm just trying to get you to check what I'm doing. 10:50 10:54 With me so far? 10:56 So this guy here is the integral of the second the derivative 10:59 of f of X dX. 11:02 What is this? 11:04 AUDIENCE: It's 0. 11:06 PHILIPPE RIGOLLET: It's 0. 11:08 And that's because of this guy, which I will call frowny face. 11:16 So frowny face tells me this. 11:20 And let's call this guy monkey that hides his eyes. 11:26 No, let's just do something simpler. 11:28 Let's call it star. 11:29 And this guy, we will use later on. 11:32 11:35 So now, I have to prove that this guy, which 11:37 I have proved is equal to this, is now 11:40 equal to the variance of l prime theta. 11:46 So now, let's go back to the other way. 11:48 We're going to meet halfway. 11:49 I'm going to have a series-- 11:51 I want to prove that this guy is equal to this guy. 11:56 And I'm going to have a series of equalities 11:58 that I'm going to meet halfway. 12:01 So let's start from the other end. 12:03 We started from the negative l prime prime theta. 12:05 Let's start with the variance part. 12:06 12:10 Variance of l prime of theta, so that's the variance-- 12:17 12:22 so that's the expectation of l prime 12:29 of theta squared minus the square of the expectation of l 12:35 prime of theta. 12:35 12:41 Now, what is the square of the expectation 12:43 of l prime of theta? 12:44 Well, l prime of theta is equal to the partial with respect 12:50 to theta of log f theta of X, which we know from the first 12:57 line over there-- that's what's in the bracket on the second 12:59 line there-- 13:01 is actually equal to the partial over theta of f 13:05 theta X divided by f theta X. 
That's 13:09 the derivative of the log. 13:11 So when I look at the expectation of this guy, 13:16 I'm going to have the integral of this against f theta. 13:18 And the f thetas are going to cancel again, 13:20 just like I did here. 13:23 So this thing is actually equal to the integral 13:26 of partial over theta of f theta of X dX. 13:33 And what does this equal to? 13:34 13:37 0, by the monkey hiding is eyes. 13:42 So that's star-- tells me that this is equal to 0. 13:46 13:50 So basically, when I compute the variance, this term is not. 13:52 Going to matter. 13:53 I only have to complete the first one. 13:54 14:10 So what is the first one? 14:12 Well, the first one is the expectation of l prime squared. 14:21 14:24 And so that guy is the integral of-- well, what is l prime? 14:29 Again, it's partial over partial theta 14:33 f theta of X divided by f theta of X. Now, this time, this guy 14:37 is squared against the density. 14:40 14:44 So one of the f thetas cancel. 14:45 15:07 But now, I'm back to what I had before for this guy. 15:11 15:16 So this guy is now equal to this guy. 15:20 There's just the same formula. 15:21 So they're the same thing. 15:23 And so I've moved both ways. 15:25 Starting from the expression that 15:27 involves the expectation of the second derivative, 15:30 I've come to this guy. 15:31 And starting from the expression that tells me 15:34 about the variance of the first derivative, 15:36 I've come to the same guy. 15:38 So that completes my proof. 15:41 Are there any questions about the proof? 15:43 15:47 We also have on our way found an explicit formula for the Fisher 15:53 information as well. 15:55 So now that I have this thing, I could actually 15:57 add that if X has a density, for example, 16:02 this is also equal to the integral of-- 16:07 16:09 well, the partial over theta of f theta of X 16:15 squared divided by f theta of X, because I just 16:24 proved that those two things were actually 16:26 equal to the same thing, which was this guy. 16:28 16:31 Now in practice, this is really going to be the useful one. 16:34 The other two are going to be useful depending 16:35 on what case you're in. 16:37 So if I ask you to compute the Fisher information, 16:40 you have now three ways to pick from. 16:43 And basically, practice will tell you 16:44 which one to choose if you want to save five minutes when 16:47 you're doing your computations. 16:48 Maybe you're the guy who likes to take derivatives. 16:50 And then you're going to go with the second derivative one. 16:52 Maybe you're the guy who likes to extend squares, 16:55 so you're going to take the one that 16:56 involves the square of the squared prime. 16:59 And maybe you're just a normal person, 17:01 and you want to use that guy. 17:02 17:06 Why do I care? 17:07 This is the Fisher information. 17:09 And I could have defined the [? Hilbert ?] information 17:11 by taking the square root of this guy 17:13 plus the sine of this thing and just be super happy 17:16 and have my name in textbooks. 17:18 But this thing has a very particular meaning. 17:22 When we're doing the maximum likelihood estimation-- 17:24 17:28 so remember the maximum likelihood estimation is just 17:32 an empirical version of trying to minimize the KL divergence. 17:36 So what we're trying to do, maximum likelihood, 17:39 is really trying to minimize the KL divergence. 17:42 17:54 And we're trying to minimize this function, remember? 17:57 So now what we're going to do is we're 18:01 going to plot this function. 
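As a quick check of the three equivalent formulas for the Fisher information before the picture is drawn, here is a short sketch using the Bernoulli(p) model (the model choice and the numbers are illustrative, not from the lecture). For Bernoulli(p), l(p) = X log p + (1 - X) log(1 - p) and the closed form is I(p) = 1/(p(1 - p)); since this is a PMF rather than a density, the integral in the third formula becomes a sum over the two outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 1_000_000            # illustrative true parameter and sample size
X = rng.binomial(1, p, size=n)

# First derivative (score) and second derivative of l(p) = X log p + (1-X) log(1-p)
score  = X / p - (1 - X) / (1 - p)
second = -X / p**2 - (1 - X) / (1 - p)**2

I_from_variance  = score.var()        # Var[l'(p)]
I_from_curvature = -second.mean()     # -E[l''(p)]
# Third formula, PMF version: sum over x of (d/dp f_p(x))^2 / f_p(x),
# with f_p(1) = p and f_p(0) = 1 - p, whose derivatives in p are +1 and -1.
I_from_square = 1 / p + 1 / (1 - p)

print(I_from_variance, I_from_curvature, I_from_square, 1 / (p * (1 - p)))
# all four numbers should be close to 1/(p(1-p)), about 4.76 here
```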
18:02 We said that, let's place ourselves 18:04 in cases where this KL is convex, 18:06 so that the inverse is concave. 18:09 So it's going to look like this-- 18:11 U-shaped, that's convex. 18:13 So that's the true thing I'm trying to minimize. 18:16 And what I said is that I'm going to actually try 18:18 to estimate this guy. 18:19 So in practice, I'm going to have 18:20 something that looks like this, but it's not really this. 18:22 18:26 And we're not going to do this, but you 18:28 can show that you can control this 18:30 uniformly over the entire space, that there is no space where 18:33 it just becomes huge. 18:35 In particular, this is not the space 18:37 where it just becomes super huge, 18:38 and the minimum of the dotted line 18:39 becomes really far from this guy. 18:42 So if those two functions are close to each other, 18:45 then this implies that the minimum here of the dotted line 18:48 is close to the minimum of the solid line. 18:53 So we know that this is theta star. 18:56 And this is our MLE estimator, theta hat ML. 19:00 So that's basically the principle-- 19:01 the more data we have, the closer the dotted line 19:05 is to the solid line. 19:07 And so the minimum is closer to the minimum. 19:10 But now, this is just one example, 19:12 where I drew a picture for you. 19:14 But there could be some really nasty examples. 19:17 Think of this example, where I have 19:20 a function, which is convex, but it looks more like this. 19:23 19:30 That's convex, it's U-shaped. 19:31 It's just a professional U. 19:36 Now, I'm going to put a dotted line around it that has pretty 19:41 much the same fluctuations. 19:42 The band around it is of this size. 19:44 19:52 So do we agree that the distance between the solid line 19:56 and the dotted line is pretty much the same 19:58 in those two pictures? 20:01 Now, here, depending on how I tilt this guy, 20:04 basically, I can put the minimum theta star wherever I want. 20:07 And let's say that here, I actually put it here. 20:11 That's pretty much the minimum of this line. 20:13 And now, the minimum of the dotted line is this guy. 20:16 20:20 So they're very far. 20:22 The fact that I'm very flat at the bottom 20:25 makes my requirements for being close 20:28 to the U-shaped solid curve much more stringent, 20:31 if I want to stay close. 20:34 And so this is the canonical case. 20:38 This is the annoying case. 20:39 And of course, you have the awesome case-- 20:43 looks like this. 20:45 And then however you deviate, you 20:47 can have something that moves pretty far. 20:50 It doesn't matter, it's always going to stay close. 20:53 Now, what is the quantity that measures how 20:57 curved I am at a given point-- 20:59 how curved the function is at a given point? 21:03 The second derivative. 21:05 And so the Fisher information is negative the second derivative. 21:11 Why the negative? 21:12 21:17 Well here-- Yeah, we're looking for a minimum, 21:18 and this guy is really the-- you should view 21:20 this as a reverted function. 21:23 This is we're trying to maximize the likelihood, which 21:26 is basically maximizing the negative KL. 21:28 So the picture I'm showing you is trying to minimize the KL. 21:31 So the true picture that you should see for this guy 21:33 is the same, except that it's just flipped over. 21:37 But the curvature is the same, whether I flip my sheet or not. 21:40 So it's the same thing.
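A compact way to see why the curvature at the bottom of this U is exactly the Fisher information (a sketch under the same interchange-of-derivative-and-integral assumptions as the proof above): write the KL divergence as an expectation under theta star and differentiate twice in the second argument.

```latex
\mathrm{KL}(P_{\theta^*}, P_\theta)
  = \mathbb{E}_{\theta^*}\!\left[\log f_{\theta^*}(X)\right]
  - \mathbb{E}_{\theta^*}\!\left[\log f_{\theta}(X)\right],
\qquad
\frac{\partial^2}{\partial \theta^2}\,\mathrm{KL}(P_{\theta^*}, P_\theta)
  = -\,\mathbb{E}_{\theta^*}\!\left[\ell''(\theta)\right].
```

Evaluating at theta = theta star gives -E[l''(theta star)] = I(theta star): a flat bottom for the KL means a small Fisher information.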
21:42 So apart from this negative sign, 21:44 which is just coming from the fact 21:45 that we're maximizing instead of minimizing, 21:47 this is just telling me how curved my likelihood is 21:50 around the maximum. 21:51 And therefore, it's actually telling me how good, 21:55 how robust my maximum likelihood estimator is. 21:58 It's going to tell me how close, actually, my likelihood 22:01 estimator is going to be-- 22:03 maximum likelihood is going to be to the true parameter. 22:06 So I should be able to see that somewhere. 22:09 There should be some statement that tells me 22:11 that this Fisher information will 22:13 play a role when assessing the precision of this estimator. 22:17 And remember, how do we characterize a good estimator? 22:20 Well, we look at its bias, or we look its variance. 22:24 And we can combine the two and form the quadratic risk. 22:26 So essentially, what we're going to try to say 22:29 is that one of those guys-- either the bias 22:31 or the variance or the quadratic risk-- 22:33 is going to be worse if my function is flatter, 22:35 meaning that my Fisher information is smaller. 22:37 22:40 And this is exactly the point of this last theorem. 22:44 So let's look at a couple of conditions. 22:46 So this is your typical 1950s statistics 22:51 paper that has like one page of assumptions. 22:54 And this was like that in the early days, 22:56 because people were trying to make 22:58 theories that would be valid for as many models as possible. 23:01 And now, people are sort of abusing this, 23:03 and they're just making all this lists of assumptions 23:05 so that their particular method works 23:06 for their particular problem, because they just 23:08 want to take shortcuts. 23:10 But really, the maximum likelihood estimator 23:13 is basically as old as modern statistics. 23:15 And so this was really necessary conditions. 23:18 And we'll just parse that. 23:19 The model is identified. 23:21 Well, better be, because I'm trying to estimate 23:24 theta and not P theta. 23:25 So this one is good. 23:26 For all theta, the support of P theta does not depend on theta. 23:32 So that's just something that we need to have. 23:34 Otherwise, things become really messy. 23:36 And in particular, I'm not going to be 23:38 able to define likelihood-- 23:40 Kullback-Leibler divergences. 23:42 Then why can I not do that? 23:44 Well, because the Kullback-Leibler divergence 23:46 has a log of the ratio of two densities. 23:49 And if one of the support is changing with theta 23:51 is it might be they have the log of something that's 23:53 0 or something that's not 0. 23:55 And the log of 0 is a slightly annoying quantity to play with. 23:59 And so we're just removing that case. 24:01 Nothing depends on theta-- 24:02 think of it as being basically the entire real line 24:05 as a support for Gaussian, for example. 24:08 Theta star is not on the boundary of theta. 24:10 Can anybody tell me why this is important? 24:13 24:17 We're talking about derivatives. 24:19 So when I want to talk about derivatives, 24:20 I'm talking about fluctuations around a certain point. 24:23 And if I'm at the boundary, it's actually really annoying. 24:26 I might have the derivative-- remember, 24:27 I give you this example-- 24:28 where the maximum likelihood is just obtained at the boundary, 24:31 because the function cannot grow anymore at the boundary. 24:34 But it does not mean that the first order 24:36 derivative is equal to 0. 24:38 It does not mean anything. 
24:39 So all this picture here is valid 24:42 only if I'm actually achieving the minimum inside. 24:46 Because if my theta space stops here and it's just this guy, 24:52 I'm going to be here. 24:53 And there's no questions about curvature or anything 24:55 that comes into play. 24:56 It's completely different. 24:58 So here, it's inside. 25:00 Again, think of theta as being the entire real line. 25:02 Then everything is inside. 25:05 I is invertible. 25:08 What does it mean for a positive number, a 1 by 1 matrix 25:11 to be invertible? 25:12 25:16 Yep. 25:17 AUDIENCE: It'd be equal to its [INAUDIBLE].. 25:20 PHILIPPE RIGOLLET: A 1 by 1 matrix, that's a number, right? 25:24 What is a characteristic-- if I give you a matrix with numbers 25:26 and ask you if it's invertible, what 25:28 are you going to do with it? 25:31 AUDIENCE: Check if the determinant is 0. 25:33 PHILIPPE RIGOLLET: Check if the determinant is 0. 25:35 What is the determinant of the 1 by 1 matrix? 25:37 It's just the number itself. 25:38 So that's basically, you want to check if this number is 0 25:41 or not. 25:42 So we're going to think in the one dimensional case here. 25:44 And in the one dimensional case, that just 25:46 means that the curvature is not 0. 25:51 Well, it better be not 0, because then I'm 25:53 going to have no guarantees. 25:54 If I'm totally flat, if I have no curvature, 25:56 I'm basically totally flat at the bottom. 25:58 And then I'm going to get nasty things. 26:00 Now, this is not true. 26:02 I could have the curvature which grows like-- so here, it's 26:05 basically-- the second derivative is telling me-- 26:08 if I do the Taylor expansion, it's 26:09 telling me how I grow as a function of, say, x squared. 26:13 It's the quadratic term that I'm controlling. 26:15 It could be that this guy is 0, and then the term of order, 26:19 x to the fourth, is picking up. 26:20 That could be the first one that's non-zero. 26:22 26:23 But that would mean that my rate of convergence 26:25 would not be square root of n. 26:26 When I'm actually playing central limit theorem, 26:28 it would become n to the 1/4th. 26:30 And if I have all a bunch of 0 until the 16th order, 26:33 I would have n to the 1/16th, because that's really 26:36 telling me how flat I am. 26:39 So we're going to focus on the case 26:41 where it's only quadratic terms, and the rates 26:43 of the central limit theorems kick in. 26:46 And then a few other technical conditions-- 26:48 we just used a couple of them. 26:49 So I permuted limit and integral. 26:52 And you can check that really what you want 26:54 is that the integral of a derivative is equal to 0. 26:58 Well, it just means that the values at the two ends 27:00 are actually the same. 27:01 So those are slightly different things. 27:05 So now, what we have is that the maximum likelihood estimator 27:08 has the following two properties. 27:10 The first one, if I were to say that in words, what would 27:13 I say, that theta hat is-- 27:15 27:18 Is what? 27:20 Yeah, that's what I would say when I-- 27:22 that's for mathematicians. 27:23 But if I'm a statistician, what am I going to say? 27:27 It's consistent. 27:28 It's a consistent estimator of theta star. 27:30 It converges in probability to theta star. 27:33 And then we have this sort of central limit theorem 27:35 statement. 
27:36 The central limit theorem statement tells me that if this 27:39 was an average and I remove the expectation of the average-- 27:44 let's say it's 0, for example-- 27:45 then square root of n times the average blah 27:47 goes through some normal distribution. 27:49 This is telling me that this is actually true, 27:52 even if theta hat has nothing to do with an average. 27:55 That's remarkable. 27:56 Theta hat might not even have a closed form, 27:59 and I'm still having basically the same properties 28:02 as an average that would be given to me 28:04 by a central limit theorem. 28:08 And what is the asymptotic variance? 28:10 So that's the variance in the n. 28:12 So here, I'm thinking of having those guys being multivariate. 28:15 And so I have the inverse of the covariance matrix 28:18 that shows up as the variance-covariance matrix 28:21 asymptotically. 28:22 But if you think of just being a one dimensional parameter, 28:25 it's one over this Fisher information, 28:27 one over the curvature. 28:29 So the curvature is really flat, the variance 28:31 becomes really big. 28:33 If the function is really flat, curvature is low, 28:36 variance is big. 28:37 If the curvature is very high, the variance becomes very low. 28:41 And so that illustrates everything 28:42 that's happening with the pictures that we have. 28:45 And if you look, what's amazing here, 28:48 there is no square root 2 pi, there's no fudge factors 28:51 going on here. 28:52 This is the asymptotic variance, nothing else. 28:56 It's all in there, all in the curvature. 28:58 29:03 Are there any questions about this? 29:05 29:07 So you can see here that theta star is the true parameter. 29:11 And the information matrix is evaluated at theta star. 29:17 That's the point that matters. 29:18 When I drew this picture, the point 29:20 that was at the very bottom was always theta star. 29:22 It's the one that minimizes the KL divergence, 29:26 am as long as I'm identified. 29:32 Yes. 29:33 AUDIENCE: So the higher the curvature, 29:35 the higher the inverse of the Fisher information? 29:38 PHILIPPE RIGOLLET: No, the higher 29:39 the Fisher information itself. 29:42 So the inverse is going to be smaller. 29:46 So small variance is good. 29:48 So now what it means, actually, if I 29:50 look at what is the quadratic risk 29:51 of this guy, asymptotically-- what 29:54 is asymptotic quadratic risk? 29:56 Well, it's 0 actually. 29:57 But if I assume that this thing is true, 30:01 that this thing is pretty much Gaussian, 30:03 if I look at the quadratic risk, well, it's 30:05 the expectation of the square of this thing. 30:08 And so it's going to scale like the variance divided by n. 30:12 30:14 The bias goes to 0, just by this. 30:18 And then the quadratic risk is going 30:20 to scale like one over Fisher information divided by n. 30:23 30:28 So here, the-- I'm not mentioning the constants. 30:30 There must be constants, because everything is asymptotic. 30:32 So for each finite n, I'm going to have 30:33 some constants that show up. 30:35 30:39 Everybody just got their mind blown by this amazing theorem? 30:43 So I mean, if you think about it, the MLE can be anything. 30:48 I'm sorry to say to you, in many instances, 30:50 the MLE is just going to be an average, which is just 30:52 going to be slightly annoying. 30:54 But there are some cases where it's not. 30:57 And we have to resort to this theorem 30:59 rather than actually resorting to the central limit theorem 31:01 to prove this thing. 
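Here is a small simulation sketch of that statement (the model, sample size, and number of repetitions are illustrative choices, not from the lecture). It uses the Cauchy location model, where the MLE has no closed form and a standard computation gives I(theta) = 1/2, so the theorem predicts that sqrt(n)(theta hat - theta star) is approximately N(0, 2) for large n.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta_star, n, reps = 0.0, 200, 2000

def neg_log_lik(theta, x):
    # Cauchy location model, f_theta(x) = 1 / (pi * (1 + (x - theta)^2));
    # the constant log(pi) term is dropped since it does not affect the argmax.
    return np.sum(np.log1p((x - theta) ** 2))

estimates = []
for _ in range(reps):
    x = theta_star + rng.standard_cauchy(n)
    # No closed form for the MLE here, so maximize the likelihood numerically.
    res = minimize_scalar(neg_log_lik, args=(x,), bounds=(-5.0, 5.0), method="bounded")
    estimates.append(res.x)

scaled = np.sqrt(n) * (np.array(estimates) - theta_star)
print(scaled.var())   # should be close to 1 / I(theta_star) = 2
```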
31:03 And more importantly, even if this was an average, 31:05 you don't have to even know how to compute 31:07 the covariance matrix-- 31:09 sorry, the variance of this thing 31:11 to plug it into the central limit theorem. 31:14 I'm telling you, it's actually given by the Fisher information 31:17 matrix. 31:18 So if it's an average, between you and me, 31:22 you probably want to go the central limit theorem route 31:24 if you want to prove this kind of stuff. 31:26 But if it's not, then that's your best shot. 31:28 But you have to check those conditions. 31:31 I will give you for granted the 0.5. 31:38 Ready? 31:39 Any questions? 31:40 We're going to wrap up this chapter four. 31:41 So if you have questions, that's the time. 31:43 Yes. 31:43 AUDIENCE: What was the quadratic risk up there? 31:45 PHILIPPE RIGOLLET: You mean the definition? 31:47 AUDIENCE: No, the-- what is was for this. 31:49 PHILIPPE RIGOLLET: Well, you see the quadratic risk, 31:51 if I think of it as being one dimensional, 31:53 the quadratic risk is the expectation 31:55 of the square of the difference between theta hat and theta 31:57 star. 31:58 32:01 So that means that if I think of this as having a normal 0, 1, 32:05 that's basically computing the expectation of the square 32:09 of this Gaussian divided by n. 32:13 I just divided by square root of n on both sides. 32:15 So it's the expectation of the square of this Gaussian. 32:18 The Gaussian is mean 0, so the expectation of the square 32:20 is just a variance. 32:23 And so I'm left with 1 over the Fisher information divided 32:25 by n. 32:26 AUDIENCE: I see. 32:26 OK. 32:27 32:34 PHILIPPE RIGOLLET: So let's move on to chapter four. 32:36 And this is the method of moments. 32:38 So the method of moments is actually maybe a bit older 32:42 than maximum likelihood. 32:44 And maximum likelihood is dated, say, early 20th century, 32:48 I mean as a systematic thing, because as I said, 32:50 many of those guys are going to be averages. 32:52 So finding an average is probably a little older. 32:56 The method of moments, there's some really nice uses. 32:58 There's a paper by Pearson in 1904, I believe, or maybe 1894. 33:03 I don't know. 33:04 33:06 And this paper, he was actually studying some species 33:10 of crab in an island, and he was trying 33:12 to make some measurements. 33:13 That's how he came up with this model of mixtures of Gaussians, 33:16 because there was actually two different populations 33:18 in this populations of crab. 33:20 And the way he actually fitted the parameters 33:23 was by doing the method of moments, 33:25 except that since there were a lot of parameters, 33:27 he actually had to basically solve six equations with six 33:33 unknowns. 33:34 And that was a complete nightmare. 33:35 And the guy did it by hand. 33:36 And we don't know how he did it actually. 33:40 But that is pretty impressive. 33:43 So I want to start-- 33:44 and this first part is a little brutal. 33:48 But this is a Course 18 class, and I do not want to give you-- 33:52 So let's all agree that this course might be slightly more 33:54 challenging than AP statistics. 33:56 And that means that it's going to be challenging just 34:00 during class. 34:01 I'm not going to ask you about the Weierstrass Approximation 34:04 Theorem during the exams. 34:05 But what I want is to give you mathematical motivations 34:08 for what we're doing. 
34:10 And I can promise you that maybe you 34:12 will have a slightly higher body temperature during the lecture, 34:17 but you will come out smarter of this class. 34:20 And I'm trying to motivate to you for using mathematical tool 34:24 and show you where interesting mathematical things that you 34:27 might find dry elsewhere actually work very beautifully 34:31 in the stats literature. 34:33 And one that we saw was using Kullback-Leibler divergence out 34:35 of motivation for maximum likelihood estimation, 34:38 for example. 34:39 So the Weierstrass Approximation Theorem 34:42 is something that comes from pure analysis. 34:45 So maybe-- I mean, it took me a while before I saw that. 34:49 And essentially, what it's telling you 34:51 is that if you look at a function that 34:52 is continuous on an interval a, b-- 34:55 on a segment a, b-- 34:57 then you can actually approximate it 35:02 uniformly well by polynomials as long 35:05 as you're willing to take the degree 35:06 of this polynomials large enough. 35:09 So the formal statement is, for any epsilon, 35:11 there exists the d that depends on epsilon in a1 to ad-- 35:16 so if you insist on having an accuracy which is 1/10,000, 35:20 maybe you're going to need a polynomial of degree 100,000, 35:23 who knows. 35:24 It doesn't tell you anything about this. 35:26 But it's telling you that at least you 35:28 have only a finite number of parameters 35:29 to approximate those functions that typically 35:31 require an infinite number of parameters to be described. 35:35 So that's actually quite nice. 35:36 And that's the basis for many things 35:39 and many polynomial methods typically. 35:43 And so here, it's uniform, so there's 35:45 this max over x that shows up that's actually nice as well. 35:50 That's Weierstrass Approximation Theorem. 35:52 Why is that useful to us? 35:54 Well, in statistics, I have a sample of X1 to Xn. 35:58 I have, say, a unified statistical model. 36:00 I'm not always going to remind you 36:01 that it's identified-- not unified-- identified 36:04 statistical model. 36:05 And I'm going to assume that it has a density. 36:08 You could think of it as having a PMF, 36:10 but think of it as having a density for one second. 36:13 Now, what I want is to find the distribution. 36:16 I want to find theta. 36:18 And finding theta, since it's identified 36:20 as equivalent to finding P theta, which 36:22 is equivalent to finding f theta, and knowing a function 36:26 is the same-- 36:28 knowing a density is the same as knowing a density 36:30 against any test function h. 36:33 So that means that if I want to make sure I know a density-- 36:38 if I want to check if two densities are the same, 36:40 all I have to do is to compute their integral 36:42 against all bounded continuous functions. 36:46 You already know that it would be true 36:48 if I checked for all functions h. 36:50 But since f is a density, I can actually 36:53 look only at functions h that are bounded, 36:56 say, between minus 1 and 1, and that are continuous. 37:04 That's enough. 37:06 Agreed? 37:06 Well, just trust me on this. 37:08 Yes, you have a question? 37:11 AUDIENCE: Why is this-- like, why shouldn't you 37:14 just say that [INAUDIBLE]? 37:17 37:20 PHILIPPE RIGOLLET: Yeah, I can do that. 37:21 I'm just finding a characterization 37:23 that's going to be useful for me later on. 37:25 I can find a bunch of them. 37:26 But here, this one is going to be useful. 
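For reference, the Weierstrass Approximation Theorem just invoked, written out (indexing the constant coefficient as a_0): for every continuous f on [a, b] and every epsilon > 0,

```latex
\exists\, d \in \mathbb{N},\ a_0, \dots, a_d \in \mathbb{R}
\quad\text{such that}\quad
\max_{x \in [a, b]}\;\Bigl|\, f(x) - \sum_{k=0}^{d} a_k x^k \,\Bigr| \;\le\; \varepsilon .
```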
37:28 So all I need to say is that f theta star integrated against 37:32 X, h of x-- so this implies that f-- 37:35 if theta is equal to f theta star not everywhere, 37:38 but almost everywhere. 37:41 And that's only true if I guarantee to you that f theta 37:44 and f theta stars are densities. 37:46 This is not true for any function. 37:49 So now, that means that, well, if I wanted to estimate theta 37:53 hat, all I would have to do is to compute the average, right-- 37:56 so this guy here, the integral-- 38:01 let me clean up a bit my board. 38:02 38:22 So my goal is to find theta such that, if I look at f theta 38:30 and now I integrate it against h of x, then 38:34 this gives me the same thing as if I were 38:36 to do it against f theta star. 38:42 And I want this for any h, which is continuous and bounded. 38:45 38:48 So of course, I don't know what this quantity is. 38:51 It depends on my unknown theta star. 38:53 But I have theta from this. 38:54 And I'm going to do the usual-- 38:56 the good old statistical trick, which is, 38:57 well, this I can write as the expectation with respect 39:01 to P theta star of h theta of x. 39:06 That's just the integral of a function against something. 39:08 And so what I can do is say, well, now I 39:11 don't know this guy. 39:12 But my good old trick from the book 39:14 is replace expectations by averages. 39:15 And what I get-- 39:16 39:23 And that's approximately by the law of large numbers. 39:29 So if I can actually find a function f theta such 39:33 that when I integrate it against h 39:36 it gives me pretty much the average of the evaluations 39:39 of h over my data points for all h, 39:42 then that should be a good candidate. 39:44 39:47 The problem is that's a lot of functions to try. 39:52 Even if we reduced that from all possible functions 39:54 to bounded and continuous ones, that's 39:56 still a pretty large infinite number of them. 40:01 And so what we can do is to use our Weierstrass Approximation 40:05 Theorem. 40:06 And it says, well, maybe I don't need to test it against all h. 40:09 Maybe the polynomials are enough for me. 40:11 So what I'm going to do is I'm going to look only at functions 40:14 h that are of the form sum of ak-- 40:20 so h of x is sum of ak X to the k-th for k 40:29 equals 0 to d-- only polynomials of degree d. 40:34 So when I look at the average of my h's, I'm 40:37 going to get a term like the first one. 40:40 So the first one here, this guy, becomes 1/n sum from i equal 1 40:47 to n sum from k equal 0 to d of ak Xi to the k-th. 40:54 That's just the average of the values of h of Xi. 40:58 And now, what I need to do is to check 41:00 that it's the same thing when I integrate 41:03 h of this form as well. 41:06 I want this to hold for all polynomials of degree d. 41:10 That's still a lot of them. 41:12 There's still an infinite number of polynomials, 41:14 because there's an infinite number of numbers a0 to ad 41:17 that describe those polynomials. 41:21 But since those guys are polynomials, 41:23 it's actually enough for me to look only at the terms 41:26 of the form X to the k-th-- 41:28 no linear combination, no nothing. 41:30 So actually, it's enough to look only 41:32 at h of x, which is equal to X to the k-th for k equal 0 to d. 41:40 41:43 And now, how many of those guys are there? 41:46 Just d plus 1, 0 to d. 41:49 So that's actually a much easier thing for me to solve. 
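Summarizing the chain of reductions just made in one display (same notation as the board): the requirement against all bounded continuous test functions is replaced, via Weierstrass, by the d + 1 monomials, and the unknown right-hand sides are replaced by averages.

```latex
\int f_\theta(x)\, h(x)\, dx \;\approx\; \frac{1}{n}\sum_{i=1}^{n} h(X_i)
\quad \text{for all bounded continuous } h
\;\;\leadsto\;\;
\int f_\theta(x)\, x^k\, dx \;\approx\; \frac{1}{n}\sum_{i=1}^{n} X_i^k,
\quad k = 0, 1, \dots, d .
```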
41:51 41:54 Now, this quantity, which is the integral of f theta against X 42:01 to the k-th-- so that the expectation of X to the k-th 42:05 here-- 42:06 it's called moment of order k, or k-th moment of P theta. 42:12 That's a moment. 42:13 A moment is just the expectation of the power. 42:16 The mean is which moment? 42:19 The first moment. 42:21 And variance is not exactly the second moment. 42:24 It's the second moment minus the first moment squared. 42:27 42:29 That's the variance. 42:30 It's E of X squared minus E of X squared. 42:34 So those are things that you already know. 42:36 And then you can go higher. 42:37 You can go to E of X cube, E of X blah, blah. 42:40 Here, I say go to E of X to the d-th. 42:43 Now, as you can see, this is not something 42:44 you can really put in action right now, 42:47 because the Weierstrass Approximation Theorem does not 42:50 tell you what d should be. 42:52 Actually, we totally lost track of the epsilon 42:54 I was even looking for. 42:54 I just said approximately equal, approximately equal. 42:57 And so all this thing is really just motivation. 42:59 But it's essentially telling you that if you 43:02 go to d large enough, technically 43:05 you should be able to identify exactly your distribution up 43:08 to epsilon. 43:11 So I should be pretty good, if I go to d large enough. 43:16 Now in practice, actually there should be much 43:19 less than arbitrarily large d. 43:23 Typically, we are going to need d which is 1 or 2. 43:25 43:28 So there are some limitations to the Weierstrass Approximation 43:31 Theorem. 43:32 And there's a few. 43:33 The first one is that it only works 43:35 for continuous functions, which is not so much of a problem. 43:39 That can be fixed. 43:42 Well, we need bounded continuous functions. 43:44 It works only on intervals. 43:45 That's annoying, because we're going 43:47 to have random variables that are defined beyond intervals. 43:51 So we need something that just goes beyond the intervals. 43:53 And you can imagine that if you let your functions be huge, 43:55 it's going to be very hard for you 43:57 to have a polynomial approximately [INAUDIBLE] well. 44:00 Things are going to start going up and down at the boundary, 44:02 and it's going to be very hard. 44:05 And again, as I said several times, 44:07 it doesn't tell us what d should be. 44:09 And as statisticians, we're looking for methods, not 44:11 like principles of existence of a method that exists. 44:15 So if E is discrete, I can actually 44:21 get a handle on this d. 44:23 If E is discrete and actually finite-- 44:26 I'm going to actually look at a finite E, 44:29 meaning that I have a PMF on, say, r possible values, x1 44:33 and xr. 44:34 My random variable, capital X, can 44:35 take only r possible values. 44:37 Let's think of them as being the integer numbers 1 to r. 44:41 That's the number of success out of r trials 44:44 that I get, for example. 44:46 Binomial rp, that's exactly something like this. 44:51 So now, clearly this entire distribution 44:55 is defined by the PMF, which gives me exactly r numbers. 45:00 So it can completely describe this distribution 45:02 with r numbers. 45:03 The question is, do I have an enormous amount of redundancy 45:08 if I try to describe this distribution using moments? 45:12 It might be that I need-- say, r is equal to 10, 45:14 maybe I have only 10 numbers to describe this thing, 45:18 but I actually need to compute moments up to the order of 100 45:20 before I actually recover entirely the distribution. 
45:23 Maybe I need to go infinite. 45:25 Maybe the Weierstrass Theorem is the only thing 45:27 that actually saves me here. 45:28 And I just cannot recover it exactly. 45:30 I can go to epsilon if I'm willing to go to higher 45:33 and higher polynomials. 45:34 Oh, by the way, in the Weierstrass Approximation 45:36 Theorem, I can promise you that as epsilon goes to 0, 45:39 d goes to infinity. 45:41 So now, really I don't even have actually r parameters. 45:46 I have only r minus 1 parameters, because the last one-- 45:50 because they sum up to 1. 45:51 So the last one I can always get by doing 45:53 1 minus the sum of the first r minus 1. 45:56 Agreed? 45:58 So each distribution on r numbers is described 46:01 by r minus 1 parameters. 46:04 The question is, can I use only r minus 1 moments 46:07 to describe this guy? 46:08 46:12 This is something called Gaussian quadrature. 46:16 The Gaussian quadrature tells you, yes, moments 46:18 are actually a good way to reparametrize your distribution 46:22 in the sense that if I give you the moments 46:24 or if I give you the probability mass function, 46:27 I'm basically giving you exactly the same information. 46:29 You can recover all the probabilities from there. 46:32 So here, I'm going to denote by-- 46:34 I'm going to drop the notation in theta. 46:37 I don't have theta. 46:38 Here, I'm talking about any generic distribution. 46:41 And so I'm going to call mk the k-th moment. 46:44 46:49 And since I have a PMF, this is really the sum for j 46:54 equals 1 to r of xj to the k-th times p of xj. 47:06 And this is the PMF. 47:10 So that's my k-th moment. 47:12 So the k-th moment is a linear combination of the numbers 47:15 that I am interested in. 47:16 47:19 So that's one equation. 47:22 47:25 And I have as many equations as I'm actually 47:27 willing to look at moments. 47:28 So if I'm looking at 25 moments, I have 25 equations. 47:34 m1 equals blah with this to the power of 1, 47:36 m2 equals blah with this to the power of 2, et cetera. 47:40 And then I also have the equation 47:41 that 1 is equal to the sum of the p of xj. 47:51 That's just the definition of PMF. 47:55 So this is r's. 47:56 They're ugly, but those are r's. 47:58 48:00 So now, this is a system of linear equations in p, 48:04 and I can actually write it in its canonical form, which 48:07 is of the form a matrix of those guys 48:11 times my parameters of interest is equal to a right hand side. 48:15 The right hand side is the moments. 48:17 That means, if I give you the moments, 48:20 can you come back and find what the PMF is, 48:22 because we know already from probability 48:24 that the PMF is all I need to know to fully describe 48:27 my distribution. 48:29 Given the moments, that's unclear. 48:32 Now, here, I'm going to actually take exactly r minus 1 moments 48:37 and this extra condition that the sum of those guys 48:39 should be 1. 48:41 So that gives me r equations based on r minus 1 moments. 48:45 And how many unknowns do I have? 48:47 Well, I have my r unknown parameters 48:54 for the PMF, the r values of the PMF. 48:59 Now, of course, this is going to play a huge role 49:02 in whether there are many p's that give me the same. 49:06 The goal is to find if there are several p's that can give me 49:09 the same moments. 49:10 But if there's only one p that can give me a set of moments, 49:13 that means that I have a one-to-one correspondence 49:15 between PMF and moments. 49:17 And so if you give me the moments, 49:18 I can just go back to the PMF. 49:23 Now, how do I go back?
49:23 Well, by inverting this matrix. 49:26 If I multiply this matrix by its inverse, 49:28 I'm going to get the identity times the vector of p's equal 49:32 the inverse of the matrix times the m's. 49:36 So what we want to do is to say that p 49:41 is equal to the inverse of this big matrix times the moments 49:45 that you give me. 49:47 And if I can actually talk about the inverse, 49:49 then I have basically a one-to-one mapping 49:52 between the m's, the moments, and the matrix. 49:55 So what I need to show is that this matrix is invertible. 49:58 And we just decided that the way to check 50:01 if a matrix is invertible is by computing its determinant. 50:05 Who has computed a determinant before? 50:10 Who was supposed to compute a determinant at least than just 50:12 to say, no, you know how to do it. 50:15 So you know how to compute determinants. 50:16 And if you've seen any determinant in class, 50:19 there's one that shows up in the exercises that professors love. 50:22 And it's called the Vandermonde determinant. 50:25 And it's the determinant of a matrix 50:26 that have a very specific form. 50:28 It looks like-- so there's basically only r parameters 50:31 to this r by r matrix. 50:33 The first row, or the first column-- sometimes, 50:36 it's presented like that-- 50:37 is this vector where each entry is to the power of 1. 50:41 And the second one is each entry is to the power of 2, 50:43 and to the power of 3, and to the power 4, et cetera. 50:46 So that's exactly what we have-- x1 to the first, x2 50:49 to the first, all the way to xr to the first, 50:51 and then same thing to the power of 2, 50:53 all the way to the last one. 50:54 And I also need to add the row of all 1's, which 50:58 you can think of those guys are to the power of 0, if you want. 51:01 So I should really put it on top, 51:02 if I wanted to have a nice ordering. 51:05 So that was the matrix that I had. 51:07 And I'm not asking you to check it. 51:09 You can prove that by induction actually, 51:10 typically by doing the usual let's eliminate 51:14 some rows and columns type of tricks 51:15 that you do for matrices. 51:17 So you basically start from the whole matrix. 51:19 And then you move onto a matrix that has only one 1's and then 51:21 0's here. 51:22 And then you have Vandermonde that's just slightly smaller. 51:25 And then you just iterate. 51:26 Yeah. 51:27 AUDIENCE: I feel like there's a loss to either the supra index, 51:31 or the sub index should have a k somewhere [INAUDIBLE].. 51:35 51:38 [INAUDIBLE] the one I'm talking about? 51:39 PHILIPPE RIGOLLET: Yeah, I know, but I 51:41 don't think the answer to your question is yes. 51:45 So k is the general index, right? 51:48 So there's no k. k does not exist. k just is here for me 51:51 to tell me for k equals 1 to r. 51:53 So this is an r by r matrix. 51:56 And so there is no k there. 51:58 So if you wanted the generic term, 52:00 if I wanted to put 1 in the middle on the j-th row and k-th 52:03 column, that would be x-- 52:07 so j-th row would be x sub k to the power of j. 52:13 That would be the-- 52:15 And so now, this is basically the sum-- 52:19 well, that should not be strictly-- 52:20 So that would be for j and k between 1 and r. 52:25 So this is the formula that get when 52:26 you try to expand this Vandermonde determinant. 52:29 You have to do it only once when you're a sophomore typically. 52:32 And then you can just go on Wikipedia to do it. 52:34 That's what I did. 52:34 I actually made a mistake copying it. 
52:36 The first one should be 1 less than or equal to j. 52:39 And the last one should be k less than or equal to r. 52:42 And now what you have is the product 52:43 of the differences of xj and xk. 52:45 52:47 And for this thing to be non-zero, 52:48 you need all the terms to be non-zero. 52:51 And for all the terms to be non-zero, 52:52 you need to have no xj and xk that are identical. 52:58 If all those are different numbers, 52:59 then this product is going to be different from 0. 53:03 And those are different numbers, because those 53:05 are r possible values that your random variable takes. 53:09 You're not going to say that it takes two 53:11 with probability 1.5-- 53:14 sorry, two with probability 0.5 and two with probability 0.25. 53:18 You're going to say it takes two with probability 0.75 directly. 53:22 So those xj's are different. 53:24 These are the different values that your random variable 53:27 can take. 53:28 53:32 Remember, xj, xk was just the different values x1 to xr-- 53:37 sorry-- was the different values that your random variable 53:39 can take. 53:41 Nobody in their right mind would write twice the same value 53:43 in this list. 53:44 53:47 So my Vandermonde is non-zero. 53:49 So I can invert it. 53:49 And I have a one-to-one correspondence 53:51 between my entire PMF and the first r minus 1 moments 53:55 to which I append the number 1, which is really 54:00 the moment of order 0 again. 54:02 It's E of X to the 0-th, which is 1. 54:05 So good news, I only need r minus 1 parameters 54:10 to describe r minus 1 parameters. 54:12 And I can choose either the values of my PMF. 54:14 Or I can choose the r minus 1 first moments. 54:16 54:20 So the moments tell me something. 54:22 Here, it tells me that if I have a discrete distribution 54:26 with r possible values, I only need 54:28 to compute r minus 1 moments. 54:30 So this is better than the Weierstrass Approximation 54:34 Theorem. 54:34 This tells me exactly how many moments I need to consider. 54:37 And this is for any distribution. 54:39 This is not a distribution that's 54:41 parametrized by one parameter, like the Poisson 54:43 or the binomial or all this stuff. 54:47 This is for any distribution on a finite number of values. 54:50 So hopefully, if I reduce the family of PMFs 54:53 that I'm looking at to a one-parameter family, 54:55 I'm actually going to need to compute much less than r 54:58 minus 1 values. 55:01 But this is actually hopeful. 55:02 It tells you that the method of moments 55:04 is going to work for any distribution. 55:06 You just have to invert a Vandermonde matrix. 55:09 55:13 So just the conclusion-- the statistical conclusion-- 55:17 is that moments contain important information 55:20 about the PMF and the PDF. 55:24 If we can estimate these moments accurately, 55:26 we can solve for the parameters of the distribution 55:30 and recover the distribution. 55:32 And in a parametric setting, where 55:34 knowing P theta amounts to knowing theta, which 55:36 is identifiability-- this is not innocuous-- 55:41 it is often the case that even less moments are needed. 55:44 After all, if theta is a one dimensional parameter, 55:46 I have one parameter to estimate. 55:48 Why would I go and get 25 moments 55:51 to get this one parameter? 55:52 Typically, there is actually-- we 55:54 will see that the method of moments 55:55 just says if you have a d dimensional parameter, 55:58 just compute d moments, and that's it. 56:02 But this is only on a case-by-case basis.
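A small numerical illustration of the inversion argument just made (the support points and PMF below are arbitrary illustrative choices, not from the lecture): build the system whose rows are the moments of orders 0 through r - 1 and solve it back for the PMF.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])     # the r = 4 distinct values the variable can take
p = np.array([0.1, 0.2, 0.3, 0.4])     # some PMF on those values (sums to 1)

# Row k of the system is sum_j x_j^k * p(x_j) = m_k, for k = 0, ..., r-1 (with m_0 = 1).
A = np.vander(x, increasing=True).T    # A[k, j] = x_j ** k, a transposed Vandermonde matrix
m = A @ p                              # moments of order 0, 1, ..., r-1

# The Vandermonde determinant is a product of differences of the x_j's,
# nonzero because the x_j are distinct, so the system can be solved back.
p_recovered = np.linalg.solve(A, m)
print(p_recovered)                     # [0.1 0.2 0.3 0.4]
print(np.linalg.cond(A))               # how much errors in the moments get amplified
```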
56:04 I mean, maybe your model will totally screw up its parameters 56:07 and you actually need to get them. 56:09 I mean, think about it, if the function is parameterized just 56:16 by its 27th moment-- 56:19 like, that's the only thing that matters in this distribution, 56:21 I just describe the function, it's just a density, 56:24 and the only thing that can change from one distribution 56:26 to another is this 27th moment-- 56:28 well, then you're going to have to go get the 27th moment. 56:30 And that probably means that your modeling step was actually 56:33 pretty bad. 56:34 56:37 So the rule of thumb, if theta is in Rd, we need d moments. 56:40 56:46 So what is the method of moments? 56:48 56:52 That's just a good old trick. 56:55 Replace the expectation by averages. 56:58 That's the beauty. 56:59 The moments are expectations. 57:02 So let's just replace the expectations by averages 57:04 and then do it with the average version, 57:07 as if it was the true one. 57:10 So for example, I'm going to talk about population moments, 57:14 when I'm computing them with the true distribution, 57:16 and I'm going to talk about them empirical moments, when 57:18 I talk about averages. 57:22 So those are the two quantities that I have. 57:24 And now, what I hope is that there is. 57:28 So this is basically-- 57:30 everything is here. 57:32 That's where all the money is. 57:33 I'm going to assume there's a function psi that maps 57:36 my parameters-- let's say they're in Rd-- 57:40 to the set of the first d moments. 57:42 57:45 Well, what I want to do is to come from this guy 57:48 back to theta. 57:49 So it better be that this function is-- 57:50 57:54 invertible. 57:55 I want this function to be invertible. 57:57 In the Vandermonde case, this function 57:59 with just a linear function-- multiply a matrix by theta. 58:03 Then inverting a linear function is inverting the matrix. 58:06 Then this is the same thing. 58:08 So now what I'm going to assume-- 58:09 and that's key for this method to work-- is that this theta-- 58:14 so this function psi is one to one. 58:16 There's only one theta that gets only one set of moments. 58:24 And so if it's one to one, I can talk about its inverse. 58:26 And so now, I'm going to be able to define theta 58:28 as the inverse of the moments-- 58:32 the reciprocal of the moments. 58:33 And so now, what I get is that the moment estimator is just 58:37 the thing where rather than taking the true guys in there, 58:42 I'm actually going to take the empirical moments in there. 58:44 58:48 Before we go any further, I'd like 58:50 to just go back and tell you that this is not 58:53 completely free. 58:56 How well-behaved your function psi 58:58 is going to play a huge role. 58:59 59:02 Can somebody tell me what the typical distance-- 59:05 if I have a sample of size n, what 59:06 is the typical distance between an average and the expectation? 59:10 59:12 What is the typical distance? 59:14 What is the order of magnitude as a function of n between xn 59:18 bar and its expectation. 59:23 AUDIENCE: 1 over square root of n. 59:25 PHILIPPE RIGOLLET: 1 over square root n. 59:25 That's what the central limit theorem tells us, right? 59:28 The central limit theorem tells us 59:29 that those things are basically a Gaussian, which 59:31 is of order of 1 divided by its square of n. 59:34 And so basically, I start with something 59:37 which is 1 over square root of n away from the true thing. 
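As a concrete instance of the recipe just defined (replace population moments by empirical moments, then apply psi inverse), here is a sketch with a model of my own choosing, the Gaussian with unknown mean and variance: psi(mu, sigma^2) = (m1, m2) = (mu, mu^2 + sigma^2), which is one-to-one and easy to invert.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n = 1.5, 4.0, 10_000          # illustrative true parameters and sample size
X = rng.normal(mu, np.sqrt(sigma2), size=n)

# Empirical moments: the good old trick, expectations replaced by averages.
m1_hat = X.mean()
m2_hat = (X ** 2).mean()

# Invert psi: mu = m1, sigma^2 = m2 - m1^2.
mu_hat, sigma2_hat = m1_hat, m2_hat - m1_hat ** 2
print(mu_hat, sigma2_hat)   # close to 1.5 and 4.0, up to O(1/sqrt(n)) fluctuations
```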
59:41 Now, if my function psi inverse is super steep like this-- 59:49 that's psi inverse-- then just small fluctuations, even 59:54 if they're of order 1 square root of n, 59:57 can translate into giant fluctuations in the y-axis. 60:04 And that's going to be controlled 60:06 by how steep psi inverse is, which is the same 60:09 as saying how flat psi is-- 60:14 how flat is psi. 60:15 So if you go back to this Vandermonde inverse, 60:20 what it's telling you is that if this inverse matrix blows up 60:26 this guy a lot-- 60:29 so if I start from a small fluctuation of this thing 60:32 and then they're blowing up by applying 60:34 the inverse of this matrix, things 60:36 are not going to go well. 60:37 Anybody knows what is the number that I should be looking for? 60:41 So that's from, say, numerical linear algebra 60:45 numerical methods. 60:47 When I have a system of linear equations, 60:49 what is the actual number I should 60:50 be looking at to know how much I'm 60:53 blowing up the fluctuations? 60:54 Yeah. 60:55 AUDIENCE: Condition number? 60:55 PHILIPPE RIGOLLET: The condition number, right. 60:57 So what's important here is the condition number 60:59 of this matrix. 61:00 If the condition number of this matrix is small, 61:03 then it's good. 61:04 It's not going to blow up much. 61:05 But if the condition number is very large, 61:07 it's just going to blow up a lot. 61:08 And the condition number is the ratio 61:10 of the largest and the smallest eigenvalues. 61:13 So you'll have to know what it is. 61:14 But this is how all these things get together. 61:17 So the numerical stability translates 61:21 into statistical stability here. 61:24 And numerical means just if I had 61:26 errors in measuring the right hand side, 61:28 how much would they translate into errors on the left hand 61:30 side. 61:31 So the error here is intrinsic to statistical questions. 61:33 61:38 So that's my estimator, provided that it exists. 61:42 And I said it's a one to one, so it should exist, 61:45 if I assume that psi is invertible. 61:48 So how good is this guy? 61:51 That's going to be definitely our question-- 61:53 how good is this thing. 61:54 And as I said, there's chances that if psi is really steep, 62:00 then it should be not very good-- 62:02 if psi inverse is very steep, it should not be very good, 62:06 which means that it's-- 62:07 well, let's just leave it to that. 62:11 So that means that I should probably 62:13 see the derivative of psi showing up somewhere. 62:16 If the derivative of psi inverse, say, is very large, 62:19 then I should actually have a larger variance 62:21 in my estimator. 62:22 So hopefully, just like we had a theorem that told us 62:26 that the Fisher information was key in the variance 62:29 of the maximum likelihood estimator, 62:30 we should have a theorem that tells us 62:32 that the derivative of psi inverse 62:33 is going to have a key role in the method of moments. 62:37 So let's do it. 62:38 62:57 So I'm going to talk to you about matrices. 63:01 So now, I have-- 63:02 63:10 So since I have to manipulate d numbers at any given time, 63:15 I'm just going to concatenate them into a vector. 63:17 So I'm going to call capital M theta-- 63:19 so that's basically the population moment. 63:24 And I have M hat, which is just m hat 1 to m hat d. 63:31 And that's my empirical moment. 63:32 63:39 And what's going to play a role is 63:41 what is the variance-covariance of the random vector. 63:45 So I have this vector 1-- 63:49 do I have 1? 63:50 No, I don't have 1. 
63:51 63:59 So that's a d dimensional vector. 64:02 And here, I take the successive powers. 64:04 Remember, that looks very much like a column of my Vandermonde 64:08 matrix. 64:10 So now, I have this random vector. 64:12 It's just the successive powers of some random variable X. 64:15 And the variance-covariance matrix is the expectation-- 64:19 so sigma-- 64:20 of theta. 64:21 The theta just means I'm going to take expectations 64:23 with respect to theta. 64:26 That's the expectation with respect 64:28 to theta of this guy times this guy 64:31 transpose minus the same thing but with the expectation 64:40 inside. 64:41 64:45 Why do I write it like this? 64:46 I have X, X squared, X cubed-- 64:48 64:50 so the expectation of X, X squared, up to X to the d, times the expectation of X, X squared, up to X to the d, transpose. 65:04 Everybody sees what this is? 65:05 So this is a matrix where if I look at the ij-th term of this 65:11 matrix-- 65:13 or let's say, jk-th term, so on row j and column k, 65:20 I have sigma jk of theta. 65:26 And it's simply the expectation of X to the j 65:30 plus k-- well, X to the j times X to the k-- minus the expectation of X to the j times the expectation 65:40 of X to the k. 65:41 65:45 So I can write this as m j plus k of theta minus mj of theta 65:53 times mk of theta. 65:55 66:00 So that's my covariance matrix of this particular vector 66:04 that I defined. 66:06 And now, I'm going to assume that psi inverse-- 66:09 well, if I want to talk about the slope 66:11 in an analytic fashion, I have to assume 66:14 that psi inverse is differentiable. 66:16 And I will talk about the gradient 66:18 of psi inverse, which is, if it's one dimensional, 66:20 just the derivative. 66:22 And here, that's where notation becomes annoying. 66:24 And I'm going to actually just assume 66:26 that-- so now I have a vector. 66:28 But it's a vector of functions, and I 66:30 want to compute those functions at a particular value. 66:32 And the value I'm actually interested in 66:34 is the M of theta parameter. 66:37 So psi inverse goes from the set of moments 66:41 to the set of parameters. 66:43 So when I look at the gradient of this guy, 66:45 it should be a function that takes moments as inputs. 66:48 And where do I want this function to be evaluated? 66:51 At the true moment-- 66:54 at the population moment vector. 66:58 Just like when I computed my Fisher information, 67:00 I was computing it at the true parameter. 67:04 So now, once I compute this guy-- 67:08 so now, why is this a d by d gradient matrix? 67:11 67:15 So I have a gradient vector when I have a function from Rd to R. 67:19 That's the vector of partial derivatives. 67:22 But now, I have a function from Rd to Rd. 67:25 So I have to take the derivative with respect 67:28 to the arrival coordinate and the departure coordinate. 67:32 67:35 And so that's the gradient matrix. 67:39 And now, I have the following properties. 67:41 The first one is that the law of large numbers 67:46 tells me that theta hat is a weakly or strongly consistent 67:52 estimator. 67:54 So either I use the strong law of large numbers 67:56 or the weak law of large numbers, 67:57 and I get strong or weak consistency. 68:01 So what does that mean? 68:02 Why is that true? 68:03 Well, because now I really have the function-- 68:12 so what is my estimator? 68:13 Theta hat is psi inverse of m hat 1 to m hat d. 68:23 Now, by the law of large numbers-- 68:26 let's look only at the weak one. 68:28 The law of large numbers tells me that each of the m hat j's 68:35 is going to converge in probability, as n 68:38 goes to infinity, to the-- so the empirical moments 68:40 converge to the population moments.
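As a quick sanity check on the identity sigma jk = m of j plus k minus mj times mk, here is a sketch that estimates Sigma from empirical moments and compares it with the empirical covariance of the vector of powers. The standard Gaussian sample, the choice d = 3, and the use of numpy are assumptions made only for illustration.

import numpy as np

# Sanity check of sigma_jk = m_{j+k} - m_j * m_k on a simulated sample.
# The N(0,1) sample and d = 3 are arbitrary choices for illustration.
rng = np.random.default_rng(1)
d = 3
x = rng.normal(size=100_000)

# Empirical moments m_1, ..., m_{2d}; entry m[i] stands for m_{i+1}.
m = np.array([np.mean(x**j) for j in range(1, 2 * d + 1)])

# Sigma[j, k] = m_{(j+1)+(k+1)} - m_{j+1} * m_{k+1}, in 0-based indexing.
Sigma = np.array([[m[j + k + 1] - m[j] * m[k] for k in range(d)] for j in range(d)])

# It matches the empirical covariance of the vector (X, X^2, ..., X^d).
powers = np.column_stack([x**j for j in range(1, d + 1)])
print(np.allclose(Sigma, np.cov(powers, rowvar=False, bias=True)))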
68:44 That's what the good old trick is using, 68:48 the fact that the empirical moments 68:49 are close to the true moments as n becomes larger. 68:52 And that's because, well, just because the m hat j's 68:55 are averages, and the law of large numbers 68:57 works for averages. 68:59 So now, on top of that, if I use the continuous mapping theorem-- 69:04 I have that psi inverse is continuously differentiable. 69:10 So it's definitely continuous. 69:12 And so what I have is that psi inverse of m hat 69:16 1 to m hat d converges to psi inverse of m1 to md, which 69:28 is equal to theta star. 69:33 So that's theta star. 69:35 By definition, we assumed that that was the unique one that 69:37 was actually doing this. 69:40 Again, this is a very strong assumption. 69:43 I mean, it's basically saying, if the method of moments works, 69:46 it works. 69:47 So the fact that psi is one to one 69:51 is really the key to making this guy work. 69:55 And then I also have a central limit theorem. 69:57 And the central limit theorem is basically 70:00 telling me that M hat is converging to M even 70:04 in the multivariate sense. 70:06 So if I look at the vector M hat and the true vector M, 70:09 then I actually make them go-- 70:11 I look at the difference, scaled by square root of n. 70:14 It goes to some Gaussian. 70:15 And usually, we would see-- if it was one dimensional, 70:18 we would see the variance. 70:19 Here, we see the variance-covariance matrix. 70:22 Who has never seen the-- well, nobody answers this question. 70:25 Who has already seen the multivariate central limit 70:28 theorem? 70:28 70:31 Who has never seen the multivariate central limit 70:33 theorem? 70:35 So the multivariate central limit theorem 70:37 is basically just a slight extension 70:41 of the univariate one. 70:43 It just says that if I want to think-- 70:48 so the univariate one would tell me something like this-- 70:51 71:05 square root of n times m hat j minus mj goes to some Gaussian with mean 0. 71:06 And then I would have basically the variance of X to the j-th. 71:18 So that's what the central limit theorem tells me. 71:22 This is an average. 71:23 71:29 So this is just for averages. 71:31 The central limit theorem tells me this. 71:33 Just think of X to the j-th as being Y. 71:36 And that would be true. 71:37 Everybody agrees with me? 71:40 So now, this is actually telling me 71:41 what's happening for all these guys individually. 71:45 But what happens when those guys start to correlate together? 71:48 I'd like to know if they actually correlate 71:51 the same way asymptotically. 71:53 And so if I actually looked at the covariance matrix 71:56 of this vector-- 71:57 72:03 so now, I need to look at a matrix which is d by d-- 72:07 then would those univariate central limit theorems 72:10 tell me-- 72:12 so let me write it like this, double bar. 72:16 So that's just the covariance matrix. 72:19 This notation, V double bar, is the variance-covariance matrix. 72:23 So what this thing tells me-- so I know this thing 72:26 is a matrix, d by d. 72:30 Those univariate central limit theorems only 72:31 give me information about the diagonal terms. 72:36 But here, I have no idea what the rest of the covariance matrix is. 72:40 This guy is telling me, for example, that this diagonal term is 72:46 like the variance of X to the j-th. 72:49 But what if I want to find off-diagonal elements 72:51 of this matrix? 72:53 Well, I need to use a multivariate central limit 72:55 theorem.
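Here is a small simulation sketch of that multivariate central limit theorem for empirical moments. The standard Gaussian model, d = 2, and the use of numpy are assumptions for illustration: the empirical covariance of square root of n times (M hat minus M), over many replications, should be close to Sigma of theta, which for this model is diag(1, 2).

import numpy as np

# Simulation sketch of the multivariate CLT for empirical moments.
# Assumed model: X ~ N(0,1) with d = 2, so M = (E[X], E[X^2]) = (0, 1) and
# Sigma has entries m_{j+k} - m_j m_k, i.e. [[1, 0], [0, 2]].
rng = np.random.default_rng(2)
d, n, reps = 2, 1_000, 5_000
M = np.array([0.0, 1.0])

errors = np.empty((reps, d))
for r in range(reps):
    x = rng.normal(size=n)
    M_hat = np.array([np.mean(x**j) for j in range(1, d + 1)])
    errors[r] = np.sqrt(n) * (M_hat - M)

# Empirical covariance of sqrt(n) (M_hat - M); should be close to [[1, 0], [0, 2]].
print(np.cov(errors, rowvar=False))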
72:56 And really what it's telling me is that you can actually 72:58 replace this guy here-- 73:00 73:10 so that goes in distribution to some normal with mean 0, again. 73:14 And now, what I have is just sigma 73:17 of theta, which is just the covariance matrix 73:22 of this vector X, X squared, X cubed, all the way to X to the d. 73:26 And that's it. 73:27 So that's a multivariate Gaussian. 73:28 Who has never seen a multivariate Gaussian? 73:33 Please, just go on Wikipedia or something. 73:35 There's not much to know about it. 73:37 But I don't have time to redo probability here. 73:40 So we're going to have to live with it. 73:43 Now, to be fair, if your goal is not 73:46 to become a statistical savant, we 73:48 will stick to univariate questions 73:52 in the scope of homework and exams. 74:01 So now, what was the delta method telling me? 74:09 It was telling me that if I had a central limit theorem that 74:13 told me that theta hat was going to theta, 74:16 or square root of n times theta hat minus theta 74:17 was going to some Gaussian, then I 74:19 could look at square root of n times g of theta hat minus g of theta. 74:23 And this thing was also going to a Gaussian. 74:25 But what had to be there is the square 74:27 of the derivative of g in the variance. 74:32 So the delta method, it was just a way 74:35 to go from square root of n times theta 74:38 hat n minus theta goes to some N, say (0, sigma squared), to-- 74:46 so the delta method was telling me that square root 74:50 of n times g of theta hat n minus g of theta 74:56 was going in distribution to N(0, sigma squared 75:01 g prime squared of theta). 75:04 75:07 That was the delta method. 75:09 Now, here, we have a function of those guys. 75:12 The central limit theorem, even the multivariate one, 75:15 is only guaranteeing something for me regarding the moments. 75:20 But now, I need to map the moments back into some theta, 75:23 so I have a function of the moments. 75:26 And there is something called the multivariate delta 75:31 method, where derivatives are replaced by gradients. 75:35 Like, they always are in multivariate calculus. 75:39 And rather than multiplying, since things do not commute, 75:43 rather than choosing which side I want to put the square on, 75:46 I'm actually just going to take half of the square on one side 75:49 and the other half of the square on the other side. 75:51 So the way you should view this, you 75:53 should think of sigma squared times g prime squared 75:59 as being g prime of theta times sigma 76:02 squared times g prime of theta. 76:06 And now, this is completely symmetric. 76:08 And the multivariate delta method 76:14 is basically telling you that you get the gradient here. 76:20 So you start from something that's 76:21 like that over there, a sigma-- 76:24 so that's my sigma squared, think of sigma squared. 76:26 And then I premultiply by the gradient and postmultiply 76:29 by the gradient. 76:30 The first one is transposed. 76:31 The second one is not. 76:33 But that's a very straightforward extension. 76:36 You don't even have to understand it. 76:37 Just think of what would be the natural generalization. 76:41 Here, by the way, I wrote explicitly 76:44 what the gradient of a multivariate function is. 76:48 So that's a function that goes from Rd to Rk. 76:53 So now, the gradient is a d by k matrix. 76:56 76:58 And so now, for this guy, we can do it 77:00 for the method of moments. 77:01 And we can see that basically we're 77:03 going to have this scaling that depends 77:04 on the gradient of the reciprocal of psi, which 77:08 is natural.
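Here is a one-dimensional sketch of this delta method applied to the method of moments, continuing the assumed Exponential(lambda) example from earlier (again an assumption, not the lecture's example): psi inverse is m -> 1/m, its derivative at m1 is minus lambda squared, and the variance of X is 1 over lambda squared, so the asymptotic variance works out to lambda squared.

import numpy as np

# One-dimensional delta method for the method of moments, on the assumed
# Exponential(lambda) example: psi^{-1}(m) = 1/m, (psi^{-1})'(m1) = -lambda^2,
# Var(X) = 1/lambda^2, so sqrt(n) (lambda_hat - lambda) -> N(0, lambda^2).
rng = np.random.default_rng(3)
lam, n, reps = 2.0, 2_000, 10_000

lam_hats = np.array([1.0 / rng.exponential(scale=1.0 / lam, size=n).mean()
                     for _ in range(reps)])

# Empirical variance of sqrt(n) (lambda_hat - lambda); should be close to lam**2 = 4.
print(np.var(np.sqrt(n) * (lam_hats - lam)))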
77:08 Because if psi is super steep-- if psi inverse is super steep, 77:13 then the gradient is going to be huge, 77:14 which is going to translate into having a huge variance 77:17 for the method of moments. 77:18 77:21 So this is actually the end. 77:24 I would like to encourage you-- and we'll probably do it 77:26 on Thursday just to start. 77:27 But I encourage you to do it in one dimension, 77:30 so that you know how to use the method of moments, 77:35 you know how to do a bunch of things. 77:37 Do it in one dimension and see how you can check those things. 77:40 So just as a quick comparison, in terms of the quadratic risk, 77:43 the maximum likelihood estimator is typically 77:46 more accurate than the method of moments. 77:50 What is pretty good to do is, when 77:51 you have a non-concave likelihood 77:54 function, what people like to do is 77:56 to start with the method of moments as an initialization 77:58 and then run some algorithm that locally optimizes 78:01 the likelihood starting from this point, 78:03 because it's actually likely to be closer. 78:05 And then the MLE is going to improve it 78:07 a little bit by pushing the likelihood a little higher. 78:12 So of course, the maximum likelihood 78:13 is sometimes intractable. 78:14 Whereas, computing moments is fairly doable. 78:18 If the likelihood is concave, as I said, 78:20 we can use optimization algorithms, 78:21 such as interior-point methods or gradient descent, 78:24 I guess, to maximize it. 78:25 And if the likelihood is non-concave, 78:26 we only have local heuristics. 78:28 And that's what I meant-- 78:29 you only have local maxima. 78:31 And one trick you can do-- 78:32 so your likelihood looks like this, 78:37 and it might be the case that if you have a lot of those peaks, 78:42 you basically have to start your algorithm in each 78:44 of those peaks. 78:46 But the method of moments can actually 78:48 start you in the right peak, and then you 78:50 just move up by doing some local algorithm 78:53 for maximum likelihood. 78:55 So that's not key. 78:56 But that's just if you want to think about, algorithmically, 78:59 how I would end up doing this and how I can combine the two. 79:03 So I'll see you on Thursday. 79:04 Thank you. 79:06
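As a rough sketch of the "method of moments as initialization, then local likelihood optimization" recipe mentioned at the end, here is one possible way to do it on an assumed Gamma(shape, scale) model; scipy, the Nelder-Mead optimizer, and the example itself are assumptions, not something from the lecture.

import numpy as np
from scipy import optimize, stats

# Sketch: method-of-moments start, then local maximization of the likelihood.
# The Gamma(shape=a, scale=s) model and all names here are assumptions for
# illustration, not the lecture's example.
rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=2.0, size=5_000)

# Method of moments: mean = a * s and variance = a * s^2, so
# a0 = mean^2 / var and s0 = var / mean.
mean, var = x.mean(), x.var()
a0, s0 = mean**2 / var, var / mean

def neg_log_lik(params):
    a, s = params
    if a <= 0 or s <= 0:
        return np.inf                       # keep the search in the valid region
    return -np.sum(stats.gamma.logpdf(x, a, scale=s))

# Local optimization of the likelihood, started at the method-of-moments point.
res = optimize.minimize(neg_log_lik, x0=[a0, s0], method="Nelder-Mead")
print((a0, s0), res.x)                      # moment start vs. locally refined MLE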