https://www.youtube.com/watch?v=k2inA31Gups&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=16 Subtitle transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high-quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:20 PHILIPPE RIGOLLET: So today, we're 00:21 going to close this chapter, this short chapter, 00:24 on Bayesian inference. 00:26 Again, this was just an overview of what you 00:28 can do in Bayesian inference. 00:32 And last time, we started defining 00:34 what's called Jeffreys priors. 00:36 Right? 00:36 So when you do Bayesian inference, 00:38 you have to introduce a prior on your parameter. 00:41 And we said that usually, it's something 00:43 that encodes your domain knowledge about where 00:45 the parameter could be. 00:47 But there's also some principled way to do it, 00:49 if you want to do Bayesian inference without really 00:51 having to think about it. 00:53 And for example, one of the natural priors 00:56 were those non-informative priors, right? 00:58 If you are on a compact set, it's 00:59 a uniform prior on this set. 01:01 If you're on an infinite set, you can still think of taking 01:04 the constant prior, 01:06 the one that's always equal to 1, 01:09 or proportional to 1. And that's an improper prior if you are on an infinite set. 01:14 And so another prior that you can think of, 01:17 in the case where you have a Fisher information which 01:20 is well-defined, is something called Jeffreys prior. 01:23 And this prior is a prior which is 01:25 proportional to the square root of the determinant of the Fisher 01:28 information matrix. 01:29 And if you're in one dimension, it's 01:31 basically proportional to the square root of the Fisher 01:37 information coefficient, whose inverse, we know, 01:40 is the asymptotic variance of the maximum likelihood 01:44 estimator. 01:45 And it turns out that, basically, 01:48 the square root of this thing is 01:50 one over the asymptotic standard deviation of the maximum likelihood 01:54 estimator. 01:55 And so you can compute this, right? 01:56 So you can compute it for the maximum likelihood estimator. 01:59 We know that the asymptotic variance is going 02:01 to be p times (1 minus p) in the Bernoulli 02:09 statistical experiment. 02:11 So you get one over the square root of this thing. 02:13 And for example, in the Gaussian setting, 02:16 you actually have that the Fisher information, 02:19 even in the multivariate one, is actually 02:22 going to be something like the identity matrix. 02:24 So this is proportional to 1. 02:25 It's the improper prior that you get, in this case, OK? 02:29 Meaning that, for the Gaussian setting, 02:31 no place where you center your Gaussian 02:33 is actually better than any other. 02:36 All right. 02:36 So we basically left off on this slide, 02:40 where we saw that Jeffreys priors satisfy 02:43 a reparametrization invariance-- they are invariant 02:46 by transformation of your parameter, which 02:49 is a desirable property. 02:51 And the way it works, it says that, well, if I have my prior on theta, 02:57 and then I suddenly decide that theta is not 02:59 the parameter I want to use to parameterize my problem, 03:01 actually what I want is phi of theta. 03:04 So think, for example, of theta being the mean of a Gaussian, 03:07 and phi of theta as being the mean cubed. 03:11 OK?
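To spell out the formulas just described (a short recap in LaTeX, writing $I(\theta)$ for the Fisher information; nothing here goes beyond what was said above):

\[
\pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)}.
\]

For the Bernoulli experiment, $I(p) = \frac{1}{p(1-p)}$, so

\[
\pi_J(p) \;\propto\; \frac{1}{\sqrt{p(1-p)}} \;=\; p^{-1/2}(1-p)^{-1/2},
\]

which is, up to constants, the Beta(1/2, 1/2) density. For a Gaussian model with known variance, the Fisher information does not depend on the mean $\theta$, so $\pi_J(\theta) \propto 1$: the flat, improper prior, under which no center is favored over any other.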
03:11 This is a one-to-one map phi, right? 03:15 So for example, if I want to go from theta to theta cubed, 03:20 and now I decide that this is the actual parameter that I 03:22 want, well, then it means that, on this parameter, 03:26 my original prior is going to induce another prior. 03:29 And here, it says, well, this prior 03:30 is actually also Jeffreys prior. 03:33 OK? 03:33 So it's essentially telling you that, 03:35 for this new parametrization, if you take Jeffreys prior, then 03:38 you actually go back to having exactly something that's 03:41 of the form's [INAUDIBLE] of determinant of the Fisher 03:43 information, but this thing with respect 03:45 to your new parametrization All right. 03:47 And so why is this true? 03:50 Well, it's just this change of variable theorem. 03:53 So it's essentially telling you that, if you call-- 03:58 let's call p-- well, let's go pi tilde of eta prior over eta. 04:08 And you have pi of theta as the prior 04:11 over theta, than since eta is of the form phi of theta, 04:18 just by change of variable, so that's essentially 04:26 a probability result. It says that pi tilde of eta 04:33 is equal to pi of eta times d pi of theta times d 04:42 theta over d eta and-- 04:48 04:55 sorry, is that the one? 04:57 Sorry, I'm going to have to write it, 04:58 because I always forget this. 04:59 05:05 So if I take a function-- 05:07 05:14 OK. 05:14 So what I want is to check. 05:16 05:38 OK, so I want the function of eta that I can here. 05:41 And what I know is that this is h of phi of theta. 05:48 All right? 05:48 So sorry, eta is phi of theta, right? 05:51 Yeah. 05:53 So what I'm going to do is I'm going 05:54 to do the change of variable, theta is phi inverse of eta. 06:09 So eta is phi of theta, which means 06:14 that d eta is equal to d-- 06:20 well, to phi prime of theta d theta. 06:26 So when I'm going to write this, I'm going to get integral of h. 06:31 Actually, let me write this, as I 06:33 am more comfortable writing this as e 06:36 with respect to eta of h of eta. 06:40 OK? 06:40 So that's just eta according to being drawn from the prior. 06:44 And I want to write this as the integral of he of eta times 06:47 some function, right? 06:49 So this is the integral of h of phi 06:58 of theta pi of theta d theta. 07:03 Now, I'm going to do my change of variable. 07:06 So this is going to be the integral of h of eta. 07:09 And then pi of phi of-- 07:16 so theta is phi inverse of eta. 07:20 And then d theta is phi prime of theta d theta, OK? 07:27 And so what is pi of phi theta? 07:30 So this thing is proportional. 07:32 So we're in, say, dimension 1, so it's 07:33 proportional of square root of the Fisher information. 07:38 And the Fisher information, we know, 07:39 is the expectation of the square of the derivative of the log 07:44 likelihood, right? 07:45 So this is square root of the expectation 07:48 of d over d theta of log of-- 08:03 well, now, I need the density. 08:06 Well, let's just call it l of theta. 08:10 And I want this to be taken at phi inverse of eta squared. 08:17 08:19 And then what I pick up is the-- 08:21 08:23 so I'm going to put everything under the square. 08:25 So I get phi prime of theta squared d theta. 08:31 OK? 08:33 So now, I have the expectation of a square. 08:35 This does not depend, so this is-- sorry, this is l of theta. 08:38 This is the expectation of l of theta of an x, right? 08:42 That's for some variable, and the expectation here 08:44 is with respect to x. 08:45 That's just the definition of the Fisher information. 
08:49 So now I'm going to squeeze this guy into the expectation. 08:52 It does not depend on x. 08:53 It just acts as a constant. 08:55 And so what I have now is that this is actually 08:57 proportional to the integral of h 08:59 eta times the square root of the expectation with respect 09:05 to x of what? 09:06 Well, here, I have d over d theta of log of theta. 09:10 And here, this guy is really d eta over d theta, right? 09:15 09:19 Agree? 09:21 So now, what I'm really left by-- so I get d over d theta 09:24 times d-- 09:25 sorry, times d theta over d eta. 09:28 09:42 so that's just d over d eta of log of eta x. 09:51 10:00 And then this guy is now becoming d eta, right? 10:04 OK, so this was a mess. 10:06 10:09 This is a complete mess, because I actually want to use phi. 10:12 I should not actually introduce phi at all. 10:14 I should just talk about d eta over d theta type of things. 10:21 And then that would actually make my life so much easier. 10:24 OK. 10:25 I'm not going to spend more time on this. 10:26 This is really just the idea, right? 10:28 You have square root of a square in there. 10:30 And then, when you do your change of variable, 10:31 you just pick up a square. 10:32 You just pick up something in here. 10:35 And so you just move this thing in there. 10:38 You get a square. 10:38 It goes inside the square. 10:40 And so your derivative of the log likelihood 10:42 with respect to theta becomes a derivative of the log 10:44 likelihood with respect to eta. 10:46 And that's the only thing that's happening here. 10:48 I'm just being super sloppy, for some reason. 10:52 OK. 10:54 And then, of course, now, what you're left with 10:56 is that this is really just proportional. 10:59 Well, this is actually equal. 11:00 Everything is proportional, but this 11:02 is equal to the Fisher information tilde with respect 11:05 to eta now. 11:07 Right? 11:07 You're doing this with respect to eta. 11:09 And so that's your new prior with respect to eta. 11:17 OK. 11:17 So one thing that you want to do, 11:21 once you have-- so remember, when you actually 11:23 compute your posterior rate, rather 11:26 than having-- so you start with a prior, 11:29 and you have some observations, let's say, x1 to xn. 11:32 11:36 When you do Bayesian inference, rather than spitting 11:41 out just some theta hat, which is an estimator for theta, 11:45 you actually spit out an entire posterior distribution-- 11:48 11:53 pi of theta, given x1 xn. 11:57 OK? 11:57 So there's an entire distribution 11:59 on the [INAUDIBLE] theta. 12:01 And you can actually use this to perform inference, rather 12:04 than just having one number. 12:06 OK? 12:06 And so you could actually build confidence regions 12:09 from this thing. 12:10 OK. 12:11 And so a Bayesian confidence interval-- 12:16 so if your set of parameters is included in the real line, 12:21 then you can actually-- it's not even guaranteed 12:23 to be to be an interval. 12:25 So let me call it a confidence region, so a Bayesian 12:33 confidence region, OK? 12:40 So it's just a random subspace. 12:43 So let's call it r, is included in theta. 12:47 And when you have the deterministic one, 12:49 we had a definition, which was with respect to the randomness 12:53 of the data, right? 12:54 That's how you actually had a random subset. 12:57 So you had a random confidence interval. 12:59 Here, it's actually conditioned on the data, 13:02 but with respect to the randomness 13:03 that you actually get from your posterior distribution. 13:06 OK? 
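To recap the change-of-variable argument from a moment ago in one clean chain (a sketch in the lecture's notation, in dimension one, with $\eta = \phi(\theta)$, $\theta = \phi^{-1}(\eta)$, and $\tilde\pi$ the prior induced on $\eta$):

\[
\tilde\pi(\eta) \;=\; \pi\big(\phi^{-1}(\eta)\big)\left|\frac{d\theta}{d\eta}\right|
\;\propto\; \sqrt{\mathbb{E}_x\!\left[\Big(\tfrac{\partial}{\partial\theta}\log L(\theta;x)\Big)^{2}\right]}\;\left|\frac{d\theta}{d\eta}\right|
\;=\; \sqrt{\mathbb{E}_x\!\left[\Big(\tfrac{\partial}{\partial\theta}\log L(\theta;x)\,\tfrac{d\theta}{d\eta}\Big)^{2}\right]}
\;=\; \sqrt{\mathbb{E}_x\!\left[\Big(\tfrac{\partial}{\partial\eta}\log \tilde L(\eta;x)\Big)^{2}\right]}
\;=\; \sqrt{\tilde I(\eta)}.
\]

So the prior that Jeffreys prior on $\theta$ induces on $\eta$ is again proportional to the square root of the Fisher information, now computed in the $\eta$-parametrization: that is the invariance property.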
13:07 So such that the probability that your theta 13:16 belongs to this confidence region, 13:18 given x1 xn is, say, at least 1 minus alpha. 13:24 Let's just take it equal to 1 minus alpha. 13:27 OK, so that's a confidence region at level 1 minus alpha. 13:34 OK, so that's one way. 13:36 So why would you actually-- 13:38 when I actually implement Bayesian inference, 13:41 I'm actually spitting out that entire distribution. 13:44 I need to summarize this thing to communicate it, right? 13:47 I cannot just say this is this entire function. 13:49 I want to know where are the regions 13:51 of high probability, where my parameter is supposed to be? 13:54 And so here, when I have this thing, what I actually 13:56 want to have is something that says, 13:58 well, I want to summarize this thing 14:00 into some subset of the real line, in which I'm 14:03 sure that the area under the curve, here, of my posterior 14:08 is actually 1 minus alpha. 14:11 And there's many ways to do this, right? 14:13 14:16 So one way to do this is to look at level sets. 14:22 14:27 And so rather than actually-- so let's 14:29 say my posterior looks like this. 14:32 I know, for example, if I have a Gaussian distribution, 14:35 I can actually take my posterior to be-- my posterior is 14:38 actually going to be Gaussian. 14:39 14:43 And what I can do is to try to cut it here on the y-axis 14:50 so that now, the area under the curve, when I cut here, 14:54 is actually 1 minus alpha. 14:59 OK, so I have some threshold tau. 15:02 If tau goes to plus infinity, then I'm 15:05 going to have that this area under the curve 15:07 here is going to-- 15:10 15:18 AUDIENCE: [INAUDIBLE] 15:19 PHILIPPE RIGOLLET: Well, no. 15:21 So the area under the curve, when 15:23 tau is going to plus infinity, think 15:24 of when tau is just right here. 15:27 AUDIENCE: [INAUDIBLE] 15:29 PHILIPPE RIGOLLET: So this is actually going to 0, right? 15:32 And so I start here. 15:33 And then I start going down and down and down and down, 15:36 until the area actually gets up to 1 minus 15:39 alpha. 15:40 And if tau is going down to 0, then my area under the curve 15:44 is going to-- 15:44 15:48 if tau is here, I'm cutting nowhere. 15:51 And so I'm getting 1, right? 15:52 15:56 Agree? 15:56 Think of, when tau is very close to 0, 16:00 I'm cutting very far down here. 16:02 And so I'm getting some area under the curve, 16:04 which is almost everything. 16:06 And so it's going to 1 as tau goes down to 0. 16:08 Yeah? 16:09 AUDIENCE: Does this only work for [INAUDIBLE] 16:12 PHILIPPE RIGOLLET: No, it does not. 16:14 I mean-- so this is a picture. 16:17 So those two things work for all of them, right? 16:20 But when you have a bimodal one, actually, 16:22 this is actually when things start 16:23 to become interesting, right? 16:24 So when we built a frequentist confidence interval, 16:30 it was always of the form x bar plus or minus something. 16:34 But now, if I start to have a posterior that 16:36 looks like this, when I start cutting it off, 16:40 I'm going to have two-- 16:41 I mean, my confidence region is going 16:44 to be the union of those two things, right? 16:47 And it really reflects the fact that there 16:50 is this bimodal thing. 16:51 It's going to say, well, with high probability, 16:53 I'm actually going to be either here or here. 16:56 Now, the meaning here of a Bayesian confidence region 16:59 and of a frequentist confidence interval are completely distinct notions, 17:02 right?
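As a concrete illustration of the level-set construction just described, here is a small numerical sketch (assuming numpy and scipy are available; the grid, the bimodal example, and the function name hpd_region are illustrative choices, not from the lecture). It lowers the threshold tau on a discretized posterior until the region where the density exceeds tau captures mass 1 minus alpha; on a bimodal posterior the region comes out as a union of two intervals, as discussed.

import numpy as np
from scipy import stats

def hpd_region(grid, density, alpha=0.05):
    """Level-set (highest posterior density) region on an equally spaced grid.

    Keeps the highest-density grid points until their total mass reaches
    1 - alpha, which is the same as lowering the threshold tau from the top.
    Returns a boolean mask over the grid.
    """
    dx = grid[1] - grid[0]
    mass = density * dx                          # approximate posterior probabilities
    order = np.argsort(density)[::-1]            # highest-density points first
    k = np.searchsorted(np.cumsum(mass[order]), 1 - alpha) + 1
    mask = np.zeros(grid.shape, dtype=bool)
    mask[order[:k]] = True
    return mask

# Example: a bimodal posterior, a mixture of two Gaussians
grid = np.linspace(-6.0, 6.0, 2001)
dx = grid[1] - grid[0]
post = 0.5 * stats.norm.pdf(grid, -2, 0.7) + 0.5 * stats.norm.pdf(grid, 2, 0.7)
post /= (post * dx).sum()                        # normalize on the grid

mask = hpd_region(grid, post, alpha=0.05)
print("mass of the region:", (post[mask] * dx).sum())   # about 0.95
# The region is a union of two intervals, one around -2 and one around +2.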
17:03 And I'm going to work out an example with you 17:06 so that we can actually see that sometimes-- 17:08 I mean, both of them, actually you 17:10 can come up with some crazy paradoxes. 17:11 So since we don't have that much time, 17:13 I will actually talk to you about why, in some instances, 17:17 it's actually a good idea to think of Bayesian confidence 17:19 intervals rather than frequentist ones. 17:22 So before we go into more details about what 17:25 those Bayesian confidence intervals are, 17:27 let's remind ourselves what it 17:29 means to have a frequentist confidence interval. 17:33 Right? 17:33 17:46 OK. 17:46 So when I have a frequentist confidence interval, 17:49 let's say something like x bar n minus 1.96 sigma over root n 17:59 and x bar n plus 1.96 sigma over root n, 18:06 so that's the confidence interval 18:07 that you get for the mean of some Gaussian 18:10 with known variance equal to sigma squared, OK. 18:16 So what we know is that the meaning of this 18:18 is the probability that theta belongs 18:20 to this is equal to 95%, right? 18:25 And this, more generally, you can 18:27 think of being q alpha over 2. 18:29 And what you're going to get is 1 minus alpha here, OK? 18:33 So what does it mean here? 18:34 Well, it looks very much like what we have here, 18:37 except that we're not conditioning on x1 xn. 18:39 And we should not. 18:40 Because there was a question like that in the midterm-- 18:43 if I condition on x1 xn, this probability is either 0 or 1. 18:47 OK? 18:48 Because once I condition-- so here, 18:50 this probability, actually, here is with respect 18:52 to the randomness in x1 xn. 18:55 So if I condition-- 18:56 18:58 so let's build this thing, r freq, for frequentist. 19:04 19:07 Well, given x1 xn-- 19:11 and actually, I don't need to know x1 xn really. 19:13 What I need to know is what xn bar is. 19:16 Well, this thing now is what? 19:18 It's 1, if theta is in r, and it's 0, 19:22 if theta is not in r, right? 19:27 That's all there is. 19:28 This is a deterministic confidence interval, 19:29 once I condition on x1 xn. 19:32 So I have a number. 19:33 The average is maybe 3. 19:35 And so I get 3. 19:36 Either theta is between 3 minus 0.5 and 3 plus 0.5, 19:41 or it's not. 19:42 And so there's basically-- 19:44 I mean, I write it as a probability, 19:45 but it's really not a probabilistic statement. 19:47 Either it's true or not. 19:49 Agreed? 19:50 So what does it mean to have a frequentist confidence 19:52 interval? 19:53 It means that if I were-- 19:55 and here is where the word frequentist comes from-- 19:58 it says that if I repeat this experiment over and over, 20:02 meaning that on Monday, I collect a sample of size n, 20:06 and I build a confidence interval, 20:09 and then on Tuesday, I collect another sample of size n, 20:12 and I build a confidence interval, 20:13 and on Wednesday, I do this again and again, what's going 20:17 to happen is the following. 20:18 I'm going to have my true theta that lives here. 20:21 And then on Monday, this is the confidence interval 20:23 that I build. 20:25 OK, so this is the real line. 20:28 The true theta is here, and this is the confidence interval 20:31 I build on Monday. 20:32 All right? 20:32 So x bar was here, and this is my confidence interval. 20:37 On Tuesday, I build this confidence interval maybe. 20:41 x bar was closer to theta, but smaller. 20:44 But then on Wednesday, I build this confidence interval. 20:49 I'm not here. 20:50 It's not in there. 20:51 And that's this case. 20:53 Right?
20:54 It happens that it's just not in there. 20:56 And then on Thursday, I build another one. 20:57 I almost miss it, but I'm in there, et cetera. 21:01 Maybe here, Here, I miss again. 21:04 And so what it means to have a confidence interval-- so what 21:07 does it mean to have a confidence interval at 95%? 21:12 AUDIENCE: [INAUDIBLE] 21:15 PHILIPPE RIGOLLET: Yeah, so it means that if I repeat this 21:18 the frequency of times-- 21:19 hence, the word frequentist-- at which 21:21 I'm actually going to overlap that, 21:24 I'm actually going to contain theta, should be 95%. 21:26 That's what frequentist means. 21:28 So it's just a matter of trusting that. 21:31 So on one given thing, one given realization of your data, 21:35 it's not telling you anything. 21:36 [INAUDIBLE] it's there or not. 21:38 So it's not really something that's actually 21:42 something that assesses the confidence of your decision, 21:46 such as data is in there or not. 21:48 It's something that assesses the confidence 21:50 you have in the method that you're using. 21:52 If you were you repeat it over and again, 21:54 it'd be the same thing. 21:56 It would be 95% of the time correct, right? 21:58 So for example, we know that we could build a test. 22:02 So it's pretty clear that you can actually 22:04 build a test for whether theta is equal to theta naught 22:09 or not equal to theta naught, by just 22:10 checking whether theta naught is in a confidence interval 22:13 or not. 22:13 And what it means is that, if you actually 22:15 are doing those tests at 5%, that means that 5% of the time, 22:21 if you do this over and again, 5% of the time 22:23 you're going to be wrong. 22:24 I mentioned my wife does market research. 22:27 And she does maybe, I don't know, 100,000 tests a year. 22:31 And if they do all of them at 1%, 22:34 then it means that 1% of the time, which is a lot of time, 22:37 right? 22:38 When you do 100,000 a year, it's 1,000 of them 22:40 are actually wrong. 22:41 OK, I mean, she's actually hedging 22:44 against the fact that 1% of them that are going to be wrong. 22:47 That's 1,000 of them that are going to be wrong. 22:49 Just like, if you do this 100,000 times at 95%, 22:52 5,000 of those guys are actually not going 22:54 to be the correct ones. 22:56 OK? 22:56 So I mean, it's kind of scary. 22:58 But that's the way it is. 23:01 So that's with the frequentist interpretation of this is. 23:03 Now, as I mentioned, when we started this Bayesian chapter, 23:07 I said, Bayesian statistics converge to-- 23:10 I mean, Bayesian decisions and Bayesian methods converge 23:14 to frequentist methods. 23:16 When the sample size is large enough, 23:18 they lead to the same decisions. 23:20 And in general, they need not be the same, 23:22 but they tend to actually, when the sample 23:24 size is large enough, to have the same behavior. 23:27 Think about, for example, the posterior 23:30 that you have when you have in the Gaussian case, right? 23:34 We said that, in the Gaussian case, 23:36 what you're going to see is that it's 23:38 as if you had an extra observation which 23:40 was essentially given by your prior. 23:43 OK? 23:44 And now, what's going to happen is that, when this just one 23:50 observation among n plus 1, it's really 23:53 going to be totally drawn, and you 23:55 won't see it when the sample size grows larger. 23:58 So Bayesian methods are particularly useful when 24:00 you have a small sample size. 24:02 And when you have a small sample size, the effect of the prior 24:05 is going to be bigger. 
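Here is a tiny simulation of the "repeat the experiment on Monday, Tuesday, Wednesday" picture described above (a sketch, assuming numpy; the true theta, sigma, n, and the number of repetitions are arbitrary choices). It rebuilds the interval x bar plus or minus 1.96 sigma over root n on fresh data many times and checks how often the interval contains the true theta; the frequency should come out close to 95%, which is exactly the frequentist guarantee and nothing more.

import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, reps = 3.0, 1.0, 20, 100_000

hits = 0
for _ in range(reps):
    x = rng.normal(theta, sigma, size=n)      # a fresh sample, like a new day
    xbar = x.mean()
    half = 1.96 * sigma / np.sqrt(n)
    hits += (xbar - half <= theta <= xbar + half)

print("fraction of intervals containing theta:", hits / reps)   # about 0.95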
24:06 But most importantly, you're not going 24:08 to have to repeat this thing over and again. 24:10 And you're going to have a meaning. 24:11 You're going to have something 24:13 that has a meaning for this particular data set 24:15 that you have. 24:16 When I said that the probability that theta belongs to r-- 24:19 and here, I'm going to specify the fact that it's a Bayesian 24:22 confidence region, like this one-- 24:24 this is actually conditionally on the data 24:27 that you've collected. 24:29 It says, given this data, given the points that you have-- 24:32 just put in some numbers, if you want, in there-- 24:34 it's actually telling you the probability 24:36 that theta belongs to this Bayesian thing, 24:39 to this Bayesian confidence region. 24:41 Here, since I have conditioned on x1 xn, 24:44 this probability is really just with respect to theta 24:46 drawn from the posterior, right? 24:51 And so now, it has a slightly different meaning. 24:54 It's just telling you that when-- 24:57 it's really making a statement about where 24:59 the regions of high probability of your posterior are. 25:03 Now, why is that useful? 25:05 Well, there's actually an interesting story that 25:11 goes behind Bayesian methods. 25:13 Does anybody know the story of the USS-- I think it's the Scorpion? 25:17 Do you know the story? 25:18 So that was an American vessel that disappeared. 25:22 I think it was close to Bermuda or something. 25:25 But you can tell the story of the Malaysian Airlines, 25:28 except that I don't think it's such a successful story. 25:31 But the idea was essentially, we're 25:33 trying to find where this thing happened. 25:36 And of course, this is a one-time thing. 25:39 You need something that works once. 25:41 You need something that works for this particular vessel. 25:44 And you don't care, if you go to the Navy, and you tell them, 25:46 well, here's a method. 25:48 And for 95 out of 100 vessels that you're going to lose, 25:51 we're going to be able to find it. 25:53 And they want this to work for this particular one. 25:57 And so they were looking, and they were 25:59 diving in different places. 26:02 And suddenly, they brought in this guy. 26:04 I forget his name. 26:05 I mean, there's a whole story about this on Wikipedia. 26:08 And he started collecting the data 26:10 that they had from different dives and maybe from currents. 26:13 And he started to put everything in. 26:14 And he said, OK, what is the posterior distribution 26:17 of the location of the vessel, given all the things 26:21 that I've seen? 26:22 And what have you seen? 26:23 Well, you've seen that it's not here, it's not there, 26:25 and it's not there. 26:26 And you've also seen that the currents were going that way, 26:29 and the winds were going that way. 26:30 And you can actually put in some modeling 26:32 to try to understand this. 26:33 Now, given this, for this particular data that you have, 26:37 you can actually think of having a two-dimensional density that 26:41 tells you where it's more likely that the vessel is. 26:44 And where are you going to be looking? 26:46 Well, if it's a multimodal distribution, 26:48 you're just going to go to the highest mode first, 26:50 because that's where it's the most likely to be. 26:52 And maybe it's not there, so you're just 26:53 going to update your posterior, based on the fact 26:55 that it's not there, and do it again. 26:56 And actually, after two dives, I think, 26:59 he actually found the thing.
27:01 And that's exactly where Bayesian statistics 27:03 start to kick in. 27:03 Because you put a lot of knowledge into your model, 27:08 but you also can actually factor in a bunch of information, 27:11 right? 27:11 The model-- he had to build a model 27:13 that was actually taking into account the currents and the winds. 27:17 And what you can have as a guarantee is that, 27:20 when you talk about the probability 27:22 that this vessel is in this location, 27:27 given what you've observed in the past, 27:28 it actually makes some sense. 27:30 Whereas, if you were to use a frequentist approach, 27:34 then there's no probability. 27:35 Either it's underneath this position or it's not, right? 27:38 So that's actually where it starts to make sense. 27:41 And so you can actually build this. 27:43 And there are actually a lot of methods 27:44 for search that 27:47 are based on Bayesian methods. 27:48 I think, for example, the Higgs boson 27:50 was based on a lot of Bayesian methods, 27:51 because this is something you need to find [INAUDIBLE], 27:54 right? 27:54 I mean, there was a lot of prior that had to be built in. 27:57 OK. 27:57 So now, you build this confidence interval. 27:59 And the nicest way to do it is to use level sets. 28:02 But again, just like for Gaussians-- I mean, 28:05 even in the Gaussian case, I decided 28:12 to go with x bar plus or minus something, 28:16 but I could go with something that's completely asymmetric. 28:19 So what's happening is that here, this method 28:21 guarantees that you're going to have the narrowest 28:23 possible confidence intervals. 28:24 That's essentially what it's telling you, OK? 28:27 Because every time I'm choosing a point, starting from here, 28:31 I'm actually putting as much area under the curve as I can. 28:36 All right. 28:38 So those are called Bayesian confidence intervals. 28:41 Oh yeah, and I promised you that we're 28:43 going to work on some example that actually 28:46 gives a meaning to what I just told you, with actual numbers. 28:50 So this is something that's taken from Wasserman's book. 28:56 And also, it's coming from a paper, 29:01 from a stats paper, from Wolpert and I 29:03 don't know who, from the '80s. 29:05 And essentially, this is how it works. 29:07 So assume that you have n equals 2 observations. 29:10 29:14 And you have y1, so those observations are y1-- 29:18 no, sorry, let's call them x1, which 29:20 is theta plus epsilon 1, and x2, which is theta plus epsilon 2, 29:26 where epsilon 1 and epsilon 2 are iid. 29:31 And the probability that epsilon i is equal 29:33 to plus 1 is equal to the probability 29:35 that epsilon i is equal to minus 1 is equal to 1/2. 29:38 OK, so it's just the uniform sign plus minus 1, OK? 29:44 Now, let's think about this. So you're trying 29:46 to do some inference on theta. 29:47 Maybe you actually want to find some inference on theta 29:50 that's actually based on-- 29:51 and that's based only on the x1 and x2. 29:55 OK? 29:56 So I'm going to actually build a confidence interval. 29:58 But what I really want to build is a-- 30:01 30:03 but let's start thinking about how 30:05 I would find an estimator for those two things. 30:07 Well, what values am I going to be getting, right? 30:09 So I'm going to get either theta plus 1 or theta minus 1. 30:13 And actually, I can get basically four 30:15 different observations, right? 30:19 Sorry, four different pairs of observations-- 30:21 30:30 theta plus 1 with theta plus 1, theta minus 1 with theta minus 1, theta plus 1 with theta minus 1, and theta minus 1 with theta plus 1. 30:32 Agreed?
30:33 Those are the four possible observations that I can get. 30:37 Agreed? 30:38 Either they're both equal to plus 1, both equal to minus 1, 30:42 or one of the two is equal to plus 30:44 1, the other one to minus 1, or the epsilons. 30:46 OK. 30:47 So those are the four observations I can get. 30:49 So in particular, if they take the same value, 30:56 and you know it's either theta plus 1 or theta minus 1, 30:59 and if they take a different value, I know one of them 31:02 is theta plus 1, and one is actually theta minus 1. 31:04 So in particular, if I take the average of those two guys, when 31:07 they take different values, I know I'm actually 31:09 getting theta right. 31:10 So let's build a confidence region. 31:14 OK, so I'm actually going to take a confidence region, which 31:16 is just a singleton. 31:18 31:21 And I'm going to say the following. 31:23 Well, if x1 is equal to x2, I'm just going to take x1 minus 1, 31:32 OK? 31:33 So I'm just saying, well, I'm never 31:34 going to able to resolve whether it's plus 1 or minus 1 31:37 that actually gives me the best one, 31:38 so I'm just going to take a dive and say, well, it's 31:41 just plus 1. 31:42 OK? 31:44 And then, if they're different, then here, 31:47 I can do much better. 31:50 I'm going to actually just think the average. 31:52 31:56 OK? 31:58 Now, what I claim is that this is a confidence region-- 32:08 and by default, when I don't mention it, 32:10 this is a frequentist confidence region-- 32:16 at level 75%. 32:18 32:21 OK? 32:21 So let's just check that. 32:23 To check that this is correct, I need 32:24 to check that the probability under the realization of x1 32:27 and x2, that theta belongs, is one of those two guys, 32:30 is actually equal to 0.75. 32:33 Yes? 32:33 AUDIENCE: What are the [INAUDIBLE] 32:36 PHILIPPE RIGOLLET: Well, it's just the frequentist confidence 32:39 interval that does not need to be an interval. 32:41 Actually, in this case, it's going to be an interval. 32:44 But that's just what it means. 32:46 Yeah, region for Bayesian was just because-- 32:50 I mean, the confidence intervals, 32:51 when we're frequentist, we tend to make them 32:53 intervals, because we want-- 32:54 but when you're Bayesian, and you're doing this level set 32:56 thing, you cannot really guarantee, 32:58 unless its [INAUDIBLE] is going to be an interval. 33:00 So region is just a way to not have to say interval, 33:02 in case it's not. 33:03 33:06 OK. 33:06 So I have this thing. 33:08 So what I need to check is the probability that theta 33:11 is in one of those two things, right? 33:13 So what I need to find is the probability that theta 33:16 is an [INAUDIBLE] Well, x1 minus 1 and x1 is not equal to x2. 33:24 And those are disjoint events, so it's plus the probability 33:26 that theta is in x1 plus x2 over 2 and x1-- 33:35 sorry, that's equal. 33:37 That's different. 33:39 OK. 33:40 And OK, just before we actually finish the computation, 33:42 why do I have 75%? 33:44 75% is 3/4. 33:46 So it means that we have four cases. 33:48 And essentially, I did not account for one case. 33:52 And it's true. 33:52 I did not account for this case, when 33:56 the both of the epsilon i's are equal to minus 1. 34:01 Right? 34:01 So this is essentially the one I'm not going 34:03 to be able to account for. 34:04 And so we'll see that in a second. 34:06 So in this case, we know that everything goes great. 34:09 Right? 34:09 So in this case, this is-- 34:11 OK. 34:11 Well, let's just start from the first line. 
34:13 So the first line is the probability 34:15 that theta is equal to x1 minus 1 and those two are equal. 34:20 So this is the probability that theta is equal to-- 34:28 well, this is theta plus epsilon 1 minus 1. 34:36 And epsilon 1 is equal to epsilon 2, right? 34:43 Because I can remove the theta from here, 34:45 and I can actually remove the theta from here, 34:47 so that this guy here is just epsilon 1 is equal to 1. 34:50 So when I intersect with this guy, 34:52 it's actually the same thing as epsilon 1 is equal to 1, 34:54 as well-- 34:56 epsilon 2 is equal to 1, as well, OK? 34:59 So this first thing is actually equal to the probability 35:05 that epsilon 1 is equal to 1 and epsilon 2 is equal to 1, 35:10 which is equal to what? 35:14 AUDIENCE: [INAUDIBLE] 35:15 PHILIPPE RIGOLLET: Yeah, 1/4, right? 35:17 So that's just the first case over there. 35:19 They're independent. 35:21 Now, I still need to do the second one. 35:23 So this case is what? 35:24 Well, x1 plus x2 over 2 35:28 is what? 35:29 Well, I get theta plus epsilon 1 plus epsilon 2 over 2. 35:31 So that's just equal to the probability 35:33 that epsilon 1 plus epsilon 2 over 2 is equal to 0 35:39 and epsilon 1 is different from epsilon 2. 35:43 Agreed? 35:44 35:46 I just removed the thetas from these equations, because I can. 35:49 They're just on both sides every time. 35:51 35:54 OK. 35:55 And so that means what? 35:56 That means that the second part-- so this thing 35:58 is actually equal to 1/4 plus the probability 36:02 that epsilon 1 plus epsilon 2 over 2 is equal to 0. 36:05 I can remove the 2. 36:06 So this is just the probability that one is 1, 36:08 and the other one is minus 1, right? 36:10 So that's equal to the probability 36:12 that epsilon 1 is equal to 1 and epsilon 2 is equal to minus 1 36:17 plus the probability that epsilon 1 is equal to minus 1 36:21 and epsilon 2 is equal to plus 1, OK? 36:24 Because they're disjoint events. 36:25 So I can break them into the sum of the two. 36:28 And each of those guys is also one of the atomic parts of it. 36:32 It's one of the basic things. 36:33 And so each of those guys has probability 1/4. 36:36 And so here, we can really see that we accounted 36:38 for everything, except for the case when epsilon 1 was equal 36:41 to minus 1, and epsilon 2 was equal to minus 1. 36:44 So this is 1/4. 36:45 This is 1/4. 36:46 So the whole thing is equal to 3/4. 36:49 So now, what we have is that the probability that epsilon 1 36:56 is in-- 36:57 so the probability that theta belongs to this confidence 37:03 region is equal to 3/4. 37:06 And that's very nice. 37:07 But the thing is some people are sort of-- 37:09 I mean, it's not super nice to be able to see this, 37:12 because, in a way, I know that, if I observe x1 and x2 that 37:17 are different, I know for sure that theta, 37:24 that I actually got the right theta, right? 37:25 That this confidence interval is actually 37:27 happening with probability 1. 37:31 And the problem is that I do not know-- 37:34 I cannot make this precise with the notion of frequentist 37:37 confidence intervals. 37:39 OK? 37:39 Because frequentist confidence intervals 37:41 have to account for the fact that, in the future, 37:43 it might not be the case that x1 and x2 are different. 37:47 So Bayesian confidence regions, by definition-- 37:53 well, they're all gone-- 37:54 but they are conditioned on the data that I have. 37:57 And so that's what I want.
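A quick numerical check of the 3/4 that was just computed (a sketch, assuming numpy; the value of theta and the number of repetitions are arbitrary choices). It simulates x1 = theta + epsilon 1 and x2 = theta + epsilon 2 with each epsilon equal to plus or minus 1 with probability 1/2, builds the region from the board (x1 minus 1 when x1 = x2, the average when they differ), and counts how often the region contains theta.

import numpy as np

rng = np.random.default_rng(1)
theta, reps = 6, 100_000

hits = 0
for _ in range(reps):
    eps1, eps2 = rng.choice([-1, 1], size=2)     # each +/- 1 with probability 1/2
    x1, x2 = theta + eps1, theta + eps2
    guess = x1 - 1 if x1 == x2 else (x1 + x2) / 2
    hits += (guess == theta)

print("frequentist coverage:", hits / reps)      # about 0.75
# Conditionally on x1 != x2 the guess is always exactly theta; every miss
# comes from the case epsilon 1 = epsilon 2 = -1, which has probability 1/4.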
37:58 I want to be able to make this statement conditionally 38:00 and the data that I have. 38:02 OK. 38:03 So if I want to be able to make this statement, 38:06 if I want to build a Bayesian confidence region, 38:08 I'm going to have to put a prior on theta. 38:10 So without loss of generality-- 38:12 I mean, maybe with-- but let's assume 38:16 that pi is a prior on theta. 38:25 And let's assume that pi of j is strictly positive 38:31 for all integers j equal, say, 0-- 38:35 well, actually, for all j in the integers, positive or negative. 38:42 OK. 38:43 So that's a pretty weak assumption on my prior. 38:46 I'm just assuming that theta is some integer. 38:52 And now, let's build our Bayesian confidence region. 38:57 Well, if I want to build a Bayesian confidence region, 38:59 I need to understand what my posterior is going to be. 39:01 OK? 39:02 And if I want to understand what my posterior is going to be, 39:04 I actually need to build a likelihood, right? 39:11 So we know that it's the product of the likelihood 39:16 and of the prior divided by-- 39:20 OK. 39:21 39:31 So what is my likelihood? 39:32 So my likelihood is the probability 39:35 of x1 x2, given theta. 39:40 Right? 39:41 That's what the likelihood should be. 39:45 And now let's say that actually, just 39:49 to make things a little simpler, let 39:51 us assume that x1 is equal to, I don't know, 5, 40:07 and x2 is equal to 7. 40:11 OK? 40:12 So I'm not going to take the case where they're actually 40:16 equal to each other, because I know that, in this case, 40:19 x1 and x2 are different. 40:20 I know I'm going to actually nail exactly what theta is, 40:23 by looking at the average of those guys, right? 40:26 Here, it must be that theta is equal to 6. 40:30 So what I want is to compute the likelihood at 5 and 7, OK? 40:34 40:38 And what is this likelihood? 40:42 Well, if theta is equal to 6, that's 40:53 just the probability that I will observe 5 and 7, right? 41:00 So what is the probability I observe 5 and 7? 41:01 41:04 Yeah? 41:05 1? 41:06 AUDIENCE: 1/4. 41:08 PHILIPPE RIGOLLET: That's 1/4, right? 41:10 As the probability, I have minus 1 for the first epsilon 1, 41:15 right? 41:15 So this is infinity 6. 41:17 This is the probability that epsilon 1 is equal to minus 1, 41:23 and epsilon 2 is equal to plus 1, which is equal to 1/4. 41:28 So this probability is 1/4. 41:31 If theta is different from 6, what is this probability? 41:35 So if theta is different from 6, since we 41:37 know that we've only loaded the integers-- 41:41 so if theta has to be another integer, 41:46 what is the probability that I see 5 and 7? 41:49 AUDIENCE: 0. 41:49 PHILIPPE RIGOLLET: 0. 41:50 41:53 So that's my likelihood. 41:55 And if I want to know what my posterior is, 42:00 well, it's just pi of theta times 42:03 p of 5/6, given theta, divided by the sum over all T's, say, 42:10 in Z. Right? 42:11 So now, I just need to normalize this thing. 42:14 So of pi of T, p of 4/6, given T. Agreed? 42:21 42:24 That's just the definition of the posterior. 42:27 But when I sum these guys, there's 42:30 only one that counts, because, for those things, 42:34 we know that this is actually equal to 0 for every T, 42:38 except for when T is equal to 6. 42:41 So this entire sum here is actually 42:45 equal to pi of 6 times p of 5/6-- 42:54 sorry, 5/7, of 5/7, given that theta 43:03 is equal to 6, which we know is equal to 1/4. 43:08 And I did not tell you what pi of 6 was. 43:10 43:16 But it's the same thing here. 
43:18 The posterior for any theta that's not 6 43:21 is actually going to be-- this guy's going to be equal to 0. 43:23 So I really don't care what this guy is. 43:26 So what it means is that my posterior becomes what? 43:29 43:33 It becomes the posterior pi of theta, 43:40 given 5 and 7 is equal to-- well, when theta is not 43:46 equal to 6, this is actually 0. 43:49 So regardless of what I do here, I get something which is 0. 43:52 43:55 And if theta is equal to 6, what I get 43:58 is pi of 6 times p of 5/7, given 6, 44:02 which I've just computed here, which is 1/4 divided 44:05 by pi of 6 times 1/4. 44:08 So it's the ratio of two things that are identical. 44:10 So I get 1. 44:13 So now, my posterior tells me that, given 44:16 that I observe 5 and 7, theta has 44:22 to be 1 with probability-- has to be 6 with probability 1. 44:27 So now, I say that this thing here-- so now, this 44:32 is not something that actually makes 44:34 sense when I talk about frequentist confidence 44:37 intervals. 44:38 They don't really make sense, to talk about confidence 44:40 intervals, given something. 44:42 And so now, given that I observe 5 and 7, 44:44 I know that the probability of theta is equal to 1. 44:46 And in this sense, the Bayesian confidence interval 44:50 is actually more meaningful. 44:54 So one thing I want to actually say about this Bayesian 44:56 confidence interval is that it's-- 44:58 45:01 I mean, here, it's equal to the value 1, right? 45:03 So it really encompasses the thing that we want. 45:05 But the fact that we actually computed 45:06 it using the Bayesian posterior and the Bayesian rule 45:09 did not really matter for this argument. 45:10 All I just said was that it had a prior. 45:12 But just what I want to illustrate 45:15 is the fact that we can actually give a meaning 45:17 to the probability that theta is equal to 6, 45:21 given that I see 5 and 7. 45:23 Whereas, we cannot really in the other cases. 45:26 And we don't have to be particularly 45:28 precise in the prior and theta to be able to give theta this-- 45:31 to give this meaning. 45:32 OK? 45:35 All right. 45:36 45:38 So now, as I said, I think the main power of Bayesian 45:43 inference is that it spits out the posterior distribution, 45:45 and not just the single number, like frequentists 45:48 would give you. 45:50 Then we can say decorate, or theta hat, or point estimate, 45:55 with maybe some confidence interval. 45:56 Maybe we can do a bunch of tests. 45:58 But at the end of the day, we just have, 46:01 essentially, one number, right? 46:02 Then maybe we can understand where 46:04 the fluctuations of this number are in a frequentist setup. 46:07 but the Bayesian framework is essentially 46:11 giving you a natural method. 46:13 And you can interpret it in terms of the probabilities that 46:15 are associated to the prior. 46:17 But you can actually also try to make some-- 46:21 so a Bayesian, if you give me any prior, 46:25 you're going to actually build an estimator from this prior, 46:29 maybe from the posterior. 46:30 And maybe it's going to have some frequentist properties. 46:32 And that's what's really nice about [? Bayesians, ?] is 46:35 that you can actually try to give 46:36 some frequentist properties of Bayesian methods, that 46:39 are built using Bayesian methodology. 46:42 But you cannot really go the other way around. 46:44 If I give you a frequency methodology, 46:46 how are you going to say something about the fact 46:48 that there's a prior going on, et cetera? 
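Here is the same posterior computation done numerically for the example above (a sketch, assuming numpy; the particular prior below, a symmetric geometric-type prior on a truncated range of integers, is just one arbitrary choice of a prior with pi(j) > 0 for every integer j). With x1 = 5 and x2 = 7 observed, only theta = 6 has nonzero likelihood, so the posterior puts mass 1 on 6 no matter which such prior you pick.

import numpy as np

x1, x2 = 5, 7
support = np.arange(-50, 51)                 # the integers, truncated for the computation
prior = 0.5 ** np.abs(support)               # any prior positive on all integers works the same way
prior = prior / prior.sum()

def likelihood(theta):
    # P(X1 = x1, X2 = x2 | theta), with Xi = theta + epsilon_i and epsilon_i = +/- 1 w.p. 1/2
    return 0.25 if abs(x1 - theta) == 1 and abs(x2 - theta) == 1 else 0.0

lik = np.array([likelihood(t) for t in support])
posterior = prior * lik
posterior = posterior / posterior.sum()

print("posterior mass at theta = 6:", posterior[support == 6][0])   # 1.0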
46:51 And so this is actually one of the things 46:53 there's actually some research that's going on for this. 46:55 They call it Bayesian posterior concentration. 46:58 And one of the things-- so there's something 46:59 called the Bernstein-von Mises theorem. 47:01 And those are a class of theorems, 47:03 and those are essentially methods that tell you, well, 47:06 if I actually run a Bayesian method, 47:10 and I look at the posterior that I get-- 47:12 it's going to be something like this-- 47:14 but now, I try to study this in a frequentist point of view, 47:16 there's actually a true parameter of theta 47:18 somewhere, the true one. 47:20 There's no prior for this guy. 47:21 This is just one fixed number. 47:23 Is it true that as my sample size is 47:25 going to go to infinity, then this thing is going 47:27 to concentrate around theta? 47:29 And the rate of concentration of this thing, 47:31 the size of this width, the standard deviation 47:35 of this thing, is something that should decay maybe 47:38 like 1 over square root of n, or something like this. 47:40 And the rate of posterior concentration, 47:43 when you characterize it, it's called the Bernstein-von Mises 47:45 theorem. 47:46 And so people are looking at this 47:47 in some non-parametric cases. 47:49 You can do it in pretty much everything 47:51 we've been doing before. 47:52 You can do it for non-parametric regression estimation 47:55 or density estimation. 47:56 You can do it for, of course-- you 47:58 can do it for sparse estimation, if you want. 48:01 OK. 48:01 So you can actually compute the procedure and-- 48:04 48:08 yeah. 48:09 And so you can think of it as being just a method somehow. 48:12 Now, the estimator I'm talking about-- so 48:14 that's just a general Bayesian posterior concentration. 48:18 But you can also try to understand 48:20 what is the property of something that's 48:22 extracted from this posterior. 48:24 And one thing that we actually describe 48:26 was, for example, well, given this guy, 48:28 maybe it's a good idea to think about what 48:30 the mean of this thing is, right? 48:32 So there's going to be some theta hat, 48:35 which is just the integral of theta pi theta, given x1 xn-- 48:41 so that's my posterior-- 48:43 d theta. 48:44 Right? 48:44 So that's the posterior mean. 48:46 That's the expected value with respect 48:48 to the posterior distribution. 48:50 And I want to know how does this thing behave, 48:53 how close it is to a true theta if I actually 48:56 am in a frequency setup. 48:58 So that's the posterior mean. 48:59 49:04 But this is not the only thing I can actually spit out, right? 49:08 This is definitely uniquely defined. 49:09 If you give me a distribution, I can actually 49:13 spit out its posterior mean. 49:15 But I can also think of the posterior median. 49:17 49:21 But now, if this is not continuous, 49:23 you might have some uncertainty. 49:24 Maybe the median is not uniquely defined, 49:26 and so maybe that's not something you use as much. 49:29 Maybe you can actually talk about the posterior mode. 49:31 49:35 All right, so for example, if you're posterior density looks 49:38 like this, then maybe you just want 49:40 to summarize your posterior with this number. 49:43 So clearly, in this case, it's not such a good idea, 49:46 because you completely forget about this mode. 49:48 But maybe that's what you want to do. 49:49 Maybe you want to focus on the most peak mode. 49:53 And this is actually called maximum a posteriori. 
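To make these three summaries concrete, here is a small sketch that computes the posterior mean, the posterior median, and the maximum a posteriori for a Beta posterior, the kind that shows up in the Bernoulli example below (assuming scipy; the numbers a = 1/2, n = 10, and seven observed 1s are arbitrary choices).

from scipy import stats
from scipy.optimize import minimize_scalar

# Posterior Beta(a + sum(x), a + n - sum(x)) for Bernoulli data with a Beta(a, a) prior
a, n, s = 0.5, 10, 7                       # a = 1/2 is Jeffreys prior; s = number of 1s
post = stats.beta(a + s, a + n - s)

post_mean = post.mean()                    # equals (a + s) / (2a + n)
post_median = post.median()
post_map = minimize_scalar(lambda p: -post.pdf(p),        # maximize the posterior density
                           bounds=(1e-6, 1 - 1e-6), method="bounded").x

print("posterior mean  :", post_mean)      # 7.5 / 11, about 0.68
print("posterior median:", post_median)    # between the mean and the mode here
print("posterior MAP   :", post_map)       # (a + s - 1) / (2a + n - 2) = 6.5 / 9, about 0.72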
49:58 As I said, maybe you want a sample 49:59 from this posterior distribution. 50:03 OK, and so in all these cases, these Bayesian estimators 50:06 will depend on the prior distribution. 50:09 And the hope is that, as the sample size grows, 50:11 you won't see that again. 50:14 OK. 50:14 So to conclude, let's just do a couple of experiments. 50:20 So if I look at-- 50:22 50:25 did we do this? 50:26 Yes. 50:26 So for example, so let's focus on the posterior mean. 50:30 50:34 And we know-- so remember in experiment one-- 50:45 [INAUDIBLE] example one, what we had 50:48 was x1 xn that were [? iid, ?] Bernoulli p, 50:56 and the prior I put on p was a beta with parameter aa. 51:06 OK? 51:07 And if I go back to what we computed, 51:09 you can actually compute the posterior of this thing. 51:12 And we know that it's actually going to be-- 51:15 sorry, that was uniform? 51:17 Where is-- yeah. 51:18 So what we get is that the posterior, this thing 51:31 is actually going to be a beta with parameter 51:36 a plus the sum, so a plus the number of 1s 51:42 and a plus the number of 0s. 51:44 51:48 OK? 51:49 And the beta was just something that looked like-- 51:53 51:56 the density was p to the a minus 1, 1 minus p. 52:00 52:05 OK? 52:05 So if I want to understand the posterior mean, 52:11 I need to be able to compute the expectation of a beta, 52:13 and then maybe plug in a for a plus 52:16 this guy and minus this guy. 52:17 OK. 52:18 So actually, let me do this. 52:21 OK. 52:22 So what is the expectation? 52:23 52:26 So what I want is something that looks 52:27 like the integral between 0 and 1 of p times a minus 1-- 52:34 sorry, p times p a minus 1, 1 minus p, b minus 1. 52:42 Do we agree that this-- 52:43 and then there's a normalizing constant. 52:46 Let's call it c. 52:49 OK? 52:49 52:53 So this is what I need to compute. 52:56 So that's c of a and b. 52:57 53:00 Do we agree that this is the posterior 53:01 mean with respect to a beta with parameters a and b? 53:08 Right? 53:09 I just integrate p against the density. 53:13 So what does this thing look like? 53:14 Well, I can actually move this guy in here. 53:18 And here, I'm going to have a plus 1 minus 1. 53:23 OK? 53:26 So the problem is that this thing is actually-- 53:29 the constant is going to play a big role, right? 53:31 Because this is essentially equal 53:33 to c a plus 1b divided by c ab, where 53:40 ca plus 1b is just the normalizing 53:42 constant of a beta a plus 1 b. 53:46 So I need to know the ratio of those two constants. 53:48 53:58 And this is not something-- 53:59 I mean, this is just a calculus exercise. 54:01 So in this case, what you get is-- 54:06 sorry. 54:08 In this case, you get-- 54:09 54:12 well, OK, so we get essentially a divided by, 54:34 I think, it's a plus b. 54:37 Yeah, it's a plus b. 54:38 54:41 So that's this quantity. 54:43 54:47 OK? 54:47 54:51 And when I plug in a to be this guy and b to be this guy, what 54:56 I get is a plus sum of the xi. 55:02 And then I get a plus this guy, a plus n minus this guy. 55:06 So those two guys go away, and I'm 55:07 left with 2a plus n, which does not work. 55:14 No, that actually works. 55:15 And so now what I do, I can actually divide and get 55:18 this thing, over there. 55:19 OK. 55:20 So what you can see, the reason why this thing has been divided 55:23 is that you can really see that, as n goes to infinity, 55:27 then this thing behaves like xn bar, which 55:30 is our frequentist estimator. 55:31 The effect of a is actually going away. 
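Writing $B(a,b) = \int_0^1 p^{a-1}(1-p)^{b-1}\,dp$ for the normalizing constant of the Beta density, the board computation above reads

\[
\mathbb{E}[p] \;=\; \frac{1}{B(a,b)}\int_0^1 p\cdot p^{a-1}(1-p)^{b-1}\,dp
\;=\; \frac{B(a+1,b)}{B(a,b)}
\;=\; \frac{a}{a+b},
\]

using $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ and $\Gamma(a+1) = a\,\Gamma(a)$ for the last equality. Plugging in $a \mapsto a + \sum_{i=1}^n x_i$ and $b \mapsto a + n - \sum_{i=1}^n x_i$ gives the posterior mean

\[
\frac{a + \sum_{i=1}^n x_i}{2a + n},
\]

which behaves like $\bar X_n$ as $n \to \infty$, as stated above.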
55:34 The effect of the prior, which is completely captured by a, 55:37 is going away as n goes to infinity. 55:40 Is there any question? 55:42 55:47 You guys have a question. 55:48 What is it? 55:50 Do you have a question? 55:51 AUDIENCE: Yeah, on the board, is that divided 55:53 by some [INAUDIBLE] stuff? 55:56 PHILIPPE RIGOLLET: Is that divided by what? 55:58 AUDIENCE: That a over a plus b, and then you just expanded-- 56:00 PHILIPPE RIGOLLET: Oh yeah, yeah, 56:01 then I said that this is equal to this, right. 56:05 So that's for a becomes a plus sum of the xi's, and b becomes 56:15 a plus n minus sum of the xi's. 56:20 OK. 56:20 So that's just for the posterior one. 56:22 AUDIENCE: What's [INAUDIBLE] 56:26 PHILIPPE RIGOLLET: This guy? 56:27 AUDIENCE: Yeah. 56:28 PHILIPPE RIGOLLET: 2a. 56:28 AUDIENCE: 2a. 56:29 Oh, OK. 56:30 PHILIPPE RIGOLLET: Right. 56:31 So I get a plus a plus n. 56:34 And then those two guys cancel. 56:37 OK? 56:38 And that's what you have here. 56:41 So for a is equal to 1/2-- 56:44 and I claim that this is Jeffreys prior. 56:47 Because remember, Jeffreys prior 56:53 was proportional to one over the square root of p times 1 minus 56:56 p, which I can write as p to the minus 1/2 times 1 minus p to the minus 1/2. 57:01 Since the Beta(a, a) density has exponents a minus 1, that's just the case a is equal to 1/2. 57:03 OK. 57:04 So if I use Jeffreys prior, I just plug in a equals to 1/2, 57:07 and this is what I get. 57:10 OK? 57:12 So those things are going to have an impact when 57:14 n is not too large. 57:16 For large n, those things, whether you take Jeffreys prior 57:19 or you take whatever a you prefer, 57:20 it's going to have no impact whatsoever. 57:23 But if n is of the order of 10, maybe, 57:26 then you're going to start to see some impact, 57:28 depending on what a you want to pick. 57:30 57:33 OK. 57:34 And then in the second example, well, here we actually 57:38 computed the posterior to be this guy. 57:42 Well, here, I can just read off what the expectation is, right? 57:45 I mean, I don't have to actually compute 57:47 the expectation of a Gaussian. 57:48 It's just xn bar. 57:50 And so in this case, there's actually no-- 57:52 I mean, when I have a non-informative prior 57:57 for a Gaussian, then I have basically xn bar. 58:01 As you can see, actually, this is an interesting example. 58:04 When I actually look at the posterior, 58:06 it's not something that cost me a lot to communicate to you, 58:09 right? 58:10 There's one symbol here, one symbol here, and one symbol 58:12 here. 58:13 I tell you the posterior is a Gaussian with mean xn bar 58:17 and variance 1/n. 58:19 When I actually turn that into a posterior mean, 58:23 I'm dropping all this information. 58:26 I'm just giving you the first parameter. 58:27 So you can see there's actually much more information 58:30 in the posterior than there is in the posterior mean. 58:35 The posterior mean is just a point. 58:37 It's not telling me how confident I am in this point. 58:39 And this thing is actually very interesting. 58:41 OK. 58:42 So you can talk about the posterior variance 58:44 that's associated to it, right? 58:45 You can talk about, as an output, 58:47 you could give the posterior mean and posterior variance. 58:49 And those things are actually interesting. 58:53 All right. 58:53 So I think this is it. 58:56 So as I said, in general, just like in this case, 59:05 the impact of the prior is being washed away 59:07 as the sample size goes to infinity. 59:10 Just like here, there's no impact of the prior.
59:12 It was a non-informative one. 59:14 But if you actually had an informative one, cf. 59:17 the homework-- yeah? 59:18 AUDIENCE: [INAUDIBLE] 59:19 PHILIPPE RIGOLLET: Yeah, so, cf. the homework, 59:21 you would actually see an impact of the prior, which, 59:23 again, would be washed away as your sample size increases. 59:25 Here, it goes away. 59:26 You just get xn bar. 59:29 And actually, in these cases, you 59:31 see that the posterior distribution converges 59:35 to-- sorry, the Bayesian estimator 59:37 is asymptotically normal. 59:39 This is different from the distribution of the posterior, 59:43 right? 59:43 This is just the posterior mean, which happens 59:45 to be asymptotically normal. 59:47 But the posterior may not have a-- 59:49 I mean, here, the posterior is a beta, right? 59:53 I mean, it's not normal. 59:55 OK, so there's different-- those things 59:57 are two different things. 59:59 Your question? 60:01 AUDIENCE: What was the prior [INAUDIBLE] 60:04 PHILIPPE RIGOLLET: All 1, right? 60:05 That was the improper prior. 60:06 AUDIENCE: OK. 60:08 And so that would give you the same thing as [INAUDIBLE], not 60:12 just the proportion. 60:13 PHILIPPE RIGOLLET: Well, I mean, yeah. 60:15 So it's essentially telling you that-- 60:17 so we said that, when you have a non-informative prior, 60:23 essentially, the maximum likelihood is the maximum 60:25 a posteriori, right? 60:26 But in this case, there's so much symmetry 60:28 that it just so happens that the posterior 60:30 is completely symmetric around its maximum. 60:32 So it means that the expectation is equal to the maximum, 60:34 to the arg max. 60:35 60:40 Yeah? 60:41 AUDIENCE: I read somewhere that one 60:43 of the issues with Bayesian methods 60:45 is that if we choose the wrong prior, 60:46 it could mess up your results. 60:49 PHILIPPE RIGOLLET: Yeah, but hence, 60:51 do not pick the wrong prior. 60:53 I mean, of course, it would. 60:55 I mean, it would mess up your res-- of course. 60:57 I mean, you're putting extra information. 60:58 But you could say the same thing by saying, 61:00 well, the issue with frequentist methods 61:03 is that, if you mess up the choice of your likelihood, 61:06 then it's going to mess up your output. 61:09 So here, you just have two chances of messing it up, 61:11 right? 61:12 You have the-- well, it's gone. 61:14 So you have the product of the likelihood and the prior, 61:17 and you have one more chance to-- 61:20 but it's true, if you assume that the model is 61:22 right, then, of course, finding the wrong prior could 61:25 completely mess up things if your prior, for example, 61:28 has no support on the true parameter. 61:30 But if your prior has a positive weight on the true parameter, 61:34 as n goes to infinity-- 61:38 I mean, OK, I cannot speak for all counterexamples 61:40 in the world. 61:41 But I'm sure, under minor technical conditions, 61:44 you can guarantee that your posterior 61:46 mean is going to converge to what 61:48 you need it to converge to. 61:49 61:53 Any other question? 61:54 61:57 All right. 61:58 So I think this closes the more traditional mathematical-- not 62:07 mathematical, but traditional statistics part of this class. 62:11 And from here on, we'll talk about more multivariate 62:14 statistics, starting with principal component analysis. 62:17 So that's more like when you have multivariate data. 62:19 We started, in a way, to talk about multivariate statistics 62:22 when we talked about multivariate regression.
62:25 But we'll move on to principal component analysis. 62:28 I'll talk a bit about multiple testing. 62:30 I haven't made up my mind yet about what 62:32 we'll really talk about in December. 62:34 But I want to make sure that you have 62:36 a taste and a flavor of what is interesting in statistics 62:41 these days, especially as you go towards more machine 62:44 learning type of questions, where really, the focus is 62:46 on prediction rather than the modeling itself. 62:48 We'll talk about logistic regression, 62:50 as well, for example, which is a generalized 62:52 linear model, which is just the generalization of regression to the case 62:55 where y does not take values in the whole real line, but maybe in 0, 1, 63:00 for example. 63:03 All right. 63:03 Thanks.