https://www.youtube.com/watch?v=bFZ-0FH5hfs&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=15


Transcript


00:00
The following content is provided under a Creative
00:02
Commons license.
00:03
Your support will help MIT OpenCourseWare
00:06
continue to offer high quality educational resources for free.
00:10
To make a donation or to view additional materials
00:12
from hundreds of MIT courses, visit
00:15
MIT OpenCourseWare at ocw.mit.edu.
00:21
PHILIPPE RIGOLLET: So today we'll actually just do a brief
00:26
chapter on Bayesian statistics.
00:28
And there's entire courses on Bayesian statistics,
00:31
there's entire books on Bayesian statistics,
00:33
there's entire careers in Bayesian statistics.
00:36
So admittedly, I'm not going to be
00:39
able to do it justice and tell you
00:40
all the interesting things that are happening
00:42
in Bayesian statistics.
00:44
But I think it's important as a statistician
00:47
to know what it is, how it works,
00:49
because it's actually a weapon of choice
00:52
for many practitioners.
00:55
And because it allows them to incorporate their knowledge
00:58
about a problem in a fairly systematic manner.
01:00
So if you look at, say, the Bayesian statistics literature,
01:04
it's huge.
01:05
And so here I give you sort of a range
01:09
of what you can expect to see in Bayesian statistics
01:12
from your second edition of a traditional book, something
01:18
that involves computation, some things that
01:20
involve risk thinking.
01:22
And there's a lot of Bayesian thinking.
01:24
There are a lot of things that, you know,
01:26
talk about the philosophy of thinking
01:29
Bayesian.
01:30
This book, for example, seems to be one of them.
01:32
This book is definitely one of them.
01:34
This one represents sort of a wide, a broad literature
01:38
on Bayesian statistics, for applications for example,
01:42
in social sciences.
01:43
But even in large scale machine learning,
01:45
there's a lot of Bayesian statistics happening,
01:47
in particular using something called Bayesian nonparametrics,
01:50
or hierarchical Bayesian modeling.
01:53
So we do have some experts at MIT in CSAIL.
01:59
Tamara Broderick for example, is a person
02:02
who does quite a bit of interesting work
02:04
on Bayesian nonparametrics.
02:06
And if that's something you want to know more about,
02:08
I urge you to go and talk to her.
02:10
So before we go into more advanced things,
02:14
we need to start with what is the Bayesian approach.
02:17
What do Bayesians do, and how is it
02:19
different from what we've been doing so far?
02:22
So to understand the difference between Bayesians
02:26
and what we've been doing so far,
02:28
we need to first put a name on what we've been doing so far.
02:31
It's called frequentist statistics.
02:32
You usually hear Bayesian versus frequentist statistics;
02:36
by versus I don't mean that they are naturally
02:38
in opposition to each other.
02:40
Actually, often you will see the same method that
02:43
comes out of both approaches.
02:45
So let's see how we did it, right.
02:46
The first thing, we had data.
02:48
We observed some data.
02:50
And we assumed that this data was generated randomly.
02:52
The reason we did that is because this
02:54
would allow us to leverage tools from probability.
02:57
So let's say by nature, measurements, you do a survey,
03:01
you get some data.
03:03
Then we made some assumptions on the data generating process.
03:06
For example, we assumed they were iid.
03:07
That was one of the recurring things.
03:09
Sometimes we assumed it was Gaussian,
03:11
if you wanted to use, say, a t-test.
03:13
Maybe we did some nonparametric statistics.
03:15
We assumed it was a smooth function or maybe
03:18
a linear regression function.
03:20
So that was our modeling.
03:21
And this was basically a way to say, well,
03:24
we're not going to allow for any distributions for the data
03:28
that we have.
03:29
But maybe a small set of distributions
03:31
that is indexed by a small number of parameters, for example.
03:34
Or at least remove some of the possibilities.
03:38
Otherwise, there's nothing we can learn.
03:41
And so for example, this was associated
03:45
to some parameter of interest, say theta or beta
03:48
in the regression model.
03:51
Then we had this unknown but fixed thing,
03:55
an unknown parameter.
03:56
And we wanted to find it.
03:57
We wanted to either estimate it or test it,
03:59
or maybe find a confidence interval for this object.
04:02
So, so far I should not have said anything that's new.
04:06
But this last sentence is actually
04:08
what's going to be different from the Bayesian part.
04:10
In particular, this unknown but fixed thing
04:12
is what's going to be changing.
04:14
04:16
In the Bayesian approach, we still
04:18
assume that we observe some random data.
04:22
But the generating process is slightly different.
04:24
It's sort of a two-layer process.
04:25
And there's one process that generates
04:27
the parameter and then one process
04:28
that, given this parameter, generates the data.
04:31
So what the first layer does, nobody really
04:35
believes that there's some random process that's
04:38
happening, about generating what is going
04:41
to be the true expected number of people
04:44
who turn their head to the right when they kiss.
04:47
But this is actually going to be something that brings us
04:49
some ease in incorporating
04:53
what we call prior belief.
04:57
We'll see an example in a second.
04:58
But often, you actually have prior belief
05:01
of what this parameter should be.
05:02
When we did, say, least squares, we looked
05:05
over all of the vectors in all of R to the p,
05:09
including the ones that have coefficients equal
05:11
to 50 million.
05:15
Those are things that we might be able to rule out.
05:18
We might be able to rule out that on a much smaller scale.
05:21
For example, well I'm not an expert
05:24
on turning your head to the right or to the left.
05:29
But maybe you can rule out the fact
05:30
that almost everybody is turning their head
05:33
in the same direction, or almost everybody is turning their head
05:35
in the other direction.
05:38
So we have this prior belief.
05:39
And this belief is going to play, hopefully,
05:43
a less and less important role as we collect more and more data.
05:47
But if we have a smaller amount of data,
05:49
we might want to be able to use this information,
05:52
rather than just shooting in the dark.
05:54
And so the idea is to have this prior belief.
05:58
And then, we want to update this prior belief
06:00
into what's called the posterior belief after we've
06:03
seen some data.
06:04
Maybe I believe that there's something
06:08
that should be in some range.
06:09
But maybe after I see data, it reinforces my beliefs.
06:12
So I'm actually having maybe a belief that's more.
06:15
So belief encompasses basically what you think
06:18
and how strongly you think about it.
06:20
That's what I call belief.
06:21
So for example, if I have a belief about some parameter
06:24
theta, maybe my belief is telling me
06:26
where theta should be and how strongly I
06:28
believe in it, in the sense that I have a very narrow region
06:32
where theta could be.
06:35
The posterior belief is, well, you see some data.
06:37
And maybe you're more confident or less confident about what
06:40
you've seen.
06:40
Maybe you've shifted your belief a little bit.
06:42
And so that's what we're going to try to see,
06:44
and how to do this in a principled manner.
06:48
To understand this better, there's
06:50
nothing better than an example.
06:52
So let's talk about another stupid statistical question.
06:56
Which is, let's try to understand p.
06:58
Of course, I'm not going to talk about politics from now on.
07:01
So let's talk about p, the proportion of women
07:03
in the population.
07:04
07:15
And so what I could do is to collect some data, X1, Xn
07:21
and assume that they're Bernoulli
07:23
with some parameter, p unknown.
07:25
So p is in [0, 1].
07:30
OK, let's assume that those guys are iid.
07:33
So this is just an indicator for each of my collected data,
07:38
if the person I randomly sample is a woman, I get a one.
07:42
If it's a man, I get a zero.
07:43
07:46
Now the question is, I sample these people randomly.
07:49
I do know their gender.
07:51
And the frequentist approach was just saying,
07:54
OK, let's just estimate p hat being Xn bar.
07:58
And then we could do some tests.
08:01
So here, there's a test.
08:02
I want to test maybe if p is equal to 0.5 or not.
08:05
That sounds like a pretty reasonable thing to test.
08:09
But we want to also maybe estimate p.
08:13
But here, this is a case where we definitely have a prior belief
08:16
of what p should be.
08:17
We are pretty confident that p is not going to be 0.7.
08:22
We actually believe that we should
08:23
be extremely close to one half, but maybe not exactly.
08:29
Maybe this population is not the population in the world.
08:32
But maybe this is the population of, say some college
08:35
and we want to understand if this college has half women
08:38
or not.
08:40
Maybe we know it's going to be close to one half,
08:42
but maybe we're not quite sure.
08:43
08:46
We're going to want to integrate that knowledge.
08:49
So I could integrate it in a blunt manner by saying,
08:52
discard the data and say that p is equal to one half.
08:55
But maybe that's just a little too much.
08:57
So how do I do this trade off between adding the data
09:01
and combining it with this prior knowledge?
09:06
In many instances, essentially what's going to happen
09:09
is this one half is going to act like one new observation.
09:14
So if you have five observations,
09:17
this is just the sixth observation,
09:18
which will play a role.
09:20
If you have a million observations,
09:21
you're going to have a million and one.
09:22
It's not going to play so much of a role.
09:24
That's basically how it goes.
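
One way to make the "one extra observation" remark concrete is the sketch below. It is an illustration rather than a formula from the lecture: the helper name `shrunk_estimate` and its `weight` argument are hypothetical, and the prior guess of one half is simply averaged in as a pseudo-observation.

```python
def shrunk_estimate(xs, prior_guess=0.5, weight=1.0):
    """Average the data together with `weight` pseudo-observations equal to prior_guess."""
    n = len(xs)
    return (sum(xs) + weight * prior_guess) / (n + weight)


small = [1, 1, 1, 0, 1]            # 5 observations, sample mean 0.8
large = [1, 1, 1, 0, 1] * 200_000  # 1,000,000 observations, same sample mean

print(shrunk_estimate(small))  # pulled noticeably toward 0.5
print(shrunk_estimate(large))  # essentially the sample mean
```

Raising `weight` makes the prior guess count for more pseudo-observations, which is one way to express a stronger belief in it.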
09:25
09:28
But, definitely not always because we'll
09:33
see that if I take my prior to be a point mass at one half here,
09:36
it's basically as if I was discarding my data.
09:39
So essentially, there's also your ability
09:41
to encompass how strongly you believe in this prior.
09:45
And if you believe infinitely more in the prior
09:47
than you believe in the data you collected,
09:49
then it's not going to act like one more observation.
09:54
The Bayesian approach is a tool to, one,
09:56
include mathematically our prior
09:59
belief into statistical procedures.
10:02
Maybe I have this prior knowledge.
10:04
But if I'm a medical doctor, it's not clear to me
10:06
how I'm going to turn this into some principled way of building
10:09
estimators.
10:10
And the second goal is going to be
10:12
to update this prior belief into a posterior belief
10:16
by using the data.
10:17
10:22
How do I do this?
10:23
And at some point, I sort of suggested
10:25
that there's two layers.
10:28
One is where you draw the parameter at random.
10:31
And two, once you have the parameter,
10:35
conditional on this parameter, you draw your data.
10:39
Nobody believes this is actually happening, that nature is just
10:42
rolling dice for us and choosing parameters at random.
10:45
But what's happening is that, this idea
10:48
that the parameter comes from some random distribution
10:51
actually captures very well how
10:54
you would encode your prior.
10:56
How would you say, my belief is as follows?
10:59
Well here's an example about p.
11:01
I'm 90% sure that p is between 0.4 and 0.6.
11:07
And I'm 95% sure that p is between 0.3 and 0.8.
11:14
So essentially, I have this possible value of p.
11:18
And what I know is that, there's 90% here between 0.4 and 0.6.
11:35
And then I have 0.3 and 0.8.
11:39
And I know that I'm 95% sure that I'm in here.
11:44
If you remember, this sort of looks like the kind of pictures
11:47
that I made when I had some Gaussian, for example.
11:50
And I said, oh here we have 90% of the observations.
11:54
And here, we have 95% of the observations.
11:57
12:00
So in a way, if I were able to tell you
12:04
all those ranges for all possible values,
12:07
then I would essentially describe a probability
12:10
distribution for p.
12:13
And what I'm saying is that, p is going
12:15
to have this kind of shape.
12:16
So of course, if I tell you only these two pieces of information,
12:19
that with 90% I'm between here and here,
12:22
and with 95% I'm between here and here, then there's
12:24
many ways I can accomplish that, right.
12:26
I could have something that looks like this, maybe.
12:28
12:33
It could be like this.
12:35
There's many ways I can have this.
12:37
Some of them are definitely going
12:38
to be mathematically more convenient than others.
12:42
And hopefully, we're going to have things
12:44
that I can parameterize very well.
12:47
Because if I tell you this is this guy,
12:49
then there's basically one, two three, four, five, six,
12:54
seven parameters.
12:56
So I probably don't want something
12:57
that has seven parameters.
12:59
But maybe I can say, oh, it's a Gaussian and all
13:01
I have to do is to tell you where it's centered
13:03
and what the standard deviation is.
13:04
13:07
So the idea of using this two layer thing,
13:11
where we think of the parameter p
13:12
as being drawn from some distribution,
13:14
is really just a way for us to capture this information.
13:17
Our prior belief being, well there's
13:20
this percentage of chances that it's there.
13:22
But percentage of chance, I'm
13:24
deliberately not using probability here.
13:28
So it's really a way to get close to this.
13:30
13:33
That's why I say, the true parameter is not random.
13:36
But the Bayesian approach acts as if it were random.
13:40
And then, just spits out a procedure
13:42
out of this thought process, this thought experiment.
13:49
So when you practice Bayesian statistics a lot,
13:54
you start getting automatisms.
13:57
You start getting some things that you do without really
14:00
thinking about it. Just like when
14:02
you're a statistician, the first thing you ask is,
14:04
can I think of this data as being Gaussian for example?
14:07
When you're Bayesian you're thinking about,
14:09
OK I have a set of parameters.
14:11
So here, I can describe my parameter
14:14
as being theta in general, in some big
14:20
parameter space, capital Theta.
14:21
But what spaces did we encounter?
14:24
Well, we encountered the real line.
14:27
We encountered the interval [0, 1] for Bernoullis. And we
14:31
encountered the positive real line
14:36
for exponential distributions, etc.
14:39
And so what I'm going to need to do,
14:42
if I want to put some prior on those spaces,
14:44
I'm going to have to have a usual set of tools
14:47
for this guy, usual set of tools for this guy,
14:49
usual set of tools for this guy.
14:51
And by usual set of tools, I mean
14:52
I'm going to have to have a family of distributions that's
14:54
supported on this.
14:56
So in particular, this is the space
14:59
in which my parameter that I usually denote
15:01
by p for Bernoulli lives.
15:03
And so what I need is to find a distribution on the interval 0,
15:07
1 just like this guy.
15:13
The problem with the Gaussian is that it's
15:15
not on the interval 0, 1.
15:17
It's going to spill out at the ends.
15:20
And it's not going to be something that works for me.
15:22
And so the question is, I need to think about distributions
15:25
that are probably continuous.
15:27
Why would I restrict myself to discrete distributions? And distributions that
15:30
are actually convenient. And for Bernoulli, one that's actually
15:34
basically the main tool that everybody is using
15:36
is the so-called beta distribution.
15:39
So the beta distribution has two parameters.
15:42
15:50
So x follows a beta with parameters
15:56
a and b if it has a density, f of x
16:05
is equal to x to the a minus 1, times
16:09
1 minus x to the b minus 1, if x is in the interval 0,
16:15
1 and 0 for all other x's.
16:22
OK?
16:23
16:27
Why is that a good thing?
16:30
Well, it's a density that's on the interval 0, 1 for sure.
16:33
But now I have these two parameters, and the set of shapes
16:37
that I can get by tweaking those two parameters is incredible.
16:41
16:44
It's going to be a unimodal distribution.
16:46
It's still fairly nice.
16:47
It's not going to be something that goes like this and this.
16:49
Because if you think about this, what
16:52
would it mean if your prior distribution on the interval 0,
16:55
1 had this shape?
16:57
16:59
It would mean that, maybe you think that p is here
17:01
or maybe you think that p is here,
17:03
or maybe you think that p is here.
17:05
Which essentially means that you think
17:06
that p can come from three different phenomena.
17:10
And there's other models that are called mixtures
17:12
for that, that directly account for the fact
17:15
that maybe there are several phenomena that are aggregated
17:19
in your data set.
17:21
But if you think that your data set is sort of pure,
17:23
and that everything comes from the same phenomenon,
17:25
you want something that looks like this,
17:28
or maybe looks like this, or maybe is sort of symmetric.
17:32
You want to get all this stuff.
17:34
Maybe you want something that says, well
17:36
if I'm talking about p being the proportion
17:42
of women in the whole world, you want something that's probably
17:45
really spiked around one half.
17:48
Almost a point mass, because, you know,
17:50
let's agree that 0.5 is the actual number.
17:54
So you want something that says, OK maybe I'm wrong.
17:58
But I'm sure I'm not going to be really that far off.
18:01
So you want something that's really pointy.
18:03
But if it's something you've never checked,
18:06
and again I can not make references at this point,
18:09
but something where you might have some uncertainty that
18:13
should be around one half.
18:14
Maybe you want something that a little more allows
18:17
you to say, well, I think there's more around one half.
18:19
But there's still some fluctuations that are possible.
18:22
And in particular here, I talk about p,
18:25
where the two parameters a and b are actually the same.
18:29
I call them a.
18:30
One is called scale.
18:31
The other one is called shape.
18:33
Oh sorry, this is not a density.
18:35
So it actually has to be normalized.
18:38
When you integrate this guy, it's
18:40
going to be some function that depends on a
18:41
and b; it actually depends on a and b
18:43
through the beta function,
18:45
which is this combination of gamma functions,
18:47
so that's why it's called beta distribution.
18:51
That's the definition of the beta function when you
18:53
integrate this thing anyway.
18:55
You just have to normalize it.
18:56
That's just a number that depends on the a and b.
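
As a minimal sketch of the normalized density just described, the code below uses the standard identity B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) for the beta function. The helper names `beta_pdf` and `integrate` are made up for illustration, and the integration check is a crude midpoint rule, not a recommended numerical method.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for a > 0 and b > 0."""
    # The endpoints are treated as outside the support; this does not change any integral.
    if not 0.0 < x < 1.0:
        return 0.0
    # log of the normalizing constant 1 / B(a, b)
    log_const = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_const + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def integrate(f, n_steps=100_000):
    """Midpoint rule on the interval (0, 1)."""
    h = 1.0 / n_steps
    return h * sum(f((i + 0.5) * h) for i in range(n_steps))


total = integrate(lambda x: beta_pdf(x, 3.0, 5.0))
print(total)  # close to 1, as a density should integrate
```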
18:59
So here, if you take a equal to b,
19:01
you have something that essentially
19:03
is symmetric around one half.
19:05
Because what does it look like?
19:07
Well, so my density f of x, is going to be what?
19:10
It's going to be my constant times x times 1 minus x,
19:19
to the power a minus 1.
19:21
And this function, x times 1 minus x looks like this.
19:26
We've drawn it before.
19:27
That was something that showed up
19:29
as being the variance of my Bernoulli.
19:36
So we know it's something that takes its maximum at one half.
19:42
And now I'm just taking a power of this guy.
19:44
So I'm really just distorting this thing
19:46
in some fairly symmetric manner.
19:51
19:56
This is the distribution that we actually take for p.
20:00
I assume that p, the parameter, notice
20:03
that this is kind of weird.
20:04
First of all, this is probably the first time
20:06
in this entire course that something
20:09
has a distribution when it's actually a lower case letter.
20:12
That's something you have to deal with,
20:13
because we've been using lower case letters for parameters.
20:16
And now we want them to have a distribution.
20:18
So that's what's going to happen.
20:20
This is called the prior distribution.
20:23
So really, I should write something like f of p
20:27
is equal to a constant times p times 1 minus p, to the a minus 1.
20:35
Well no, actually I should not because then it's confusing.
20:39
One thing in terms of notation that I'm
20:41
going to write, when I have a constant here
20:43
and I don't want to make it explicit.
20:45
And we'll see in a second why I don't need to make it explicit.
20:48
I'm going to write this as f of x
20:53
is proportional to x times 1 minus x, to the a minus 1.
21:04
That's just to say, equal to some constant that does not
21:08
depend on x times this thing.
21:11
21:16
So if we continue with our experiment
21:21
where I'm drawing this data, X1 to Xn,
21:25
which is Bernoulli p, if p has some distribution
21:29
it's not clear what it means to have a Bernoulli
21:31
with some random parameter.
21:32
So what I'm going to do is, then I'm going to first draw my p.
21:35
Let's say I get a number, 0.52.
21:38
And then, I'm going to draw my data conditionally on p.
21:41
So here comes the first and last flowchart of this class.
21:45
21:49
So nature first draws p.
21:51
21:53
p follows some Beta(a, a).
21:58
Then I condition on p.
21:59
22:02
And then I draw X1, Xn that are iid, Bernoulli p.
22:10
Everybody understand the process of generating this data?
22:14
So you first draw a parameter, and then you just
22:16
flip those independent biased coins with this particular p.
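
The two-layer thought experiment can be simulated in a few lines. This is only a sketch: the function name `draw_dataset`, the seed, and the choices n = 1000 and a = 20 are arbitrary assumptions, not values from the lecture.

```python
import random


def draw_dataset(n, a, rng):
    """Two-layer draw: first p ~ Beta(a, a), then X1..Xn iid Bernoulli(p) given p."""
    p = rng.betavariate(a, a)  # layer 1: the parameter
    xs = [1 if rng.random() < p else 0 for _ in range(n)]  # layer 2: the data given p
    return p, xs


rng = random.Random(0)  # fixed seed so the sketch is reproducible
p, xs = draw_dataset(n=1000, a=20.0, rng=rng)
print(p, sum(xs) / len(xs))  # the sample mean should land near the drawn p
```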
22:21
There's this layered thing.
22:23
22:26
Now conditionally on p, right, so here I have this prior about p
22:31
which was the thing.
22:32
So this is just the thought process again,
22:34
it's not anything that actually happens in practice.
22:36
This is my way of thinking about how the data was generated.
22:39
And from this, I'm going to try to come up with some procedure.
22:43
Just like, if your estimator is the average of the data,
22:47
you don't have to understand probability
22:49
to say that my estimator is the average of the data.
22:52
Anyone outside this room understands
22:54
that the average is a good estimator
22:55
for some average behavior.
22:58
And they don't need to think of the data
23:01
as being a random variable, et cetera.
23:02
So same thing, basically.
23:04
23:10
In this case, you can see that the posterior distribution
23:13
is still a beta.
23:14
23:18
What it means is that, I had this thing.
23:20
Then, I observed my data.
23:21
And then, I continue and here I'm
23:23
going to update my prior into some posterior
23:32
distribution, pi.
23:36
And here, this guy is actually also a beta.
23:39
23:43
My posterior distribution for p is also
23:45
a beta distribution with the parameters
23:48
that are on this slide.
23:48
And I'll have the space to reproduce them.
23:51
So I start the beginning of this flowchart
23:54
as having p, which is a prior.
23:57
I'm going to get some observations
23:58
and then, I'm going to update what my posterior is.
24:01
24:04
This posterior is basically something
24:06
where, and this is what's beautiful in Bayesian statistics,
24:09
as soon as you have this distribution,
24:13
it's essentially capturing all the information about the data
24:17
that you want for p.
24:19
And it's not just the point.
24:20
It's not just an average.
24:21
It's actually an entire distribution
24:23
for the possible values of theta.
24:27
And it's not the same thing as saying, well
24:30
if theta hat is equal to Xn bar, in the Gaussian case I know
24:35
that this has some mean, mu.
24:37
And then maybe it has variance sigma squared over n.
24:39
That's not what I mean by, this is my posterior distribution.
24:43
This is not what I mean.
24:46
This is going to come from this guy, the Gaussian thing
24:49
and the central limit theorem.
24:51
But what I mean is this guy.
24:52
And this came exclusively from the prior distribution.
24:58
If I had another prior, I would not necessarily
25:00
have a beta distribution on the output.
25:03
So when I have the same family of distributions
25:07
at the beginning and at the end of this flowchart,
25:11
I say that beta is a conjugate prior.
25:16
25:21
Meaning I put in beta as a prior and I get beta as a posterior.
25:27
And that's why betas are so popular.
25:30
Conjugate priors are really nice,
25:32
because you know that whatever you put in, what you're going
25:35
to get in the end is a beta.
25:37
So all you have to think about is the parameters.
25:38
You don't have to check again what the posterior is
25:41
going to look like, what the PDF of this guy is going to be.
25:43
You don't have to think about it.
25:44
You just have to check what the parameters are.
25:46
And there's families of conjugate priors.
25:48
Gaussian gives Gaussian, for example.
25:51
There's a bunch of them.
25:52
And this is what drives people into using specific priors as
25:57
opposed to others.
25:58
It has nice mathematical properties.
26:00
Nobody believes that p is really distributed according to beta.
26:05
But it's flexible enough and super convenient
26:08
mathematically.
26:09
26:12
Now let's see for one second, before we actually
26:14
go any further.
26:17
I didn't mention it, but a and b here
26:19
are both positive numbers.
26:21
26:24
They can be anything positive.
26:27
So here what I did is that I updated a
26:29
into a plus the sum of my data, and b
26:34
into b plus n minus the sum of my data.
26:38
So that's essentially, a becomes a plus the number of ones.
26:41
26:45
Well, that's only when I have a and a.
26:47
So the first parameter becomes itself plus the number of ones.
26:50
And the second one becomes itself
26:51
plus the number of zeros.
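
The update rule just stated is short enough to write out. In this sketch the function name `update_beta` and the example data are hypothetical; the rule itself (first parameter plus the number of ones, second parameter plus the number of zeros) is the one from the lecture.

```python
def update_beta(a, b, xs):
    """Posterior Beta parameters after observing 0/1 data xs under a Beta(a, b) prior."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return a + ones, b + zeros


a_post, b_post = update_beta(2.0, 2.0, [1, 0, 1, 1, 0, 1])
print(a_post, b_post)  # 6.0 4.0

# The mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)  # 0.6
```

Only the two parameters need to be tracked; the shape of the posterior density never has to be rederived.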
26:52
26:55
And so just as a sanity check, what does this mean?
26:59
If a goes to zero, what is the beta when a goes to 0?
27:08
We can actually read this from here.
27:10
27:16
Actually, let's take a goes to--
27:19
27:25
no.
27:26
Sorry, let's just do this.
27:27
27:38
I'll do it when we talk about non-informative prior,
27:40
because it's a little too messy.
27:42
27:47
How do we do this?
27:47
How did I get this posterior distribution, given the prior?
27:51
How do I update this? Well, this is called Bayesian statistics.
27:56
And you've heard this word, Bayes before.
27:58
And the way you've heard it is in the Bayes formula.
28:02
What was the Bayes formula?
28:03
The Bayes formula was telling you
28:05
that the probability of A, given B was equal to something that
28:11
depended on the probability of B, given A. That's what it was.
28:14
28:16
You can actually either remember the formula
28:18
or you can remember the definition.
28:20
And this is P of A and B divided by P of B.
28:26
So this is P of B given A, times P of A, divided by P of B.
28:35
That's what Bayes formula is telling you.
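
For reference, the formula just spoken can be written out as follows, for events A and B with positive probability (capital P is used here only to avoid a clash with the parameter p):

```latex
\[
P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)} \;=\; \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad P(A) > 0,\; P(B) > 0 .
\]
```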
28:37
Agree?
28:40
So now what I want is to have something that's telling me
28:46
how this is going to work.
28:49
What is going to play the role of those events, A and B?
28:54
Well one is going to be, this is going
28:59
to be the distribution of my parameter theta,
29:01
given that I see the data.
29:03
And this is going to tell me, what
29:05
is the distribution of the data, given that I know what
29:07
my parameter theta is.
29:09
But that part, if this is the data and this
29:11
is the parameter theta, this is what
29:13
we've been doing all along.
29:15
The distribution of the data, given the parameter here
29:18
was n iid Bernoulli p.
29:22
I knew exactly what their joint probability mass function is.
29:27
Then, that was what?
29:29
So we said that this is going to be my data
29:32
and this is going to be my parameter.
29:34
29:37
So that means that, this is the probability of my data,
29:40
given the parameter.
29:43
This is the probability of the parameter.
29:45
What is this?
29:46
What did we call this?
29:49
This is the prior.
29:50
It's just the distribution of my parameter.
29:53
Now what is this?
29:56
Well, this is just the distribution
29:57
of the data, itself.
30:00
This is essentially the distribution of this,
30:06
if this was indeed not conditioned on p.
30:15
So if I don't condition on p, this data
30:18
is going to be a bunch of iid, Bernoulli with some parameter.
30:23
But the parameter is random, right.
30:25
So for different realization of this data set,
30:27
I'm going to get different parameters for the Bernoulli.
30:30
And so that leads to some sort of convolution.
30:34
It's not really a convolution in this case,
30:36
but it's like some sort of composition of distributions.
30:38
I have the randomness that comes from here and then,
30:41
the randomness that comes from realizing the Bernoulli.
30:44
That's just the marginal distribution.
30:46
It actually might be painful to understand what this is, right.
30:49
In a way, it's sort of a mixture and it's not super nice.
30:52
But we'll see that this actually won't matter for us.
30:55
This is going to be some number.
30:57
It's going to be there.
30:58
But it won't matter for us what it is.
31:00
Because it actually does not depend on the parameter.
31:02
And that's all that matters to us.
31:04
31:09
Let's put some names on those things.
31:11
This was very informal.
31:12
So let's put some actual names on what we call prior.
31:19
So what is the formal definition of a prior,
31:22
what is the formal definition of a posterior,
31:24
and what are the rules to update it?
31:27
So I'm going to have my data, which is going to be X1, Xn.
31:30
31:35
Let's say they are iid, but they don't actually have to be.
31:38
And so I'm going to have given, theta.
31:41
31:47
And when I say given, it's either
31:48
given like I did in the first part of this course
31:51
in all previous chapters, or conditionally on.
31:55
If you're thinking like a Bayesian, what I really mean
31:58
is conditionally on this random parameter.
32:02
It's as if it was a fixed number.
32:06
They're going to have a distribution,
32:08
X1, Xn is going to have some distribution.
32:12
Let's assume for now it's a PDF, pn of X1, Xn.
32:19
I'm going to write theta like this.
32:22
So for example, what is this?
32:24
Let's say this is a PDF.
32:27
It could be a PMF.
32:28
Everything I say, I'm going to think of them as being PDF's.
32:31
I'm going to combine PDF's with PDF's, but I
32:33
could combine PDF with PMF, PMF with PDF, or PMF with PMF.
32:37
So everywhere you see a D could be an M.
32:41
Now I have those things.
32:42
So what does it mean?
32:43
So here is an example.
32:46
X1, Xn are iid N(theta, 1).
32:53
Now I know exactly what the joint PDF of this thing is.
32:57
It means that pn of X1, Xn given theta is equal to what?
33:03
Well, it's 1 over the square root of 2 pi, to the power n,
33:10
times e to the minus the sum from i equal 1 to n
33:15
of xi minus theta, squared, divided by 2.
33:18
So that's just the joint distribution of n iid
33:21
N(theta, 1) random variables.
33:25
That's my pn given theta.
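
As a sketch of this joint density for n iid N(theta, 1) observations, with normalizing constant (2 pi) to the power minus n over 2, the code below checks that it agrees with the product of the n individual densities. The function names and the sample values are assumptions made for illustration.

```python
import math


def joint_density(xs, theta):
    """Joint PDF of n iid N(theta, 1) observations, evaluated at xs."""
    n = len(xs)
    sum_sq = sum((x - theta) ** 2 for x in xs)
    return (2 * math.pi) ** (-n / 2) * math.exp(-sum_sq / 2)


def normal_pdf(x, theta):
    """PDF of a single N(theta, 1) observation."""
    return math.exp(-((x - theta) ** 2) / 2) / math.sqrt(2 * math.pi)


xs = [0.3, -1.2, 0.8]
product = math.prod(normal_pdf(x, 0.5) for x in xs)
print(joint_density(xs, 0.5), product)  # the two agree up to rounding
```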
33:27
Now this is what we denoted by f sub theta before.
33:33
We had the subscript before, but now we just put a bar and theta
33:36
because we want to remember that this is actually
33:38
conditioned on theta.
33:40
But this is just notation.
33:42
You should just think of this as being, just the usual thing
33:46
that you get from some statistical model.
33:50
Now, that's going to be pn.
33:53
34:11
Theta has prior distribution, pi.
34:19
34:22
For example, so think of it as either PDF or PMF again.
34:29
For example, pi of theta was what?
34:33
Well it was some constant times theta to the a minus 1,
34:40
times 1 minus theta to the a minus 1.
34:43
So it has some prior distribution,
34:45
and that's another PDF.
34:49
So now I'm given the distribution of my
34:51
x's given theta, and I'm given the distribution of my theta.
34:54
I'm given this guy.
34:57
That's this guy.
35:00
I'm given that guy, which is my pi.
35:05
So that's my pn of X1, Xn given theta.
35:11
That's my pi of theta.
35:13
35:17
Well, this is just the integral of pn
35:21
of X1, Xn given theta, times pi of theta, d theta,
35:28
over all possible values of theta.
35:29
That's just when I integrate out my theta,
35:33
or I compute the marginal distribution,
35:35
I did this by integrating.
35:37
That's just basic probability, conditional probabilities.
35:41
Then if I had the PMF, I would just
35:42
sum over the values of thetas.
35:43
35:49
Now what I want is to find what's called,
35:55
so that's the prior distribution,
35:58
and I want to find the posterior distribution.
36:01
36:15
It's pi of theta, given X1, Xn.
36:18
36:21
If I use Bayes' rule I know that this
36:23
is pn of X1, Xn, given theta times pi of theta.
36:34
And then it's divided by the distribution
36:37
of those guys, which I will write as integral over theta
36:41
of pn, X1, Xn, given theta times pi of theta, d theta.
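A minimal numerical sketch of this formula, assuming a one-dimensional parameter living on the interval (0, 1). The function name `posterior_on_grid`, the midpoint-rule integration, and the data (7 ones out of 10 trials) are illustrative assumptions, not something from the lecture.

```python
def posterior_on_grid(likelihood, prior, grid_size=1000):
    """Approximate pi(theta | data) on a midpoint grid over (0, 1).

    likelihood(theta) and prior(theta) may be unnormalized; the denominator
    of Bayes' rule is approximated by a midpoint Riemann sum.
    """
    width = 1.0 / grid_size
    thetas = [(k + 0.5) * width for k in range(grid_size)]
    unnormalized = [likelihood(t) * prior(t) for t in thetas]
    # Integral over theta of p_n(x | theta) * pi(theta) d theta.
    evidence = sum(unnormalized) * width
    density = [u / evidence for u in unnormalized]
    return thetas, density

# Hypothetical data: 7 ones out of 10 Bernoulli trials, with a Beta(2, 2)-shaped prior.
ones, n = 7, 10
thetas, density = posterior_on_grid(
    likelihood=lambda p: p ** ones * (1 - p) ** (n - ones),
    prior=lambda p: p * (1 - p),
)
total_mass = sum(density) / len(density)  # integral of the posterior density
posterior_mean = sum(t * d for t, d in zip(thetas, density)) / len(density)
```

Dividing by the integral is what turns the product of likelihood and prior into a density that integrates to one.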
36:48
36:55
Everybody's with me, still?
36:57
If you're not comfortable with this,
36:59
it means that you probably need to go read your couple of pages
37:03
on conditional densities and conditional
37:04
PMFs from your probability class.
37:07
There's really not much there.
37:08
It's just a matter of being able to define those quantities, f
37:13
density of x, given y.
37:15
This is just what's called a conditional density.
37:17
You need to understand what this object is
37:19
and how it relates to the joint distribution of x and y,
37:21
or maybe the distribution of x or the distribution of y.
37:24
37:27
But it's the same rules.
37:29
One way to actually remember this
37:31
is, this is exactly the same rules as this.
37:33
When you see a bar, it's the same thing as the probability
37:36
of this and this guy.
37:37
So for densities, it's just a comma
37:40
divided by the probability of the second guy.
37:43
That's it.
37:45
So if you remember this, you can just do some pattern matching
37:48
and see what I just wrote here.
37:49
37:53
Now, I can compute every single one of these guys.
37:57
This is something I get from my modeling.
38:04
So I did not write this.
38:05
It's not written in the slides.
38:09
But I give a name to this guy that was my prior distribution.
38:14
And that was my posterior distribution.
38:16
38:22
In chapter three, maybe what did we call this guy?
38:26
38:32
The one that does not have a name and that's in the box.
38:35
38:39
What did we call it?
38:40
38:43
AUDIENCE: [INAUDIBLE]
38:46
PHILIPPE RIGOLLET: It is the joint distribution of the Xi's.
38:48
38:51
And we gave it a name.
38:53
AUDIENCE: [INAUDIBLE]
38:54
PHILIPPE RIGOLLET: It's the likelihood, right?
38:56
This is exactly the likelihood.
38:57
This was the likelihood of theta.
38:59
39:03
And this is something that's very important to remember,
39:06
and that really reminds you that these things are really not
39:10
that different.
39:11
Maximum likelihood estimation and Bayesian estimation,
39:13
because your posterior is really just your likelihood times
39:18
something that's just putting some weights on the thetas,
39:23
depending on where you think theta should be.
39:26
If I had, say a maximum likelihood estimate,
39:28
and my likelihood in theta looked like this,
39:31
but my prior in theta looked like this.
39:33
I said, oh I really want thetas that are like this.
39:37
So what's going to happen is that, I'm
39:38
going to turn this into some posterior that looks like this.
39:41
39:44
So I'm really just weighting. In this posterior,
39:47
this is a constant that does not depend on theta right?
39:49
Agreed?
39:50
I integrated over theta, so theta is gone.
39:53
So forget about this guy.
39:56
I have basically, that the posterior distribution up
39:59
to scaling, because it has to be a probability density and not
40:01
just any function that's positive,
40:03
is the product of this guy.
40:05
It's a weighted version of my likelihood.
40:06
That's all it is.
40:07
I'm just weighing the likelihood,
40:09
using my prior belief on theta.
40:13
And so given this guy, a natural estimator,
40:16
if you follow the maximum likelihood principle,
40:19
would be the maximum of this posterior.
40:23
Agreed?
40:24
That would basically be doing exactly what maximum likelihood
40:28
estimation is telling you.
40:31
So it turns out that you can.
40:33
It's called Maximum A Posteriori,
40:35
and I won't talk much about this, or MAP.
40:39
That's Maximum a Posteriori.
40:44
So it's just that theta hat is the arg
40:47
max of pi of theta, given X1, Xn.
40:50
40:54
And it sounds like it's OK.
40:56
I'll give you a density and you say, OK
40:58
I have a density for all values of my parameters.
41:00
You're asking me to summarize it into one number.
41:03
I'm just going to take the most likely number of those guys.
41:06
But you could summarize it, otherwise.
41:08
You could take the average.
41:10
You could take the median.
41:12
You could take a bunch of numbers.
41:14
And the beauty of Bayesian statistics
41:16
is that, you don't have to take any number in particular.
41:19
You have an entire posterior distribution.
41:21
This is not only telling you where theta is,
41:25
but it's actually telling you the difference
41:29
if you actually give as something
41:31
that gives you the posterior.
41:33
Now, let's say the theta is p between 0 and 1.
41:36
If my posterior distribution looks like this,
41:39
or my posterior distribution looks like this,
41:43
then those two guys have one, the same mode.
41:47
This is the same value.
41:49
And they're symmetric, so they'll also have the same mean.
41:51
So these two posterior distributions
41:53
give me the same summary into one number.
41:55
However clearly, one is much more confident
41:58
than the other one.
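As a rough illustration of two posteriors that agree on the one-number summaries but not on the spread, here is a sketch using two symmetric Beta distributions. The specific shapes Beta(2, 2) and Beta(50, 50) are assumptions chosen for the example, since the curves drawn on the board are not in the transcript.

```python
def beta_summary(a, b):
    """Mean, mode (valid for a, b > 1) and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, variance

wide = beta_summary(2, 2)      # flat-ish bump around 1/2
narrow = beta_summary(50, 50)  # sharp peak around 1/2
```

Both distributions have mean and mode equal to one half, but the second has a much smaller variance, which is the extra information the full posterior carries.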
41:59
So I might as well just spit it out as a solution.
42:04
You can do even better.
42:05
People actually do things, such as drawing a random number
42:09
from this distribution.
42:10
Say, this is my number.
42:12
That's kind of dangerous, but you
42:14
can imagine you could do this.
42:15
42:20
This is what works.
42:22
That's what we went through.
42:23
So here, as you notice I don't care so much about this part
42:28
here.
42:30
Because it does not depend on theta.
42:32
So I know that given the product of those two things,
42:35
this thing is only the constant that I need to divide
42:37
so that when I integrate this thing over theta,
42:40
it integrates to one.
42:41
Because this has to be a probability density on theta.
42:45
I can write this and just forget about that part.
42:47
And that's what's written on the top of this slide.
42:52
This notation, this sort of weird alpha, or I don't know.
42:57
Infinity sign cropped on the right.
42:59
Whatever you want to call this thing
43:02
is actually just really emphasizing the fact
43:04
that I don't care.
43:06
I write it because I can, but you know what it is.
43:12
43:17
In some instances, you have to compute the integral.
43:19
In some instances, you don't have to compute the integral.
43:21
And a lot of Bayesian computation
43:23
is about saying, OK it's actually
43:25
really hard to compute this integral,
43:27
so I'd rather not do it.
43:28
So let me try to find some methods that will allow me
43:31
to sample from the posterior distribution,
43:33
without having to compute this.
43:35
And that's what's called Monte-Carlo Markov
43:37
chains, or MCMC, and that's exactly what they're doing.
43:40
They're just using only ratios of things,
43:42
like that for different thetas.
43:44
And which means that if you take ratios,
43:45
the normalizing constant is gone and you don't
43:47
need to find this integral.
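To make the "only ratios" remark concrete, here is a minimal random-walk Metropolis sketch for the Bernoulli model with a Beta(a, a) prior. The lecture does not name a specific algorithm, so the choice of Metropolis, the function names, the step size, the seed and the data (7 ones out of 10) are all assumptions for illustration.

```python
import math
import random

def log_unnormalized_posterior(p, ones, n, a):
    """log of p^(ones + a - 1) * (1 - p)^(n - ones + a - 1), normalizing constant omitted."""
    if p <= 0.0 or p >= 1.0:
        return float("-inf")
    return (ones + a - 1) * math.log(p) + (n - ones + a - 1) * math.log(1 - p)

def metropolis(ones, n, a, steps=20000, step_size=0.2, seed=0):
    """Random-walk Metropolis: each move is accepted using only a ratio of posterior values."""
    rng = random.Random(seed)
    current = 0.5
    current_log = log_unnormalized_posterior(current, ones, n, a)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_log = log_unnormalized_posterior(proposal, ones, n, a)
        # Log of the ratio of posteriors: the unknown integral cancels out.
        log_ratio = proposal_log - current_log
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_log = proposal, proposal_log
        samples.append(current)
    return samples[steps // 10:]  # drop the first 10% as burn-in

samples = metropolis(ones=7, n=10, a=1)
estimated_mean = sum(samples) / len(samples)
```

Here the exact posterior is a Beta(8, 4), so the sample average can be compared with its mean 2/3; in harder models no such closed form is available and the samples are all you have.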
43:50
So we won't go into those details at all.
43:53
That would be the purpose of an entire course
43:54
on Bayesian inference.
43:56
Actually, even Bayesian computations
43:59
would be an entire course on its own.
44:02
And there's some very interesting things
44:03
that are going on there, at the interface of stats
44:05
and computation.
44:06
44:10
So let's go back to our example and see if we can actually
44:12
compute any of those things.
44:13
Because it's very nice to give you some data, some formulas.
44:17
Let's see if we can actually do it.
44:19
In particular, can I actually recover this claim
44:23
that the posterior associated to a beta prior with a Bernoulli
44:31
likelihood is actually giving me a beta again?
44:35
What was my prior?
44:36
44:42
So p was following a Beta(a, a), which
44:45
means that p, the density.
44:48
44:53
That was pi of theta.
44:56
Well I'm going to write this as pi of p--
44:59
was proportional to p to the A minus 1 times 1 minus p
45:05
to the A minus 1.
45:08
So that's the first ingredient I need to complete my posterior.
45:11
I really need only two, if I want it up to a constant.
45:14
The second one was pn.
45:16
45:20
We've computed that many times.
45:22
And we had even a nice compact way of writing it,
45:25
which was that pn of X1, Xn, given the parameter p.
45:32
So the joint density of my data, given p, that's my likelihood.
45:36
The likelihood of p was what?
45:38
Well it was p to the sum of Xi's.
45:41
45:44
Times 1 minus p to the n minus sum of the Xi's.
45:46
45:50
Anybody wants me to parse this more?
45:53
Or do you remember seeing that from maximum likelihood
45:56
estimation?
45:57
Yeah?
45:57
AUDIENCE: [INAUDIBLE]
46:02
PHILIPPE RIGOLLET: That's what conditioning does.
46:04
46:10
AUDIENCE: [INAUDIBLE] previous slide.
46:15
[INAUDIBLE] bottom there, it says D pi of t.
46:19
Shouldn't it be dt pi of t?
46:23
PHILIPPE RIGOLLET: So d pi of t is
46:25
a measure theoretic notation, which I used without thinking.
46:29
And I should not because I can see it upsets you.
46:32
D pi of T is just a natural way to say
46:35
that I integrate against whatever I'm
46:38
given for the prior of theta.
46:43
In particular, if theta is just the mix of a PDF and a point
46:48
mass, maybe I say that my p takes
46:51
value 0.5 with probability 0.5.
46:54
And then is uniform on the interval with probability 0.5.
46:58
For this, I neither have a PDF nor a PMF.
47:01
But I can still talk about integrating with respect
47:04
to this, right?
47:04
It's going to look like, if I take a function f of T,
47:08
D pi of T is going to be one half of f of one half.
47:14
That's the point mass with probability one half,
47:16
at one half.
47:17
Plus one half of the integral between 0 and 1, of f of t, dt.
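The integral against this mixed prior can be sketched numerically as follows. The function name `integrate_against_mixed_prior` and the midpoint rule are assumptions made for illustration; the prior is the one from the example, a point mass at one half with probability one half plus a uniform on (0, 1) with probability one half.

```python
def integrate_against_mixed_prior(f, panels=1000):
    """Integral of f(t) d pi(t) for pi = 1/2 * (point mass at 1/2) + 1/2 * Uniform(0, 1)."""
    point_mass_part = 0.5 * f(0.5)
    width = 1.0 / panels
    # Midpoint rule for the integral of f over (0, 1).
    uniform_part = 0.5 * sum(f((k + 0.5) * width) for k in range(panels)) * width
    return point_mass_part + uniform_part

value_identity = integrate_against_mixed_prior(lambda t: t)    # prior mean of p
value_square = integrate_against_mixed_prior(lambda t: t * t)  # prior second moment of p
```

Neither a PDF nor a PMF describes this prior on its own, but the integral is still a perfectly ordinary number.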
47:23
This is just the notation, which is actually funnily enough,
47:26
interchangeable with pi of DT.
47:29
47:32
But if you have a density, it's really
47:34
just the density pi of t, dt.
47:39
That's if pi is really a density. This notation is
47:41
for when pi is a measure and not a density.
47:44
47:46
Everybody else, forget about this.
47:49
This is not something you should really
47:51
worry about at this point.
47:52
This is more graduate level probability classes.
47:55
But yeah, it's called measure theory.
47:57
And that's when you think of pi as being a measure
47:59
in an abstract fashion.
47:59
You don't have to worry whether it's a density
48:01
or not, or whether it has a density.
48:04
48:08
So everybody is OK with this?
48:10
48:15
Now I need to compute my posterior.
48:17
And as I said, my posterior is really
48:23
just the product of the likelihood weighted
48:25
by the prior.
48:28
Hopefully, at this stage of your education,
48:33
you can multiply two functions.
48:35
So what's happening is, if I multiply this guy
48:37
with this guy, p gets this guy to the power
48:41
this guy plus this guy.
48:42
48:53
And then 1 minus p gets the power n minus sum of Xi's.
49:00
So this is always from i equal 1 to n.
49:02
And then plus A minus 1 as well.
49:04
49:10
This is up to constant, because I still need to solve this.
49:15
And I could try to do it.
49:17
But I really don't have to, because I
49:18
know that if my density has this form, then
49:24
it's a beta distribution.
49:25
And then I can just go on Wikipedia
49:26
and see what should be the normalization factor.
49:29
But I know it's going to be a beta distribution.
49:31
It's actually the beta with parameter.
49:34
So this is really my beta with parameter, sum of Xi,
49:39
i equal 1 to n plus A minus 1.
49:43
And then the second parameter is n minus sum
49:46
of the Xi's plus A minus 1.
49:49
49:54
I just wrote what was here.
49:59
What happened to my one?
50:01
Oh no, sorry.
50:02
Beta has the power minus 1.
50:05
So that's the parameter of the beta.
50:08
And this is the parameter of the beta.
50:10
50:15
Beta is over there, right?
50:16
So I just replace A by what I see.
50:19
A is just becoming this guy plus this guy
50:22
and this guy plus this guy.
50:26
Everybody is comfortable with this computation?
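The update rule that comes out of this computation fits in a few lines. This is a sketch only: the function name `beta_bernoulli_update` and the sample data are hypothetical, and it allows a general Beta(a, b) prior, of which the lecture's Beta(a, a) is the symmetric case.

```python
def beta_bernoulli_update(a, b, observations):
    """Parameters of the Beta posterior for a Beta(a, b) prior and 0/1 observations."""
    ones = sum(observations)
    zeros = len(observations) - ones
    # The prior exponents pick up the counts of ones and zeros.
    return a + ones, b + zeros

# Hypothetical data: 4 ones and 2 zeros, with a symmetric Beta(2, 2) prior.
posterior_a, posterior_b = beta_bernoulli_update(2, 2, [1, 0, 1, 1, 0, 1])
```

The first parameter grows by the number of ones and the second by the number of zeros, so no integral ever has to be computed.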
50:28
50:34
We just agreed that beta priors for Bernoulli observations
50:38
are certainly convenient.
50:42
Because they are just conjugate, and we know
50:44
that's what is going to come out in the end.
50:46
That's going to be a beta as well.
50:48
I just claim it was convenient.
50:50
It was certainly convenient to compute this, right?
50:52
There was certainly some compatibility
50:55
when I had to multiply this function by that function.
50:57
And you can imagine that things could go much more wrong,
51:00
than just having p to some power and p to some power, 1 minus p
51:03
to some power, when it might just be some other power.
51:06
Things were nice.
51:09
Now this is nice, but I can also question the following things.
51:12
Why beta, for one?
51:14
The beta tells me something.
51:17
That's convenient, but then how do I pick A?
51:20
I know that A should definitely capture the fact that where
51:27
I want to have my p most likely located.
51:30
But it actually also captures
51:32
the variance of my beta.
51:34
And so choosing different As is going
51:36
to have different functions.
51:37
If I have A and B, If I started with the beta with parameter.
51:43
If I started with a B here, I would just pick up the B here.
51:48
Agreed?
51:49
And that would just be asymmetric.
51:51
But they're going to capture mean and variance
51:53
of this thing.
51:53
And so how do I pick those guys?
51:56
If I'm a doctor and you're asking me,
51:59
what do you think the chances of this drug working
52:01
in this kind of patients is?
52:03
And I have to spit out the parameters of a beta for you,
52:06
it might be a bit of a complicated thing to do.
52:08
So how do you do this, especially for problems?
52:10
So by now, people have actually mastered
52:14
the art of coming up with how to formulate those numbers.
52:19
But in new problems that come up, how do you do this?
52:21
What happens if you want to use Bayesian methods,
52:23
but you actually do not know what you expect to see?
52:30
To be fair, before we started this class, I hope all of you
52:33
had no idea whether people tend to bend their head to the right
52:36
or to the left before kissing.
52:38
Because if you did, well you have too much time
52:40
on your hands and I should double your homework.
52:42
52:44
So in this case, maybe you still want
52:46
to use the Bayesian machinery.
52:48
Maybe you just want to do something nice.
52:50
It's nice right, I mean it worked out pretty well.
52:53
What do you want to do?
52:54
Well you actually want to use some priors that
52:56
carry no information, that basically do not prefer
53:00
any theta to another theta.
53:02
Now, you could read this slide or you
53:05
could look at this formula.
53:06
53:10
We just said that this pi here was just here
53:14
to weigh some thetas more than others, depending
53:18
on our prior belief.
53:19
If our prior belief does not want
53:21
to put any preference towards some thetas than to others,
53:24
what do I do?
53:26
AUDIENCE: [INAUDIBLE]
53:27
PHILIPPE RIGOLLET: Yeah, I remove it.
53:29
And the way to remove something we multiply by,
53:31
is just replace it by one.
53:32
That's really what we're doing.
53:35
If this was a constant not depending on theta,
53:38
then that would mean that we're not preferring any theta.
53:41
And we're looking at the likelihood.
53:44
But not as a function that we're trying to maximize,
53:46
but as a function that we normalize in such a way
53:50
that it's actually a distribution.
53:52
So if I have pi, which is not here,
53:54
this is really just taking the likelihood,
53:56
which is a positive function.
53:57
It may not integrate to 1, so I normalize it
53:59
so that it integrates to 1.
54:02
And then I just say, well this is my posterior distribution.
54:05
Now I could just maximize this thing
54:06
and spit out my maximum likelihood estimator.
54:09
But I can also integrate and find
54:10
what the expectation of this guy is.
54:12
I can find what the median of this guy is.
54:14
I can sample data from this guy.
54:16
I can build, understand what the variance of this guy is.
54:19
Which is something we did not do when we just did
54:21
maximum likelihood estimation because given a function, all
54:24
we cared about was the arg max of this function.
54:27
54:31
These priors are called uninformative.
54:36
This is just replacing this number by one or by a constant.
54:43
Because it still has to be a density.
54:45
54:49
If I have a bounded set, I'm just
54:50
looking for the uniform distribution
54:52
on this bounded set, the one that puts constant one
54:56
over the size of this thing.
54:59
But if I have an unbounded set, what
55:01
is the density that takes a constant value
55:03
on the entire real line, for example?
55:07
What is this density?
55:08
55:13
AUDIENCE: [INAUDIBLE]
55:16
PHILIPPE RIGOLLET: Doesn't exist, right?
55:18
It just doesn't exist.
55:20
The way you can think of it is a Gaussian
55:22
with the variance going to infinity, maybe,
55:24
or something like this.
55:26
But you can think of it in many ways.
55:27
You can think of the limit of the uniform between minus T
55:32
and T, with T going to infinity.
55:34
But this thing is actually zero.
55:36
There's nothing there.
55:39
You can actually still talk about this.
55:41
You could always talk about this thing, where
55:44
you think of this guy as being a constant,
55:46
remove this thing from this equation, and just say,
55:49
well my posterior is just the likelihood
55:51
divided by the integral of the likelihood over theta.
55:54
And if theta is the entire real line, so be it.
55:58
As long as this integral converges,
56:00
you can still talk about this stuff.
56:01
56:04
This is what's called an improper prior.
56:06
56:09
An improper prior is just a non-negative function defined
56:11
on the set of thetas, but it does not have to integrate to one,
56:17
or to anything finite.
56:18
56:20
If I integrate the function equal to 1
56:22
on the entire real line, what do I get?
56:24
56:27
Infinity.
56:28
56:32
It's not a proper prior, and it's called an improper prior.
56:35
And those improper priors are usually
56:39
what you see when you start to want non-informative priors
56:42
on infinite sets of thetas.
56:44
That's just the nature of it.
56:46
You should think of them as being the uniform distribution
56:50
of some infinite set, if that thing were to exist.
56:52
56:56
Let's see some examples about non-informative priors.
57:01
If I'm in the interval 0, 1 this is a finite set.
57:04
So I can talk about the uniform prior
57:07
on the interval 0, 1 for a parameter, p of a Bernoulli.
57:10
57:26
If I want to talk about this, then it
57:28
means that my prior is p follows some uniform on the interval
57:35
0, 1.
57:37
So that means that f of x is 1 if x is in [0, 1], and 0
57:48
otherwise. There is actually not even a normalization.
57:52
This thing integrates to 1.
57:53
And so now if I look at my likelihood,
57:56
it's still the same thing.
57:57
So my posterior becomes pi of theta given X1, Xn.
58:04
That's my posterior.
58:07
I don't write the likelihood again,
58:08
because we still have it--
58:09
well we don't have it here anymore.
58:11
58:15
The likelihood is given here.
58:17
Copy, paste over there.
58:20
The posterior is just this thing times 1.
58:23
So you will see it in a second.
58:24
So it's p to the power sum of the Xi's, one minus p
58:28
to the power, n minus sum of the Xi's.
58:31
And then it's multiplied by 1, and then divided by this
58:36
integral between 0 and 1 of p to the sum of the Xi's,
58:42
times 1 minus p to the n minus sum of the Xi's,
58:47
dp, which does not depend on p.
58:51
And I really don't care what the thing actually is.
58:53
58:58
That's posterior of p.
59:03
And now I can see, well what is this?
59:06
It's actually just the beta with parameters.
59:12
This guy plus 1.
59:14
59:19
And this guy plus 1.
59:21
59:34
I didn't tell you what the expectation of a beta was.
59:38
We don't know what the expectation of a beta
59:39
is, agreed?
59:42
If I wanted to find say, the expectation of this thing that
59:45
would be some good estimator, we know
59:47
that the maximum of this guy-- what
59:49
is the maximum of this thing?
59:51
59:54
Well, it's just this thing, it's the average of the Xi's.
59:57
That's just the maximum likelihood estimator
59:59
for Bernoulli.
60:00
We know it's the average.
60:01
Do you think if I take the expectation of this thing,
60:03
I'm going to get the average?
60:05
60:13
So actually, I'm not going to get the average.
60:15
I'm going to get this guy divided by this guy plus this guy: sum of the Xi's plus 1, over n plus 2.
60:19
60:27
Let's look at what this thing is doing.
60:28
It's looking at the number of ones and it's adding one.
60:34
And this guy is looking at the number of zeros
60:36
and it's adding one.
60:39
Why is it adding this one?
60:41
What's going on here?
60:42
60:47
This is going to matter mostly when the number of ones
60:52
is actually zero, or the number of zeros is zero.
60:56
Because what it does is just push the estimate away from zero.
61:00
And why is that something that this Bayesian method actually
61:03
does for you automatically?
61:04
It's because when we put this non-informative
61:06
prior on p, which was uniform on the interval 0, 1.
61:11
In particular, we know that the probability
61:12
that p is equal to 0 is zero.
61:16
And the probability p is equal to 1 is zero.
61:19
And so the problem is that if I did not
61:21
add this 1, then with some positive probability
61:24
I would be allowed to spit out something that actually had
61:28
p hat, which was equal to 0.
61:30
If by chance, let's say I have n is equal to 3,
61:33
and I get only 0, 0, 0, that could happen with probability
61:37
1 minus p, cubed.
61:41
61:46
That's not something that I want.
61:47
And I'm using my priors.
61:49
My prior is not informative, but somehow it captures the fact
61:51
that I don't want to believe p is going
61:53
to be either equal to 0 or 1.
61:56
So that's sort of taken care of here.
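The effect of the two added ones can be checked on the all-zeros sample from the example. This sketch uses the standard fact that a Beta(alpha, beta) distribution has mean alpha / (alpha + beta); the function names are hypothetical.

```python
from fractions import Fraction

def mle(observations):
    """Maximum likelihood estimate of p: the sample average."""
    return Fraction(sum(observations), len(observations))

def posterior_mean_uniform_prior(observations):
    """Mean of Beta(ones + 1, zeros + 1), the posterior under a Uniform(0, 1) prior."""
    ones = sum(observations)
    n = len(observations)
    return Fraction(ones + 1, n + 2)

all_zeros = [0, 0, 0]
mle_estimate = mle(all_zeros)                             # exactly 0
bayes_estimate = posterior_mean_uniform_prior(all_zeros)  # 1/5, strictly inside (0, 1)
```

With three zeros the maximum likelihood estimate sits on the boundary, while the posterior mean stays strictly between 0 and 1.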
61:59
So let's move away a little bit from the Bernoulli example,
62:05
shall we?
62:06
I think we've seen enough of it.
62:08
And so let's talk about the Gaussian model.
62:10
Let's say I want to do Gaussian inference.
62:12
62:17
I want to do inference in a Gaussian model,
62:19
using Bayesian methods.
62:20
62:30
What I want is that X1, ..., Xn are, say, N(0, 1) iid.
62:39
62:44
Sorry, N(theta, 1), iid conditionally on theta.
62:47
62:50
That means that pn of X1, Xn, given theta
62:56
is equal to exactly what I wrote before.
62:58
So 1 over square root of 2 pi, to the n, times exponential of minus one half
63:04
sum of Xi minus theta squared.
63:09
So that's just the joint distribution
63:11
of my Gaussian with mean theta.
63:13
And now the question is, what
63:14
is the posterior distribution?
63:17
Well here I said, let's use the uninformative prior,
63:22
which is an improper prior.
63:23
It puts weight on everyone.
63:25
That's the so-called uniform on the entire real line.
63:29
So that's certainly not a density.
63:31
But I can still just use this.
63:34
So all I need to do is get this divided
63:40
by normalizing this thing.
63:44
But if you look at this, essentially I
63:47
want to understand.
63:49
So this is proportional to the exponential
63:52
minus one half sum from I equal 1
63:55
to n of Xi minus theta squared.
63:58
And now I want to see this thing as a density,
64:01
not on the Xi's but on theta.
64:03
64:06
What I want is a density on theta.
64:10
So it looks like I have chances of getting something
64:13
that looks like a Gaussian.
64:16
To have a Gaussian, I would need to see minus one half.
64:19
And then I would need to see theta minus something
64:21
here, not just the sum of something minus thetas.
64:25
So I need to work a little bit more,
64:29
to expand the square here.
64:31
So this thing here is going to be
64:32
equal to exponential minus one half sum from I equal 1
64:37
to n of Xi squared minus 2Xi theta plus theta squared.
64:45
65:10
Now what I'm going to do is, everything remember
65:13
is up to this little sign.
65:15
So every time I see a term that does not depend on theta,
65:19
I can just push it in there and just make it disappear.
65:22
Agreed?
65:24
This term here, exponential minus one half sum of Xi
65:28
squared, does it depend on theta?
65:31
No.
65:32
So I'm just pushing it here.
65:33
This guy, yes.
65:34
And the other one, yes.
65:35
So this is proportional to exponential sum of the Xi.
65:45
And then I'm going to pull out my theta, the minus one half
65:47
canceled with the minus 2.
65:50
And then I have minus one half sum from I
65:56
equal 1 to n of theta squared.
65:58
66:01
Agreed?
66:03
So now what this thing looks like,
66:05
this looks very much like some theta minus something squared.
66:09
This thing here is really just n over 2 times theta.
66:15
66:18
Sorry, times theta squared.
66:21
So now what I need to do is to write this of the form, theta
66:25
minus something.
66:26
Let's call it mu, squared, divided by 2 sigma squared.
66:31
I want to turn this into that, maybe up to terms
66:34
that do not depend on theta.
66:36
That's what I'm going to try to do.
66:39
So that's called completing the square.
66:40
That's some exercises you do.
66:42
You've done it probably, already in the homework.
66:44
And that's something you do a lot when
66:46
you do Bayesian statistics, in particular.
66:48
So let's do this.
66:50
What is going to be the leading term?
66:51
Theta squared is going to be multiplied by this thing.
66:54
So I'm going to pull out my n over 2.
66:57
And then I'm going to write this as minus n over 2.
67:03
And then I'm going to write theta minus something squared.
67:06
And this something is going to be one half of what
67:08
I see in the cross-product.
67:10
67:12
I need to actually pull this thing out.
67:14
So let me write it like that first.
67:18
So that's theta squared.
67:21
And then I'm going to write it as minus 2 times 1 over n sum
67:30
from I equal 1 to n of Xi's times theta.
67:36
That's exactly just a rewriting of what we had before.
67:39
And that should look much more familiar.
67:41
67:44
A squared minus 2 blap A, and then I missed something.
67:49
So this thing, I'm going to be able to rewrite
67:51
as theta minus Xn bar squared.
67:57
But then I need to remove the square of Xn bar.
68:00
Because it's not here.
68:01
68:09
So I just complete the square.
68:11
And then I really don't care what this thing actually
68:13
was, because it's going to go again in the little alpha
68:16
sign over there.
68:18
So this thing eventually is going
68:19
to be proportional to exponential
68:24
of minus n over 2 times theta minus Xn bar, squared.
68:31
And so we know that if this is a density that's
68:33
proportional to this guy, it has to be some Gaussian with mean Xn bar.
68:44
And variance, this is supposed to be 1 over sigma squared.
68:47
This guy over here, this n.
68:49
So that's really just 1 over n.
68:50
68:53
So the posterior distribution is a Gaussian
69:01
centered at the average of my observations.
69:05
And with variance, 1 over n.
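For reference, the whole completing-the-square computation can be written in one chain. This is a reconstruction from the spoken steps, with \bar{X}_n denoting the average of the observations; every proportionality sign drops a factor that does not depend on theta.

```latex
\begin{aligned}
\pi(\theta \mid X_1, \dots, X_n)
  &\propto \exp\!\left( -\frac{1}{2} \sum_{i=1}^{n} (X_i - \theta)^2 \right) \\
  &= \exp\!\left( -\frac{1}{2} \sum_{i=1}^{n} X_i^2
       + \theta \sum_{i=1}^{n} X_i - \frac{n}{2}\,\theta^2 \right) \\
  &\propto \exp\!\left( -\frac{n}{2} \left( \theta^2 - 2 \bar{X}_n \theta \right) \right) \\
  &= \exp\!\left( -\frac{n}{2} (\theta - \bar{X}_n)^2 + \frac{n}{2} \bar{X}_n^2 \right) \\
  &\propto \exp\!\left( -\frac{n}{2} (\theta - \bar{X}_n)^2 \right),
\end{aligned}
\qquad\text{so}\qquad
\theta \mid X_1, \dots, X_n \sim \mathcal{N}\!\left( \bar{X}_n, \tfrac{1}{n} \right).
```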
69:08
69:13
Everybody's with me?
69:14
69:16
Why I'm saying this, this was the output of some computation.
69:19
But it sort of makes sense, right?
69:21
It's really telling me that the more observations I have,
69:24
the more concentrated this posterior is.
69:26
Concentrated around what?
69:27
Well around this Xn bar.
69:30
That looks like something we've sort of seen before.
69:33
But it does not have the same meaning, somehow.
69:35
This is really just the posterior distribution.
69:37
69:40
It's sort of a sanity check, that I have this 1 over n
69:43
when I have Xn bar.
69:44
But it's not the same thing as saying
69:45
that the variance of Xn bar was 1 over n, like we had before.
69:48
69:55
As an exercise, I would recommend
69:59
if you don't get it, just try pi of theta
70:10
to be equal to some N(mu, 1).
70:15
70:18
Here, the prior that we used was completely non-informative.
70:22
What happens if I take my prior to be some Gaussian, which
70:25
is centered at mu and it has the same variance
70:27
as the other guys?
70:30
So what's going to happen here is that we're
70:32
going to put a weight.
70:33
And everything that's away from mu
70:34
is going to actually get less weight.
70:38
I want to know how I'm going to be updating
70:40
this prior into a posterior.
70:41
70:44
Everybody sees what I'm saying here?
70:47
So that means that pi of theta has the density proportional
70:50
to exponential minus one half theta minus mu squared.
70:55
So I need to multiply my posterior with this,
71:00
and then see.
71:01
It's actually going to be a Gaussian.
71:03
This is also a conjugate prior.
71:04
It's going to spit out another Gaussian.
71:06
You're going to have to complete a square again, and just check
71:09
what it's actually giving you.
71:10
And so spoiler alert, it's going to look
71:12
like you get an extra observation, which is actually
71:14
equal to mu.
71:15
71:18
It's going to be the average of n plus 1 observations.
71:22
The first n ones being X1 to Xn.
71:24
And then, the last one being mu.
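A quick numerical check of this spoiler, assuming the N(mu, 1) prior and N(theta, 1) observations of the exercise. The closed form comes from completing the square once more; the function names, the grid bounds and the sample values are hypothetical choices for illustration.

```python
import math

def gaussian_posterior(xs, mu):
    """Posterior of theta for X_i | theta ~ N(theta, 1) iid and prior theta ~ N(mu, 1).

    Completing the square gives N((sum(xs) + mu) / (n + 1), 1 / (n + 1)).
    """
    n = len(xs)
    return (sum(xs) + mu) / (n + 1), 1.0 / (n + 1)

def grid_posterior_mean(xs, mu, lo=-10.0, hi=10.0, points=20001):
    """Numerical check: normalize likelihood * prior on a grid and take the mean."""
    step = (hi - lo) / (points - 1)
    thetas = [lo + k * step for k in range(points)]
    weights = [
        math.exp(-0.5 * sum((x - t) ** 2 for x in xs) - 0.5 * (t - mu) ** 2)
        for t in thetas
    ]
    return sum(t * w for t, w in zip(thetas, weights)) / sum(weights)

xs = [1.2, 0.7, 2.1, 1.5]  # hypothetical observations
mu = 0.0                   # hypothetical prior mean
closed_form_mean, closed_form_variance = gaussian_posterior(xs, mu)
numerical_mean = grid_posterior_mean(xs, mu)
```

The posterior mean is the average of the n observations together with mu, treated as one extra data point, and the variance shrinks from 1 over n to 1 over n plus 1.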
71:27
And it sort of makes sense.
71:30
That's actually a fairly simple exercise.
71:34
Rather than going into more computation,
71:36
this is something you can definitely
71:37
do when you're in the comfort of your room.
71:41
I want to talk about other types of priors.
71:43
The first thing I said is, there's this beta prior
71:47
that I just pulled out of my hat and that was just convenient.
71:50
Then there was this non-informative prior.
71:52
It was convenient.
71:53
It was non-informative, so if you don't know anything
71:56
else maybe that's what you want to do.
71:58
The question is, are there any other priors that
72:01
are sort of principled and generic, in the sense
72:04
that the uninformative prior was generic, right?
72:08
It was equal to 1, that's as generic as it gets.
72:11
So is there anything that's generic as well?
72:14
Well, there are these priors that are called Jeffreys priors.
72:17
And the Jeffreys prior is proportional to the square root
72:20
of the determinant of the Fisher information of theta.
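In symbols, the definition just stated reads as follows. The subscript J and the symbol I(theta) for the Fisher information are notational assumptions; in one dimension the determinant is simply the number I(theta).

```latex
\pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)},
\qquad
I(\theta) = -\,\mathbb{E}_\theta\!\left[ \nabla_\theta^2 \log p(X \mid \theta) \right].
```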
72:23
72:26
This is actually a weird thing to do.
72:28
It says, look at your model.
72:31
Your model is going to have a Fisher information.
72:34
Let's say it exists.
72:34
72:38
Because we know it does not always exist.
72:39
For example, in the multinomial model,
72:41
we didn't have a Fisher information.
72:44
The determinant of a matrix is somehow
72:46
measuring the size of a matrix.
72:48
If you don't trust me, just think
72:50
about the matrix being of size one by one,
72:53
then the determinant is just the number that you have there.
72:56
And so this is really something that looks like the Fisher
73:00
information.
73:01
73:04
It's proportional to the amount of information
73:06
that you have at a certain point.
73:09
And so what my prior is saying is, well,
73:12
I want to put more weights on those thetas that
73:14
are going to just extract more information from the data.
73:17
73:20
You can actually compute those things.
73:22
In the first example, the Jeffreys prior
73:26
is something that looks like this.
73:28
In one dimension, Fisher information
73:30
is essentially one over the variance.
73:33
That's just 1 over the square root of the variance,
73:35
because I have the square root.
73:37
And when I compute the Jeffreys prior in the Gaussian
73:45
case, the Fisher information is the identity matrix.
73:48
That's what I would have in the Gaussian case.
73:50
The determinant of the identity is 1.
73:52
So square root of 1 is 1, and so I would basically get 1.
73:56
And that gives me my improper prior, my uninformative prior
73:59
that I had.
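As a rough numerical sketch of the two cases just mentioned: the function names below are made up for illustration, and the Gaussian model is assumed to be N(theta, 1) in one dimension. For Bernoulli, the Fisher information of one observation is 1 / (p (1 - p)), so its square root is 1 / sqrt(p (1 - p)), which after normalizing by pi is the Beta(1/2, 1/2) density.

```python
import math

def bernoulli_fisher_information(p):
    """Fisher information of one Bernoulli(p) observation: 1 / (p (1 - p))."""
    return 1.0 / (p * (1.0 - p))

def jeffreys_bernoulli_unnormalized(p):
    """Square root of the Fisher information: 1 / sqrt(p (1 - p))."""
    return math.sqrt(bernoulli_fisher_information(p))

def jeffreys_bernoulli_density(p):
    """Normalized prior. The integral of 1 / sqrt(p (1 - p)) over (0, 1)
    equals pi, so this is the Beta(1/2, 1/2) density."""
    return jeffreys_bernoulli_unnormalized(p) / math.pi

def jeffreys_gaussian_unnormalized(theta):
    """Assumed model N(theta, 1): the Fisher information is 1 for every
    theta, so the prior is flat (and improper on the whole real line)."""
    return 1.0

# The Bernoulli prior is smallest at p = 1/2 and blows up near 0 and 1.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, jeffreys_bernoulli_density(p))
```

The printed values show the U shape: the density is lowest at p = 1/2 and grows toward the boundary.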
74:01
So the uninformative prior 1 is fine.
74:03
Clearly, all the thetas carry the same information
74:06
in the Gaussian model.
74:08
Whether I translate it here or here,
74:10
it's pretty clear none of them is actually
74:12
better than the other.
74:13
But clearly for the Bernoulli case,
74:16
the p's that are closer to the boundary carry
74:22
more information.
74:23
I sort of like those guys, because they just
74:26
carry more information.
74:27
So what I do is, I take this function.
74:29
So p times 1 minus p.
74:30
Remember, it's something that looks like this.
74:34
On the interval 0, 1.
74:35
74:38
This guy, 1 over the square root of p times 1 minus p,
74:40
is something that looks like this.
74:42
74:45
Agreed?
74:47
What it's doing is it sort of wants to push
74:49
towards the p's that actually carry more information.
74:54
Whether you want to bias your data that
74:56
way or not, is something you need to think about.
74:59
When you put a prior on your data, on your parameter,
75:01
you're sort of biasing your data towards this idea.
75:06
That's maybe not such a good idea,
75:07
when you have some p that's actually close to one half,
75:13
for example.
75:13
You're actually saying, no I don't
75:14
want to see a p that's close to one half.
75:16
Just make a decision, one way or another.
75:18
But just make a decision.
75:19
So it's forcing you to do that.
75:20
75:23
Jeffreys prior, I'm running out of time
75:26
so I don't want to go into too much detail.
75:29
We'll probably stop here, actually.
75:31
75:44
So Jeffreys priors have this very nice property.
75:47
It's that they actually do not care about the parameterization
75:51
of your space.
75:53
If you actually have p and you suddenly
75:56
decide that p is not the right parameter for Bernoulli,
75:58
but it's p squared,
76:00
then you could decide to parameterize this by p squared.
76:03
Maybe your doctor is actually much more able
76:05
to formulate some prior assumption on p squared,
76:08
rather than p.
76:09
You never know.
76:11
And so what happens is that Jeffreys priors
76:14
are invariant to this.
76:15
And the reason is because the information carried by p
76:18
is the same as the information carried by p squared, somehow.
76:21
76:28
They're essentially the same thing.
76:30
76:32
You need to have a one-to-one map,
76:34
where basically for each parameter from before
76:37
you have another parameter.
76:39
Let's call Eta the new parameter.
76:40
76:45
The PDF of the new prior indexed by Eta this time
76:50
is actually also a Jeffreys prior.
76:52
But this time, the new Fisher information
76:55
is not the Fisher information with respect to theta,
76:57
but the Fisher information associated
77:00
to the statistical model indexed by Eta.
77:03
So essentially, when you change the parameterization
77:08
of your model, you still get the Jeffreys prior
77:10
for the new parameterization.
77:12
Which is, in a way, a desirable property.
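One way to check this invariance numerically is sketched below for the p squared example. The function names are hypothetical, and the map eta = p^2 on (0, 1) is assumed to be the reparameterization. The sketch computes the Jeffreys prior in Eta in two ways: directly from the Fisher information of the model indexed by Eta, and by pushing the prior on p through the change of variables (multiplying by |dp/d eta|). The two should agree.

```python
import math

def fisher_info_p(p):
    """Fisher information of one Bernoulli observation, parameterized by p."""
    return 1.0 / (p * (1.0 - p))

def jeffreys_p(p):
    """Unnormalized Jeffreys prior in the p parameterization."""
    return math.sqrt(fisher_info_p(p))

def fisher_info_eta(eta):
    """Fisher information of the same model parameterized by eta = p^2,
    computed directly as the expected squared score in eta."""
    p = math.sqrt(eta)                    # P(X = 1) as a function of eta
    dp_deta = 1.0 / (2.0 * p)             # derivative of sqrt(eta)
    score_if_one = dp_deta / p            # d/d eta of log p(eta)
    score_if_zero = -dp_deta / (1.0 - p)  # d/d eta of log(1 - p(eta))
    return p * score_if_one ** 2 + (1.0 - p) * score_if_zero ** 2

def jeffreys_eta_direct(eta):
    """Jeffreys prior built from the Fisher information in eta."""
    return math.sqrt(fisher_info_eta(eta))

def jeffreys_eta_by_change_of_variables(eta):
    """Density in eta obtained by transforming the Jeffreys prior on p."""
    p = math.sqrt(eta)
    return jeffreys_p(p) * abs(1.0 / (2.0 * p))

for eta in (0.04, 0.25, 0.81):
    print(eta, jeffreys_eta_direct(eta), jeffreys_eta_by_change_of_variables(eta))
```

By contrast, the flat prior equal to 1 on p does not stay flat after the same change of variables, since it picks up the factor 1 / (2 sqrt(eta)).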
77:15
77:19
Jeffreys prior is just an uninformative prior,
77:21
or a prior you want to use when you
77:24
want a systematic way without really thinking about what
77:26
to pick for your model.
77:27
77:35
I'll finish this next time.
77:37
And we'll talk about Bayesian confidence regions.
77:39
We'll talk about Bayesian estimation.
77:41
Once I have a posterior, what do I get?
77:44
And basically, the only message is
77:45
going to be that you might want to integrate
77:47
against the posterior.
77:48
Find the posterior, the expectation of your posterior
77:51
distribution.
77:52
That's a good point estimator for theta.
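As a small preview of what that can look like, here is a sketch under stated assumptions: Bernoulli observations coded as 0 or 1, and the Jeffreys prior Beta(1/2, 1/2) on p. With a beta prior the posterior is again a beta distribution, here Beta(1/2 + number of ones, 1/2 + number of zeros), so its expectation has a closed form. The function name is made up for illustration.

```python
def posterior_mean_jeffreys_bernoulli(observations):
    """Posterior mean of p for 0/1 observations under a Beta(1/2, 1/2) prior.

    The posterior is Beta(a, b) with a = 1/2 + (number of ones) and
    b = 1/2 + (number of zeros); the mean of Beta(a, b) is a / (a + b).
    """
    n = len(observations)
    ones = sum(observations)
    a = 0.5 + ones
    b = 0.5 + (n - ones)
    return a / (a + b)

# Three ones out of four observations: (3 + 1/2) / (4 + 1).
print(posterior_mean_jeffreys_bernoulli([1, 1, 0, 1]))
```

With no data at all the estimate is the prior mean 1/2, and as n grows it gets close to the sample average.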
77:54
77:56
We'll just do a couple of computations.
78:01