Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: It doesn't want to run Flash Player, so I had to run them on Chrome. All right, so let's move on to our second chapter. And hopefully, in this chapter, you will feel a little better if you felt like it was going a bit fast in the first chapter. And the main reason-- we actually went fast, especially in terms of confidence intervals. Some of you came and asked me, what do you mean, this is a confidence interval? What does it mean that it's in there with probability 95%, et cetera? I just went really fast because I didn't want to give you a first week doing probability only, without understanding what the statistical context for it was. So hopefully, all these things that we've done in terms of probability, you actually know why we've been doing them. And so we're basically going to go back to what we were doing, maybe start with some statistical setup. But the goal of this lecture is really going to go back again to what we've seen, from a purely statistical perspective. All right?

So the first thing we're going to do is explain why we're doing statistical modeling, right? So in practice, if you have data, if you observe a bunch of points-- here, I gave you some numbers, for example. So here's a partial data set with the number of siblings, including self, that was collected from college students a few years back. So I was teaching a class like yours and actually asked students to go and fill out some Google form and tell me a bunch of things. And one of the questions was, including yourself, how many siblings do you have? And so they gave me this list of numbers, right? And there's many ways I can think of this list of numbers, right? I could think of it as being just a discrete distribution on the set of numbers between 1-- I know there's not going to be an answer which is less than 1, unless, well, someone doesn't understand the question. But all the answers I should get are positive integers-- 1, 2, 3, et cetera. And there probably is an upper bound, but I don't know it off the top of my head. So maybe I should say 100. Maybe I should say 15. It depends, right? And so I think the largest number I got for this was 6. All right? So here you can see you have pretty standard families, you know, lots of 1s, 2s, and 3s.

What statistical modeling is doing is trying to compress this information that I could actually describe in a very naive way. So let's start with the basic, usual statistical setup, right? So many of the boards will start with something that looks like X1, ..., Xn, random variables. And what I'm going to assume, as we said, typically, is that those guys are IID. And they have some distribution, all right? So they all share the same distribution.
And the fact that they're IID is so that I can actually do statistics. Statistics means looking at the global averaging thing, so that I can actually get a sense of what the global behavior is for the population, right? If I start assuming that those things are not identically distributed-- they all live on their own-- that my sequence of numbers is your number of siblings, the shoe size of this person, the depth of the Charles River, and I start measuring a bunch of stuff-- there's nothing I can actually get together. I need to have something that's cohesive. And so here, I collected some data that was cohesive.

And so the goal here-- the first thing is to say, what is the distribution that I actually have here, right? So I could actually be very general. I could just say it's some distribution P. And let's say those are random variables, not random vectors, right? I could collect entire vectors about students, but let's say those are just random variables. And so now I can start making assumptions on this distribution P, right? What can I say about a distribution? Well, maybe if those numbers are continuous, for example, I could assume they have a density-- a probability density function. That's already an assumption. Maybe I could start to assume that their probability density function is smooth. That's another assumption. Maybe I could actually assume that it's piecewise constant. That's even better, right? And those things make my life simpler and simpler, because what I do by making these successive assumptions is reducing the degrees of freedom of the space in which I am actually searching for the distribution. And so what we actually want is to have something which is small enough so that we can actually have some averaging going on. But we also want something which is big enough that it can actually express-- that it has a chance of actually containing a distribution that makes sense for us.

So let's start with the simplest possible example, which is when the Xi's belong to {0, 1}. And as I said, here, we don't have a choice. The distribution of those guys has to be Bernoulli. And since they are IID, they all share the same p. So that's definitely the simplest possible thing I could think of. They are just Bernoulli p. And so all I would have to figure out in this case is p. And this is the simplest case. And unsurprisingly, it has the simplest answer, right? We will come back to this example when we study maximum likelihood estimators or estimators by the method of moments. But at the end of the day, the thing that we did-- the thing that we will do-- is always the naive estimator you would come up with, which is the proportion of 1s. And this will be, in pretty much all respects, the best estimator you can think of. All right? So then we're going to try to assess its performance. And we saw how to do that in the first chapter as well. So this problem here somehow is completely understood. We'll come back to it. But there's nothing fancy that is going to happen. But now, I could have some more complicated things.
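As a quick aside, here is a minimal sketch of that "proportion of 1s" estimator for the Bernoulli model; the simulated sample and the function name are made up for illustration and are not from the lecture.

import random

def proportion_of_ones(xs):
    # Naive estimator of p in the Bernoulli(p) model:
    # the proportion of 1s, i.e. the sample mean of 0/1 data.
    return sum(xs) / len(xs)

# Hypothetical IID Bernoulli(p) sample with true p = 0.3.
random.seed(0)
true_p = 0.3
sample = [1 if random.random() < true_p else 0 for _ in range(200)]
print(proportion_of_ones(sample))  # should come out close to 0.3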
For example, in the example of the students now, my Xi's belong to the sequence of integers 1, 2, 3, et cetera, OK, which is also denoted by N-- maybe without 0, or with 0 if you want to put 0 in there, right? So the positive integers. Or I could actually just maybe put in some prior knowledge about how humans have time to have families. But maybe some people thought of their college mates as being their brothers and sisters, and one student would actually put 465 siblings, because we're all good friends. Or maybe they actually think that all their Facebook contacts are actually their siblings. And so you never know what's going to happen. So maybe you want to account for this, but maybe you know that people are reasonable, and they will actually give you something like this.

Now intuitively, maybe you would say, well, why would you bother doing this if you're not really sure about the 20? But I think that probably all of you intuitively guess that it is probably a good idea to start putting in this kind of assumption rather than allowing for any number in the first place, because this eventually will be injected into the precision of our estimators. If I allow anything, it's going to be more complicated for me to get an accurate estimator. If I know that the numbers are either 1 or 2, then I'm actually going to be slightly more accurate as well. All right? Because I know that if, for example, somebody put a 5, I can remove it. Then it's not going to actually corrupt my estimator.

All right, so now, let's say we actually agree that we have numbers. And here I put seven numbers, OK? So I just said, well, let's assume that the numbers I'm going to get are going to be 1 all the way to, say, this number that I denote by "larger than or equal to 7," which is a placeholder for any number that is 7 or more, OK? Because I know maybe I don't want to distinguish between people that have 9 or 25 siblings. OK, and so now, this is a distribution on seven possible values-- a discrete distribution. And you know from your probability class that the way you describe this distribution is using the probability mass function, OK, or PMF. So that's how we describe a discrete distribution. And the PMF is just a list of numbers, right? So as I wrote here, you have a list of numbers. And here, you write the possible values that your random variable can take. And here you write the probability that your random variable takes this value. So the possible values being 1, 2, 3, all the way to larger than or equal to 7. And then I'm trying to estimate those numbers. Right? If I give you those numbers, at least up to this, you know, compression of all numbers that are larger than or equal to 7, you have the full description of your distribution. And that is the ultimate goal of statistics, right? The ultimate goal of statistics is to say what distribution your data came from, because that's basically the best you're going to be able to do.
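To make that "list of numbers" concrete, here is a minimal sketch of estimating the PMF (p1, ..., p>=7) by the observed proportions; the sibling counts below are invented for illustration and are not the actual class data.

from collections import Counter

# Hypothetical sibling counts (not the real class data).
data = [2, 3, 2, 1, 2, 3, 2, 1, 1, 2, 2, 3, 1, 6, 2, 1, 2, 3, 2, 2]

n = len(data)
counts = Counter(min(x, 7) for x in data)  # lump everything >= 7 together
pmf_hat = {k: counts.get(k, 0) / n for k in range(1, 8)}
print(pmf_hat)  # values never observed (like 4 and 5) get the estimate 0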
Now admittedly, if I started looking at the fraction of 1s, and the fraction of 2s, and the fraction of 3s, et cetera, I would actually eventually get those numbers-- just like looking at the fraction of 1s gave me a good estimate for p in the Bernoulli case, it would do the same in this case, right? It's a pretty intuitive idea. It's just the law of large numbers. Everybody agrees with that? If I look at the proportion of 1s, the proportion of 2s, the proportion of 3s, that should actually give me something that gets closer and closer, as my sample size increases, to what I want.

But the problem is, if my sample size is not huge-- here I have seven numbers to estimate. And if I have 20 observations, the ratio is not really in my favor-- 20 observations to estimate seven parameters-- some of them are going to be pretty off, typically the ones with the large values. If you have only 20 students, look at the list of numbers. I don't know how many numbers I have, but it probably is close to 20-- maybe 15 or something. And so if you look at this list, nobody's actually-- nobody has four or more siblings, right? There's no such person. So that means that eventually, from this data set, my estimates-- so those numbers I denote by, say, p1, p2, p3, et cetera-- those estimates-- p4 hat would be equal to what from this data? 0, right? And p5 hat would be equal to 0, and p6 hat would be equal to 0. And p larger than or equal to 7 hat would be equal to 0. That would be my estimate from this data set.

So maybe this is not-- maybe I want to actually pull some information from the people who have fewer siblings to try to make a guess, which is probably slightly better for the larger values, right? It's pretty clear that on average, there is more than 0-- the proportion of the population of households that have four children or more is definitely more than 0, all right? So it means that my data set is not representative. So what I'm going to try to do is to find a model that tries to use the data I have for the smaller values that I can observe and just push it up to the other ones. And so what we can do is to just reduce those parameters into something that's understood. And this is part of the modeling that I talked about in the first place.

Now, how do you succinctly describe a number of something? Well, one thing that you do is the Poisson distribution, right? Why do Poisson? There's many reasons. Again, that's part of statistical modeling. But once you know that you have a number of something that can be modeled by a Poisson, why not try a Poisson, right? You could just fit a Poisson. And the Poisson is something that looks like this. And I guess you've all seen it. But if X follows a Poisson distribution with parameter lambda, then the probability that X is equal to little x is equal to lambda to the x, over x factorial, e to the minus lambda. OK? And if you did the sheet that I gave you on the first day, you can check those numbers. So this is, of course, for x equals 0, 1, et cetera, right? So x is in the natural integers.
And if you sum from x equals 0 to infinity, this thing you get is e to the lambda. And so they cancel, and you have a sum which is equal to 1, which is indeed a PMF. But what's key about this PMF is that it never takes value 0. Like, this thing is always strictly positive. So whatever value of lambda I find from this data will give me something that's certainly more interesting than just putting the value 0. But more importantly, rather than having to estimate seven parameters and, as a consequence, to actually have to estimate 1, 2, 3, 4 of them as being equal to 0, I have only one parameter to estimate, which is lambda. The problem with doing this is that now lambda may not be just something as simple as computing the average number. Right? In this case, it will be. But in many instances, it's actually not clear that, with this parametrization by lambda that I chose, I'm going to be able to estimate lambda just by computing the average number that I get. It will be the case here. But if it's not, remember this example of the exponential we did in the last lecture-- we could use the delta method and things like that to estimate them.

All right, so here's modeling 101. So the purpose of modeling is to restrict the space of possible distributions to a subspace that's actually plausible, but much simpler for me to estimate. So we went from all distributions on seven values, which is a large space-- that's a lot of things-- to something which is just one number. This number is positive. Any question about the purpose of doing this?

OK, so we're going to have to do a little bit of formalism now. And so if we want to talk-- this is a statistics classroom. I'm not going to want to talk about the Poisson model specifically every single time. I'm going to want to talk about generic models. And then you're going to be able to plug in your favorite word-- Poisson, binomial, exponential, uniform-- all these words that you've seen, you're going to be able to plug in there. But we're just going to have some generic notation and some generic terminology for a statistical model. All right? So here is the formal definition. So I'm going to go through it with you.

OK, so the definition is that of a statistical model. Sorry, that's a statistical experiment, I should say. So a statistical experiment is actually just a pair-- E, and that's a set-- and a family of distributions P theta, where theta ranges in some set capital Theta. OK? So I hope you're up to date with your Greek letters. So the small theta is in the capital Theta. And I know I don't have the best handwriting, so if you don't see something, just ask me. And so this thing now-- so each of these guys is a probability distribution. All right? So for example, this could be a Poisson with parameter theta, or a Bernoulli with parameter theta-- OK, or an exponential with parameter-- I don't know-- 1 over theta squared if you want. OK, but they're just indexed by theta. But for each theta, this completely describes the distribution. It could be more complicated.
This theta could be a pair-- a (mu, sigma squared). And that could actually give you some N(mu, sigma squared). OK, so anything where, rather than actually giving you a full distribution, I can compress it into a parameter. But it could be worse. It could be this guy here. Right? Theta could be (p1, ..., p larger than or equal to 7). And my distribution could just be something that has PMF p1, ..., p larger than or equal to 7. That's another parameter. This one is seven-dimensional. This one is two-dimensional. And all these guys are just one-dimensional. All these guys are parameters. Is that clear? What's important here is that once they give you theta, you know exactly all the probabilities associated with this random variable. You know its distribution perfectly.

So this is the definition. Is that clear? Is there a question about this distribution-- about this definition, sorry? All right. So really, the key thing is the statistical model associated to a statistical experiment. OK? So let's just see some examples. It's probably just better, because, again, the formalism is never really clear. Actually, that's the next slide.

OK, so there's two things we need to assume. OK, so the purpose of a statistical model is, once I estimate the parameter, I actually know exactly what distribution it has, OK? So it means that I could potentially have several parameters that give me the same distribution, and that would still be fine, because I could estimate one guy, or I could estimate the other guy, and I would still recover the underlying distribution of my data. The problem is that this creates really annoying theoretical problems-- like, things don't work, the algorithms won't work, the guarantees won't work. And so what we typically assume is that the model is so-called well-specified. Sorry, that's not well-specified. I'm jumping ahead of myself.

OK, well-specified means that your data-- the distribution of your data-- is actually one of those guys. OK? So some vocabulary-- well-specified means that, for my observations X, there exists a theta in capital Theta such that X follows P sub theta. I should put a double bar. OK, so that's what well-specified means. So that means that the distribution of your actual data is just one of those guys. This is a bit strong of an assumption. It's strong in the sense that-- I don't know if you've heard of this sentence, which-- I can tell you who it's attributed to, but that probably means that this person did not come up with it. But it says that all models are wrong, but some of them are useful.

All right, so "all models are wrong" means that maybe it's not true that this Poisson distribution that I assume for the number of siblings of college students-- maybe that's not perfectly correct. Maybe there's a spike at three, right? Maybe there's a spike at one, because, you know, maybe those are slightly more educated families. They have fewer children. Maybe this is actually not exactly perfect.
But it's probably good enough for our purposes. And when we make this assumption, we're actually assuming that the data really comes from a Poisson model. There is a lot of research that goes on about misspecified models, and that tells you how well you're doing in the model that's the closest to the actual distribution. So that's pretty much it. Yeah?

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: So my data-- so X is always the way I denote one of the generic observations, right? So my observations are X1, ..., Xn. And they're IID with distribution P-- always. So X is just one of those guys. I don't want to write X5 or X4. They're IID, so they all have the same distribution. So OK-- no, no, no. They're all IID, so they all have the same P theta. They all have the same P, which means they all have the same P theta. So I can pick any one of them. So I just remove the index, just so we're clear. OK? So when I write X, I just mean, think of X1. Right, they're IID. I can pick whichever one I want. I'm not going to write X1. It's going to be weird. OK? Is that clear? OK.

So this particular theta is called the true parameter. Sometimes, since we're going to want to use theta as a running variable, we might denote it by theta star, as opposed to theta hat, which is always our estimator. But I'll keep it to be theta for now. And so the aim of this statistical experiment is to estimate theta, so that once I actually plug in theta in the form of my distribution-- for example, I could plug in theta here. So theta here was actually lambda. So once I estimate this guy, I would plug it in, and I would know the probability that my random variable takes any value, by just putting the lambda hat and the lambda hat here. OK? So my goal is going to be to estimate this guy so that I can actually compute those distributions.

But actually, we'll see, for example, when we talk about regression, that this parameter actually has a meaning in many instances. And so just knowing the parameter itself-- intuitively, or let's say more so than just computing probabilities-- will actually tell us something about the process. For example, we're going to run linear regression. And when we do linear regression, there's going to be some coefficients in the linear regression. And the value of this coefficient is actually telling me what is the sensitivity of the response that I'm looking at to this particular input. All right? So just knowing if this number is large or if this number is small is actually going to be useful for us-- just looking at this guy. All right? So there's going to be some instances where it's going to be important. Sometimes we're going to want to know if this parameter is larger or smaller than something, or if it's equal to something or not equal to something. And those things are also important-- for example, if theta actually measures the true-- right? So theta is the true unknown parameter-- the true efficacy of a drug. OK? Let's say I want to know what the true efficacy of a drug is.
And what I'm going to want to know is-- maybe it's a score. Maybe I'm going to want to know if theta is larger than 2. Maybe I want to know, if theta is the average number of siblings, is this true number larger than 2 or not? Right? Maybe I am interested in knowing if college students come from-- so maybe, from a sociological perspective, I'm interested in knowing if college students come from households with more than two children. All right, so those can be the questions that I may ask myself. I'm going to want to know, maybe, if theta is going to be equal to 1/2 or not. So maybe for a drug efficacy, is it completely standard-- maybe for elections. Is the proportion of the population that is going to vote for this particular candidate equal to 0.5? Or is it different from 0.5? OK, and I can think of different things. When I'm talking about regression, I'm going to want to test if this coefficient is actually 0 or not, because if it's 0, it means that the variable that's in front of it actually goes out. And so those are things we're testing. Actually having this very specific yes/no answer is going to give me a huge intuition, or huge understanding, of what's going on in the phenomenon that I observe. But actually, since the questions are so precise, I'm going to be much better at answering them rather than giving you an estimate for theta with some confidence around it.

All right, it's sort of the same principle as trying to reduce. What you're trying to do as a statistician is to inject as much knowledge about the question and about the problem as you can, so that the data has to do a minimal job. And henceforth, you actually need less data.

So from now on, we will always assume-- and this is because this is an intro stats class-- we will always assume that Theta, the set of parameters, is a subset of R to the d. That means that theta is a vector with at most a finite number of coordinates. Why do I say this? Well, this is called a parametric model. So it's called a parametric model, or sometimes parametric statistics. Actually, we don't really talk about parametric statistics. But we talk a lot about non-parametric statistics, or a non-parametric model. Can somebody think of a model which is non-parametric?

For example, in the siblings example, if I did not cap the number of siblings at 7, but I let this list go to infinity, I would have an infinite number of parameters to estimate. Very likely, the last ones would be 0. But still, I would have an infinite number of parameters to estimate. So this would not be a parametric model if I just let this list of things to be estimated be infinite. But there's other classes that are actually infinite and cannot be represented by vectors. For example, functions-- right? If I tell you my model, P_f, is just the distribution of X-- the probability distributions that have density f, right? So what I know is that the density is non-negative and that it integrates to one, right? That's all I know about densities.
Well, f is not something you're going to be able to describe with a finite number of values, right? All possible functions is a huge set. It's certainly not representable by 10 numbers. And so non-parametric estimation is typically when you actually want to parametrize this by a large class of functions. And so, for example, histograms are the prime tool of non-parametric estimation, because when you fit a histogram to data, you're trying to estimate the density of your data, but you're not trying to represent it as a finite number of points. That's really-- I mean, effectively, you have to represent it, right? So you actually truncate somewhere and just say those things are not going to matter. All right? But really, the key thing is that this is non-parametric, where you have a potentially infinite number of parameters. Whereas we're going to only talk about finite ones. And actually, finite in the overwhelming majority of cases is going to be 1. So Theta is going to be a subset of R1. OK, we're going to be interested in estimating one parameter, just like the parameter of a Poisson, or the parameter of an exponential, or the parameter of a Bernoulli. But, for example, really, we're going to be interested in estimating mu and sigma squared for the normal.

So here are some statistical models. All right? So I'm going to go through them with you. So if I tell you I observe-- I'm interested in understanding-- I'm still [INAUDIBLE] I'm interested in understanding the proportion of people who kiss by bending their head to the right. And for that, I collected n observations. And I'm interested in making some inference in the statistical model. My question to you is, what is the statistical model? Well, if you want to write the statistical model, you're going to have to write this E-- oh, sorry, I never told you what E was. OK, well, actually, let's just go to the examples, and then you'll know what E is. So you're going to have to write to me an E and a P theta, OK?

So let's start with the Bernoulli trials. So this E here is called the sample space. And in normal people's words, it just means the space, or the set, in which X lives-- and, back to your question, X is just a generic observation. OK, and hopefully, this is the smallest one you can think of. OK, so for example, for Bernoulli trials, I'm going to observe a sequence of 0's and 1's. So my experiment-- as written on the board-- is going to have sample space {0, 1}. And then the probability distributions are going to be, well, it's just going to be the Bernoulli distributions indexed by p, right? So rather than writing P sub p, I'm going to write it as Bernoulli p, because it's clear what I mean when I write that. Is everybody happy? Actually, I need to tell you something more. This is a family of distributions, so I need p. And maybe I don't want to have a p that takes value 0 or 1, right? It doesn't make sense. I would probably not look at this problem if I anticipated that everybody would kiss to the right, or everybody would kiss to the left.
So I am going to assume that p is in (0, 1), excluding 0 and 1. OK? So that's the statistical model for a Bernoulli trial.

OK, now the next one, what do we have? Exponential. OK? OK, so when I have exponential distributions, what is the support of the exponential distribution? What values is it going to take? 0 to infinity, right? So what I have is that my sample space is the values that my random variables can take. So it's-- well, actually, I can remove the 0 again-- 0 to plus infinity. And then the family of distributions that I have are exponential with parameter lambda. And again, maybe you've seen me switching from p, to lambda, to theta, to mu, to sigma squared. Honestly, you can do whatever you want. But it's just that it's customary to have this particular group of letters. OK? And so the parameters of an exponential are just positive numbers. OK? And that's my exponential model.

What is the third one? Can somebody tell me? Poisson, OK? OK, so Poisson-- is a Poisson random variable discrete or continuous? Go back to your probability. All right, so the answer being the opposite of continuous-- good job. All right, so it's going to be-- what values can a Poisson take? All the natural integers, right? So 0, 1, 2, 3, all the way to infinity. We don't have any control of this. So I'm going to write this as N without 0. I think in the slides, it's N-star, maybe. Actually, no, it can take value 0. I'm sorry. This actually takes value 0 quite a lot. That's typically, in many instances, actually the mode. So it's N, and then I'm going to write it as Poisson with parameter-- well, here it's again lambda as the parameter. And lambda can take any positive value. OK?

And that's where you can actually see the model that we had for the siblings-- right? So let me actually just squeeze in the siblings model here. So that was the bad model that I had in the first place, when I actually kept this. Let's say we just kept it at 7. Forget about larger than or equal to 7. We just assumed it was 7. What was our sample space? We said 7. So it's 1, 2, to 7, right? Those were the possible values that this thing would take. And then what was my-- what's my parameter space? So it's going to be a nightmare to write, but I'm going to write it. OK, so I'm going to write it as something like, the probability that X is equal to k is equal to p sub k. OK? And that's going to be for p-- OK, so that's for all k's, right? Or for k equal 1 to 7. And here the index is the set of parameters p1 to p7. And I know a little more about those guys, right? I know they are going to be non-negative-- p_j non-negative. And I know that they sum to 1.

OK, so maybe writing this, you start seeing why we like those Poisson, exponential, and the short notation, because I actually don't have to write the PMF of a Poisson. The Poisson is really just this. But I call it Poisson so I don't have to rewrite this all the time. And so here, I did not use a particular form.
So I just have this thing, and that's what it is. The set of parameters is the set of non-negative numbers p1 to p7 that sum to 1, right? And so this is just a list of numbers that are non-negative and sum up to 1. So that's my parameter space. OK? So here, that's my theta. This whole thing here-- this is my capital Theta. OK? So that's just the set of parameters-- the set of values that theta is allowed to take.

OK, and finally, we're going to end with the star of all, and that's the normal distribution. And in the normal distribution, you still have also some flexibility in terms of choices, because then, naturally, the normal distribution is parametrized by two parameters, right? Mean and variance. So what values can a Gaussian random variable take? The entire real line, right? And the set of parameters that it can take-- so this is going to be N(mu, sigma squared). And mu is going to be positive-- and sigma squared is going-- sorry, mu is going to be in R, and sigma squared is going to be positive. OK, so again, here, that's the way you're supposed to write it. If you really want to identify what Theta is, well, Theta formally is the set of (mu, sigma squared) such that-- well, it's in R times (0, infinity), right? That's just to be formal, but this does the job just fine. OK? You don't have to be super formal.

OK, that's not three. That's like five. Actually, I just want to write another one. Let's call it 5-bis. And 5-bis is just Gaussian with known variance. And this arises a lot in labs, when you have measurement error-- when you actually receive your measurement device, this thing has been tested by the manufacturer so much that it actually comes on the side of the box. It says that the standard deviation of your measurements is going to be 0.23. OK, and actually, why they do this is because they can brag about accuracy, right? That's how they sell you this particular device. And so you actually know exactly what sigma squared is. So once you actually get your data in the lab, you actually only have to estimate mu, because sigma comes on the label.

So now, what is your statistical model? Well, the numbers I'm collecting are still in R. But now, the model that I have is N(mu, sigma squared). But the parameter space is not mu in R and sigma positive. It's just mu in R. And to be a little more emphatic about this, this is enough to describe it, right? Because if sigma is the sigma that was specified by the manufacturer, then this is the sigma you want. But you can actually write sigma squared is equal to sigma squared manufacturer. Right? You can just fix it to be this particular value. Or maybe you don't want to write that index that says "manufacturer." And so you just say, well, the sigma-- when I write sigma squared, what I mean is the sigma squared from the manufacturer. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah. For a particular measuring device?
You know, you're in a lab, and you have some measuring device-- I don't know, something that measures tensile strength of something. And it's just going to measure something, and it will naturally make errors. But it's been tested so much by the manufacturer and calibrated by them. They know it's not going to be perfect, but they know exactly what error it was making, because they've actually tried it on things for which they exactly knew what the tensile strength was. OK? Yeah.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: This?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, like, that's pointing to-- 5 prime? OK? And we can come up with other examples, right? So, for example, here's another one. So the names don't really matter, right? I call it the siblings model, but you won't find the siblings model in the textbook, right? So I wouldn't worry too much. But, for example, let's say you have something-- so let's call it 6. You have-- I don't know-- a truncated-- and that's a name I just came up with, but it's actually not exactly describing what I want. But let's say I observe X, which is the indicator of Y larger than, say, 5, where Y follows some exponential with parameter lambda. OK? This is what I get to observe. I only observe whether my waiting time was more than five minutes, because I see somebody coming out of the Kendall Station being really upset. And that's all I record-- that I've been waiting for more than five minutes. And that's all I get to record. OK? That happens a lot. These are called censored data. I should probably not call it truncated-- this should be censored. OK?

You see a lot of censored data when you ask people how much they make. They say, well, more than five figures. And that's all they want to tell you. OK? And so you see a lot of censored data in survival analysis, right? You are trying to understand how long your patients are going to live after some surgery, OK? And maybe you're not going to keep track of people, and you're not going to actually be in touch with their family every day and ask them, is the guy still alive? And so what you can do is just-- you ask people maybe five years after your study and say, please, come in. And you will just happen to have some people say, well, you know, the person is deceased. And you will only be able to know that the person died less than five years ago. But you only see what happens after that, OK? And so this is this truncated and censored data. It happens all the time, just because you don't have the ability to do better than that.

So this could happen here. So what is my statistical experiment, right? So here, I should probably write this like this, because I just told you that my observations are going to be X, but there is some unknown Y. I will never get to see this Y. I only get to see the X. What is my statistical experiment? Please help me. So is it the real line? My sample space-- is it the real line? Sorry, who does not know what this means? I'm sorry. OK.
So this is called an indicator. So I read it as-- if I write it well, that would be a one with a double bar. You can also write I if you prefer, if you don't feel like writing a one with double bars. And it's "one of," say-- I'm going to write it like that-- 1 of A is equal to 1 if A is true, and 0 if A is false. OK? So that means that if Y is larger than 5, this thing is 1. And if Y is not larger than 5, this thing is 0. OK. So that's called an indicator-- an indicator function. It is very useful to just turn anything into a 0 or 1.

So now that I'm here, what is my sample space? {0, 1}. Well, whatever values this thing I did not tell you about takes-- if I end up telling you that it takes values 6 or 7, then that would be your sample space, OK? OK, so it takes values 0, 1. And then what is the probability here? What should I write here? What should you write without even thinking? Yeah. So let's assume there's two seconds before the end of the exam. You're going to write Bernoulli. And that's when you're going to start checking if I'm going to give you extra time, OK? So you write Bernoulli without thinking, because it's taking values 0, 1. So you just write Bernoulli, but you still have to tell me what possible parameters this thing is taking, right? So I'm going to write it p, because I don't know. And then p takes values-- OK, so, sorry, I could write it like that. Right? That would be perfectly valid, but actually, no-- there's more. It's not any p. The p is the probability that an exponential lambda is larger than 5. And maybe I want to have lambda as the parameter.

OK, so what I need to actually compute is, what is the probability that Y is larger than 5, when Y is this exponential lambda? Which means that what I need to compute is the integral between 5 and infinity of-- what is it? 1 over lambda? How did I define it in this class? Did I change it-- what?

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: Yeah, right, right, right. Yeah. Lambda e to the minus lambda x, dx, right? So that's what I need to compute. What is this? Yeah, so what is the value of this integral? Can you take appropriate measures?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: OK? And again, you can cancel this, right? So when I'm going to integrate this guy, those guys are going to cancel. I'm going to get 0 for infinity. I'm going to get a 5 for this guy. And, well, I know it's going to be a positive number, so I'm not really going to bother with the signs, because I know that's what it should be. OK, so I get e to the minus 5 lambda. And so that means that I can actually write this like that-- and now parametrize this thing by lambda positive. OK? So what I did here is I changed the parametrization from p to lambda. Why? Well, because, maybe, if I know this is happening, maybe I am actually interested in reporting lambda to the MBTA, for example.
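Here is a minimal sketch of that change of parametrization in action: simulate censored observations X = 1{Y > 5} with Y exponential, estimate p = e^(-5 lambda) by the sample proportion, and invert it. The plug-in inversion lambda_hat = -log(p_hat)/5 is my own illustration of the idea, not a formula written in the lecture, and the data are simulated.

import math
import random

random.seed(1)
true_lambda = 0.4

# Censored data: we only record whether the exponential waiting time Y
# exceeded 5 minutes, i.e. X = 1{Y > 5}, and P(X = 1) = exp(-5 * lambda).
xs = [1 if random.expovariate(true_lambda) > 5 else 0 for _ in range(10000)]

p_hat = sum(xs) / len(xs)          # estimates exp(-5 * lambda)
lambda_hat = -math.log(p_hat) / 5  # invert the reparametrization
print(p_hat, lambda_hat)           # lambda_hat should land near 0.4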
Maybe I'm actually trying to estimate 1 over lambda, so that I know-- well, lambda is actually the intensity of arrival of my Poisson process, right? I have a Poisson process; that's how my trains are coming in. And so I'm interested in lambda, so I will parametrize things by lambda. So the thing I get is lambda. You can play with this, right? I mean, I could parametrize this by 1 over lambda and put 1 over lambda here if I wanted. But, you know, the context of your problem will tell you exactly how to parametrize this. OK?

So what else did I want to tell you? OK, let's do a final one. By the way, are you guys OK with Poisson, exponential, Bernoulli-- I don't know, binomial, normal-- all these things? I'm not going to go back to them, but I'm going to use them heavily. So just spend five minutes on Wikipedia if you forgot what those things are. Usually, you must have seen them in your probability class, so they should not be crazy names. And again, I'm not expecting you to-- I don't remember what the density of an exponential is, so it would be pretty unfair of me to actually ask you to remember what it is. Even for the Gaussian, I don't expect you to remember what it is. But I want you to remember that if I add 5 to a Gaussian, then I have a Gaussian with mean mu plus 5-- and similarly if I multiply it by something, right? You need to know how to operate on those things. But knowing complicated densities is definitely not part of the game. OK?

So let's do a final one. I don't know what number I have now. I'm going to just do uniform. That's another one. Everybody knows what uniform is? So it's uniform, right? So I'm going to have X-- my observations are going to be uniform on the interval [0, theta], right? So if I want to define a uniform distribution for a random variable, I have to tell you which interval, or which set, I want it to be uniform on. And so here I'm telling you it's the interval [0, theta]. And so what is going to be my sample space?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I'm sorry? 0 to theta. And then what is my probability distribution? My family of parameters? So, well, I can write it like this, right? Uniform theta, right? And theta, let's say, is positive. Can somebody tell me what's wrong with what I wrote? This makes no sense. Tell me why. Yeah? Yeah, this set depends on theta, and why is that a problem?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: There is no theta. Right now, there's a family of thetas. Which one did you pick here? Right, this is just something that's indexed by theta, but I could have very well written it as, you know-- just not being Greek for a second-- I could have just written this as t rather than theta. That would be the same thing. And then what the hell is theta? There's no such thing as theta. We don't know what the parameter is. This parameter should move with every one. And so that means that I actually am not allowed to pick this theta.
I'm actually-- just for the reason that there is no parameter to put on the left side-- there should not be one, right? So you just said, well, there's a problem because the parameter is on the left-hand side. But there's not even a parameter. I'm describing the family of possible parameters. There is no single one that you can actually plug in. So this should really be 1. And I'm going to go back to writing this as theta, because that's pretty standard. Is that clear for everyone? I cannot just pick one and put it in there and just take the-- before I run my experiments, I could potentially get numbers that are all the way up to 1, because I don't know what theta is going to be ahead of time. Now, if somebody promised to me that theta was going to be less than 0.5, that would be-- sorry, why do I put 1 here? I could put theta between 0 and 1. But if somebody is going to promise me, for example, that theta is going to be less than 1, then you expect to put [0, 1]. All right? Is that clear?

OK, so now you know how to answer the question-- what is the statistical model? And again, within the scope of this class, you will not be asked to just come up with a model-- I will just tell you. Poisson would probably be a good idea here. And then you would just have to trust me that indeed it would be a good idea.

All right, so what I started talking about 20 minutes ago-- so I was definitely getting ahead of myself-- is the notion-- so that's when I was talking about well-specified. Remember, well-specified says that the true distribution is one of the distributions in this parametric family of distributions. The true distribution of my siblings is actually a Poisson with some parameter, and all I need to figure out is what this parameter is. When I started saying that, I said, well, but then it could be that there are several parameters that give me the same distribution, right? It could be the case that Poisson 5 and Poisson 17 are exactly the same distribution when I start putting those numbers in the formula, which I erased, OK? So it could be the case that two different numbers would give me exactly the same probabilities. And in this case, we say that the model is not identifiable-- I mean, the parameter is not identifiable. I cannot identify the parameter, even if you actually gave me an infinite amount of data, which means that I could actually estimate exactly the PMF. I might not be able to go back, because there would be several candidates, and I would not be able to tell you which one it was in the first place. OK?

So what we want is that this function-- theta maps to P theta-- is injective. And that can sound fancy. What I really mean is that if theta is different from theta prime, then P of theta is different from P of theta prime. Or, if you prefer to think about the contrapositive of this, this is the same as saying that if P theta gives me the same distribution as P theta prime, then that implies that theta must be equal to theta prime. Those two statements are logically equivalent, right? So that's what this means.
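Written out in symbols, the requirement just described is:

\theta \mapsto \mathbb{P}_\theta \ \text{is injective}
\quad\Longleftrightarrow\quad
\big(\theta \neq \theta' \implies \mathbb{P}_\theta \neq \mathbb{P}_{\theta'}\big)
\quad\Longleftrightarrow\quad
\big(\mathbb{P}_\theta = \mathbb{P}_{\theta'} \implies \theta = \theta'\big)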
So this is-- we say that the parameter is identifiable, or identified-- it doesn't really matter-- in this model. And this is something we're going to want. OK? So in all the examples that I gave you, those parameters are completely identified. Right? If I tell you-- I mean, if those things are in your probability book, it means that they were probably thought through, right? So when I say exponential lambda, I'm really talking about one specific distribution, and there's not another lambda that's going to give you exactly the same distribution. OK, so that was the case. And you can check that, but it's a little annoying, so I would probably not do it. But rather than doing this, let me just give you some examples where it would not be the case.

Again, here's an example. If I take X-- so now I'm back to just using this indicator function, but now for a Gaussian. So what I observe is, X is the indicator that Y is-- what did we say? Positive. OK? So this is a Bernoulli random variable, right? And it has some parameter p. But p now is going to depend-- sorry, and here Y is N(mu, sigma squared). So the p-- the probability that this thing is positive-- is actually-- I don't think I put the 0. Oh, yeah, because I have mu. OK, so this distribution-- this p, the probability that it's positive, is just the probability that some Gaussian is positive. And it will depend on mu and sigma, right? Because if I draw a 0, and I draw my Gaussian around mu, then the probability of this Bernoulli being 1 is really the area under the curve here. Right? And this thing-- well, if mu is very large, it's going to become very large. If mu is very small, it's going to become very small. And if sigma changes, it's also going to affect it. Is that clear for everyone?

But we can actually compute this, right? So the parameter p that I'm looking for here, as a function of mu and sigma, is simply the probability that some Y is non-negative, which is the probability that Y minus mu, divided by sigma, is larger than minus mu divided by sigma. But when you studied probability, is that some operation you were used to making? Removing the mean and dividing by the standard deviation? What is the effect of doing that on a Gaussian random variable? Yeah, so you normalize it, right? You standardize it. You make it a standard Gaussian. You remove the mean, so it becomes mean 0. And you divide by the standard deviation, so the variance becomes 1. So when you have a Gaussian, remove the mean and divide by the standard deviation, it becomes a standard Gaussian-- which means this thing has an N(0, 1) distribution, which is the one you can read the quantiles of at the end of the book. Right? And that's exactly what we did. OK?

So now you have the probability that some standard Gaussian exceeds negative mu over sigma, which I can write in terms of the cumulative distribution function, capital Phi-- like we did in the first lecture. So if I do this cumulative distribution function, what is this probability in terms of Phi? [INAUDIBLE]?

AUDIENCE: [INAUDIBLE].
PHILIPPE RIGOLLET: Well, that's what your name tag says. 1 minus--

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: 1 minus mu over sigma. What happened with Phi? Do you think I defined this for fun? 1 minus Phi of minus mu over sigma, right? Right? Because this is 1 minus the probability that it's less than this. And this is exactly the definition of the cumulative distribution function. So in particular, this thing only depends on mu over sigma. Agreed? So in particular, if I had 2 mu over 2 sigma, p would remain unchanged. If I have 12 mu over 12 sigma, this thing would remain unchanged, which means that p does not change if I scale mu and sigma by the same factor. So there's no way, just by observing X-- even an infinite number of times, so that I can actually get exactly what p is-- that I'm ever going to be able to get mu and sigma separately. All I'm going to be able to get is mu over sigma.

So here, we say that mu, sigma-- the parameter (mu, sigma)-- or actually each of them individually-- those guys-- they're not identifiable. But the parameter mu over sigma is identifiable. So if I wanted to write a statistical model in which the parameter is identifiable-- I would write {0, 1}, Bernoulli, and then I would write 1 minus Phi of minus mu over sigma. And then I would take two parameters, which are mu in R and sigma squared positive-- so let's write sigma positive. Right? No, this is not identifiable. I cannot write those two guys as two different things. Instead, what I want to write is {0, 1}, Bernoulli 1 minus-- and now my parameter-- I forgot this-- my parameter is mu over sigma. Can somebody tell me where mu over sigma lives? What values can this thing take? Any real value, right?

OK, so now, I've done this definitely out of convenience, right? Because that was the only thing I was able to identify-- the ratio mu over sigma. But it's still something that has some meaning. It's the normalized mean. It really tells me what the mean is compared to the standard deviation. So in some models, in reality, in some real applications, this actually might have a good meaning. It's just telling me how big the mean is compared to the standard fluctuations of this model. But I won't be able to get more than that. Agreed? All right?

So now that we've set up a parametric model, let's try to see what our goals are going to be. OK? So now we have a sample and a statistical model. And we want to estimate the parameter theta. And I could say, well, you know what? I don't have time for this analysis. Collecting data is going to take me a while. So I'm just going to-- mmm-- and I'm going to say that mu over sigma is 4. And I'm just going to give it to you. And maybe you will tell me, yeah, it's not very good, right? So we need some measure of performance of a given parameter. We need to be able to evaluate if eyeballing the problem is worse than actually collecting a large amount of data.
64:13 We need to know, even if I come up with an estimator that 64:16 actually sort of uses the data, does it 64:18 make an efficient use of the data? 64:20 Would I actually need 10 times more observations 64:22 to achieve the same accuracy? 64:24 To be able to answer these questions, 64:25 well, I need to define what accuracy means. 64:28 And accuracy is something that sort of makes sense. 64:30 It says, well, I want theta hat 64:31 to be close to theta. 64:33 And theta hat is a random variable. 64:35 So I'm going to have to understand 64:36 what it means for a random variable 64:38 to be close to a deterministic number. 64:40 And so, what is an estimator of the parameter, right? 64:44 So I have an estimator, and I said it's a random variable. 64:46 64:49 And the formal definition-- 64:51 64:59 so an estimator is a measurable function of the data. 65:10 So when I write theta hat, and that 65:12 will typically be my notation for an estimator, right? 65:18 I should really write theta hat of x1, ..., xn. 65:24 OK? 65:25 That's what an estimator is. 65:26 If you want to know what an estimator is, 65:28 it is a measurable function of the data. 65:30 And it's actually also known as a statistic. 65:35 65:37 And you know, 65:39 I see it every time I have, 65:43 you know, a dinner with normal people, 65:47 and I say I'm a statistician. 65:48 Oh, yeah, I really like baseball. 65:50 And they talk to me about batting averages. 65:53 That's not what I do. 65:54 But for them, that's what it is, and that's 65:55 because in a way, that's what a statistic is. 65:58 A batting average is a statistic. 66:00 OK, and so here are some examples. 66:02 You can take the average xn bar. 66:04 You can take the maximum of your observations. 66:06 That's a statistic. 66:07 You can take the first one. 66:08 You can take the first one plus log of 1 66:10 plus the absolute value of the last one. 66:12 You can do whatever you want-- that will be an estimator. 66:15 Some of them are clearly going to be bad. 66:17 But that's still a statistic, and you can do this. 66:20 Now, when I say measurable, I always have-- 66:24 so you know, graduate students sometimes 66:26 ask me like, yeah, how do I know if this estimator is measurable 66:28 or not? 66:29 And usually, my answer is, well, if I give you data, 66:31 can you compute it? 66:32 And they say, yeah, and I'm like, well, then it's measurable. 66:35 That's a very good rule to check 66:38 if something is actually measurable. 66:40 When is this thing non-measurable? 66:42 It's when it's implicitly defined. 66:44 OK, and in particular, the things 66:46 that give you problems are-- 66:48 66:52 sup or inf. 66:53 Anybody knows what a sup or an inf is? 66:55 It's like a max or a min. 66:57 But it's not always attained. 66:59 OK, so 67:02 if I look at the infimum of the function 67:06 f of x for x on the real line-- sorry, 67:11 let's say for x in the interval from 1 to infinity-- 67:13 and f of x is equal to 1 over x. 67:16 Right? 67:18 Then the infimum is the smallest value 67:20 it can take, except that it doesn't really 67:22 take the value 0, right? Because 1 over x is going to 0, 67:28 but it's never really getting there. 67:30 So we just call the inf 0. 67:32 But it's not a value that it ever takes. 67:34 And these things might actually be complicated to compute. 67:37 And so that's when you actually have problems, right? 67:40 When the limit is not-- 67:41 you're not really quite reaching the limit.
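As a small numerical aside on the infimum example just given (a sketch added here, not taken from the lecture boardwork): on the interval from 1 to infinity, f(x) = 1/x has infimum 0, but no point ever attains that value.

    # The values of 1/x only approach the infimum 0; they never equal it.
    for x in (1, 10, 1_000, 1_000_000):
        print(x, 1 / x)   # 1.0, 0.1, 0.001, 1e-06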
67:44 You won't have this problem in general, but just so you know, 67:47 an estimator is not just anything. 67:48 It has to actually be measurable. 67:51 OK, so the first thing we want to know-- I mentioned it-- 67:54 so an estimator is a statistic which does not depend on theta, 67:57 of course. 67:58 So if I give you the data, you have to be able to compute it. 68:01 And that should not require knowing any unknown 68:04 parameters. 68:06 OK, so an estimator is said to be consistent 68:11 when-- as I collect more and more data-- this thing 68:13 is getting closer and closer to the true parameter. 68:16 All right? 68:16 And we said that eyeballing and saying that it's going to be 4 68:20 is not really something that's probably 68:21 going to be consistent. 68:22 But you can have things that are consistent 68:24 but that converge to theta at different speeds. 68:28 OK? 68:29 And we know also that this is a random variable. 68:32 It converges to something. 68:33 And there might be some different notions 68:35 of convergence that kick in. 68:36 And actually there are. 68:38 And we say that it's weakly consistent if it converges 68:40 in probability and strongly consistent 68:43 if it converges almost surely. 68:46 OK? 68:46 And this is just vocabulary. 68:48 It won't make a big difference. 68:50 OK? 68:51 So we will typically say it's consistent with any of the two. 68:56 AUDIENCE: [INAUDIBLE]. 68:57 69:02 PHILIPPE RIGOLLET: Well, so in parametric statistics, 69:07 it's actually a little difficult to come up with. 69:09 But in non-parametric ones, I could just say, if I had xi, 69:15 yi, and I know that yi is f of xi plus some noise epsilon i. 69:24 And I know that f belongs to some class of functions, 69:26 let's say-- 69:27 [INAUDIBLE] class of smooth functions-- it's massive. 69:31 And now, I'm going to actually find the following estimator. 69:33 I'm going to take the average. 69:35 So I'm going to do least squares, right? 69:36 69:40 So I just check-- 69:41 I'm trying to minimize the distance of each of my f of xi 69:44 to my yi. 69:45 And now, I want to find the smallest of them. 69:49 So if I look at the infimum here, then the question is-- 69:56 so that could be-- 69:57 well, that's not really an estimator for f. 69:59 But it's an estimator for the smallest possible value. 70:02 And so for example, this is actually 70:04 an estimator for the variance, sigma squared. 70:07 This might not be attained, and this might not 70:09 be measurable if the class f is massive. 70:13 All right, so that's the infimum over some class of functions f. 70:16 OK? 70:18 So those are always things that are defined implicitly. 70:20 If it's an average, for example, it's completely measurable. 70:24 OK? 70:27 Any other question? 70:28 70:31 OK, so we know the first thing we might want to check-- 70:37 and that's definitely something we want about estimators-- 70:40 is that they are consistent, because all consistency tells 70:43 us is that, as I collect more and more data, 70:45 my estimator is going to get closer 70:47 and closer to the parameter. 70:51 There's other things we can look at. 70:52 For each possible value of n-- now, right now, 70:55 I have a finite number of observations-- 71:00 25. 71:01 And I want to know something about my estimator. 71:04 The first thing I want to check is maybe, on average, right? 71:08 So this is a random variable. 71:09 Is this random variable, on average, 71:11 going to be close to theta or not?
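One way to get a feel for this question is a quick simulation, sketched below under assumed values (Bernoulli data with p = 0.3, sample size n = 50, and numpy as the tool; none of these choices come from the lecture): across many repetitions of the experiment, both the sample average and the first observation land on p on average, but their fluctuations around p are very different.

    # Repeat the experiment many times and compare two estimators of p.
    import numpy as np

    rng = np.random.default_rng(0)
    p, n, reps = 0.3, 50, 20_000
    samples = rng.binomial(1, p, size=(reps, n))
    xbar = samples.mean(axis=1)    # one value of the sample average per experiment
    x1 = samples[:, 0]             # one value of the first observation per experiment
    print(xbar.mean(), x1.mean())  # both close to p = 0.3: same behavior on average
    print(xbar.std(), x1.std())    # roughly sqrt(p(1-p)/n) versus sqrt(p(1-p))

The first print line answers the question in the affirmative for both estimators; the second hints at why being right on average will not be the whole story.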
71:14 And so the difference-- how far I am from theta-- 71:17 is actually called the bias. 71:20 So the bias of an estimator is the expectation of theta hat 71:28 minus the value that I hope it gets, which is theta. 71:31 If this thing is equal to 0, we say that theta hat is unbiased. 71:38 71:42 And unbiased estimators are things that people 71:44 are looking for in general. 71:46 The problem is that there's lots of unbiased estimators. 71:49 And so it might be misleading to look for unbiasedness 71:52 when that's not really the only thing 71:54 you should be looking for. 71:55 OK, so what does it mean to be unbiased? 71:58 Maybe for this particular round of data 72:00 you collected, you're actually pretty far 72:02 from the true parameter. 72:04 But one thing that actually-- 72:08 what it means is that if I redid this experiment over, and over, 72:12 and over again, and I averaged all the values of my estimators 72:16 that I got, then this would actually be the right-- 72:19 the true parameter. 72:21 OK. 72:21 That's what it means. 72:22 If I were to repeat this experiment, 72:25 on average, I would actually get the right thing. 72:27 But you don't get to repeat the experiment. 72:30 OK, just a remark about estimators, 72:33 look at this estimator-- xn bar. 72:34 Right? 72:35 Think of the kiss example. 72:36 I'm looking at the average of my observations. 72:39 And I want to know what the expectation of this thing is. 72:41 72:44 OK? 72:45 Now, by linearity of the expectation, 72:56 this guy is the average of the expectations, right? 72:59 But my data is identically distributed. 73:03 So in particular, all the xi's have the same expectation, 73:07 right? 73:09 Everybody agrees with this. 73:10 When it's identically distributed, 73:12 they all have the same expectation. 73:14 So what it means is that these guys here-- 73:17 they're all equal to the expectation of x1. 73:22 Right? 73:23 So what it means is that these guys-- 73:25 I have the average of the same number. 73:28 So this is actually the expectation of x1. 73:31 OK? 73:32 And it's true. 73:33 In the kiss example, this was p. 73:36 And this is p-- 73:37 73:40 the probability of turning your head to the right. 73:43 OK? 73:43 So those two things are the same. 73:45 In particular, that means that xn bar and just x1 73:50 have the same bias. 73:54 So that should probably illustrate to you 73:56 that bias is not something that really is telling you 73:59 the entire picture, right? 74:02 I can take only one of my observations-- 74:05 Bernoulli 0, 1. 74:06 This thing will have the same bias 74:07 as if I average 1,000 of them. 74:10 But the bias is really telling you where I am on average. 74:13 It's really not telling me what fluctuations I'm getting. 74:16 And so if you want to start having fluctuations coming 74:18 into the picture, we actually have 74:20 to look at the risk or the quadratic risk 74:22 of the estimator. 74:23 And so the quadratic risk is defined as 74:25 the expectation of the squared distance between theta hat 74:28 and theta. 74:30 OK? 74:33 So let's look at this. 74:34 74:42 So the quadratic risk-- 74:43 74:47 sometimes people 74:48 call it the l2 risk of theta hat, of course. 74:57 I'm sorry for maintaining such an ugly board. 74:59 [INAUDIBLE] this stuff. 75:00 75:09 OK, so I look at the squared distance 75:10 between theta hat and theta. 75:12 This is a function of a random variable. 75:14 So it's a random variable as well. 75:16 And now I'm looking at the expectation of this guy.
75:19 That's the definition. 75:23 I claim that when this thing goes to 0, then 75:25 my estimator is actually going to be consistent. 75:28 Everybody agrees with this? 75:30 75:37 So if it goes to zero as n goes to infinity-- and here, 75:47 I don't need to tell you what kind of convergence I have, 75:50 because this is just a number, right? 75:51 It's an expectation. 75:52 So it's a regular, usual calculus-style convergence. 75:57 Then that implies that theta hat is actually weakly consistent. 76:03 76:07 What did I use to tell you this? 76:09 76:14 Yeah, this is the convergence in L2. 76:17 This actually is strictly equivalent. 76:19 This is by definition saying that theta hat converges in L2 76:26 to theta. 76:29 And we know that convergence in L2 76:31 implies convergence in probability to theta. 76:37 That was the picture. 76:38 We're going up. 76:40 And this is actually, by definition, equivalent to consistency-- 76:42 weak consistency. 76:46 OK, so this is actually telling you a little more, 76:48 because these guys here-- 76:50 they are both unbiased. 76:52 Xn bar is unbiased. 76:55 X1 is unbiased. 76:56 But x1 is certainly not consistent, 76:58 because the more data I collect, I'm not even doing anything 77:01 with it. 77:01 I'm just taking the first data point you're giving to me. 77:04 So they're both unbiased. 77:05 But this one is not consistent. 77:07 And this one we'll see is actually consistent. 77:09 xn bar is consistent. 77:11 And actually, we've seen that last time. 77:14 And that's because of the? 77:15 77:19 What guarantees the fact that xn bar is consistent? 77:23 AUDIENCE: The law of large numbers. 77:25 PHILIPPE RIGOLLET: The law of large numbers, right? 77:26 Actually, it's strongly consistent 77:27 if you have a strong law of large numbers. 77:29 OK, so just in the last two minutes, 77:35 I want to tell you a little bit about how this risk is linked 77:39 to the bias and the variance-- you'll see, the quadratic risk is equal to the bias 77:43 squared plus the variance. 77:44 So let's see what I mean by this. 77:48 So I'm going to forget about the absolute values-- 77:50 since we have a square, 77:50 I don't really need them. 77:54 If theta hat were unbiased, this theta 77:57 would be the expectation of theta hat. 78:01 It might not be the case. 78:02 So let me see how I can actually-- put the bias in there. 78:06 Well, one way to do this is to see 78:07 that theta hat minus theta is equal to theta 78:10 hat minus the expectation of theta hat, 78:13 plus the expectation of theta hat minus theta. 78:17 78:21 OK? 78:22 I just removed and added the same thing. 78:24 So I didn't change anything. 78:27 Now, this guy is my bias, right? 78:29 78:32 So now let me expand the square. 78:34 So what I get is the expectation of the square of theta 78:37 hat minus its expectation. 78:39 78:42 I should put some square brackets-- 78:45 plus two times the cross-product. 78:50 So the cross-product is the expectation 78:52 of theta hat minus the expectation of theta hat, times 78:59 the expectation of theta hat minus theta. 79:03 79:07 And then I have the last square. 79:08 79:17 The expectation of theta hat minus theta, squared. 79:22 OK? 79:24 So square, cross-product, square. 79:27 Everybody is with me? 79:29 Now this guy here-- 79:32 if you pay attention, this thing is the expectation 79:35 of some random variable. 79:36 So it's a deterministic number. 79:38 Theta is the true parameter. 79:39 It's a deterministic number.
79:41 So what I can do is pull this entire thing out 79:44 of the expectation and compute the expectation only 79:52 with respect to that part. 79:53 But what is the expectation of this thing? 79:56 79:59 It's zero, right? 80:00 The expectation of theta hat minus the expectation 80:02 of theta hat is 0. 80:03 So this entire thing is equal to 0. 80:07 So now when I actually collect back my quadratic terms-- 80:12 my two squared terms in this expansion-- 80:15 what I get is that the expectation 80:18 of theta hat minus theta squared is 80:21 equal to the expectation of theta hat minus the expectation 80:26 of theta hat squared, plus the square of the expectation 80:32 of theta hat minus theta. 80:35 80:40 Right? 80:41 So those are just the two-- 80:42 the first and the last term of the previous equality. 80:46 Now, here I have the expectation of the square 80:48 of the difference between a random variable 80:50 and its expectation. 80:52 This is otherwise known as the variance, right? 80:56 So this is actually equal to the variance of theta hat. 81:03 And well, this was the bias. 81:05 We already said that's there. 81:07 So this whole thing is the bias squared. 81:09 81:12 OK? 81:13 And hence the quadratic risk is the sum 81:15 of the variance and the squared bias. 81:18 Why squared bias? 81:18 Well, because otherwise, you would be adding dollars 81:21 to dollars squared. 81:22 So you need to add dollars squared to dollars 81:24 squared so that this thing is actually homogeneous. 81:27 So if x is in dollars, then the bias is in dollars, 81:30 but the variance is in dollars squared. 81:32 OK, and the square here forces you to put everything 81:35 on the squared scale. 81:36 All right, so what's nice is that if the quadratic risk goes 81:39 to 0, then since I have the sum of two non-negative terms, 81:42 both of them have to go to 0. 81:45 That means that my variance is going to 0-- 81:46 very little fluctuations. 81:48 And my bias is also going to 0, which means that I'm actually 81:51 going to be on target once I reduce 81:53 my fluctuations, because it's one thing to reduce 81:55 the fluctuations. 81:56 But if I'm not on target, it's an issue, right? 81:58 For example, the estimator that always guesses the value 4 has no variance. 82:03 Every time I'm going to repeat the experiment, 82:05 I'm going to get 4, 4, 4, 4-- 82:07 the variance is 0. 82:08 But the bias is bad. 82:10 The bias is 4 minus theta. 82:12 And if theta is far from 4, that's not doing very well. 82:17 OK, so next week, 82:21 we'll talk about what makes a good estimator-- 82:25 how estimators change if they have 82:26 high variance or low variance, or high bias or low bias. 82:32 And we'll talk about confidence intervals as well. 82:35
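To close the loop on this decomposition, here is a hedged Monte Carlo sketch (assumed values p = 0.3 and n = 50 for Bernoulli data, with numpy as the tool; none of this is from the lecture). It checks that the quadratic risk equals the variance plus the squared bias for three estimators: the sample average, the first observation, and the constant guess 4.

    # Quadratic risk = variance + bias^2, checked by simulation for three estimators of p.
    import numpy as np

    rng = np.random.default_rng(1)
    p, n, reps = 0.3, 50, 50_000
    samples = rng.binomial(1, p, size=(reps, n))

    estimators = {
        "sample average": samples.mean(axis=1),
        "first observation": samples[:, 0].astype(float),
        "always guess 4": np.full(reps, 4.0),
    }
    for name, est in estimators.items():
        risk = np.mean((est - p) ** 2)       # E[(theta_hat - theta)^2]
        var = est.var()                      # variance of the estimator
        bias_sq = (est.mean() - p) ** 2      # squared bias
        print(name, risk, var + bias_sq)     # the last two columns agree

In theory the risks are p(1-p)/n = 0.0042 for the sample average, p(1-p) = 0.21 for the first observation, and (4 - p)^2 = 13.69 for the constant guess; only the first goes to 0 as n grows, matching the consistency discussion above.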