https://www.youtube.com/watch?v=mc1y8m9-hOM&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=20

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: We're talking about generalized linear models. And in generalized linear models, we generalize linear models in two ways. The first one is to allow for a different distribution for the response variables. And the distributions that we wanted were the exponential family. This is a family that can be defined for random variables that take values in R^q in general, with parameters in R^k. But we're going to focus on a very specific case, when y is a real-valued response variable, which is the one you're used to when you're doing linear regression, and the parameter theta also lives in R. And so we're going to talk about the canonical case. So that's the canonical exponential family, where you have a density f theta of y which is of the form: the exponential of y, which interacts with theta only by taking a product; then a term that depends only on theta; some dispersion parameter phi; and then some normalization factor-- let's call it c of y, phi. It really should not matter too much, so it's c of y, phi, and that's really just the normalization factor. And here, we're going to assume that phi is known.

I have no idea what I write. I don't know if you guys can read. I don't know what chalk has been used today, but I just can't see it. That's not my fault. All right, so we're going to assume that phi is known. And so we saw that several distributions that we know well, including the Gaussian for example, belong to this family. And there are other ones, such as Poisson and Bernoulli. So if the PMF has this form, if you have a discrete random variable, this is also valid. And the reason why we introduced this family is because there are going to be some properties that we know: this thing here, this function, b of theta, is essentially what completely characterizes your distribution. So if phi is fixed, we know that the interaction is of this form, and this really just comes from the fact that we want the function to integrate to one. So this b here in the canonical form encodes everything we want to know. If I tell you what b of theta is-- and of course, I tell you what phi is, but let's say for a second that phi is equal to one-- if I tell you this b of theta, you know exactly what distribution I'm talking about. So it should encode everything that's specific to this distribution, such as the mean, the variance, all the moments that you would want. And we'll see how we can compute the mean and the variance, for example, from this thing. So today, we're going to talk about likelihood, and we're going to start with the likelihood function, or the log likelihood, for one observation.
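[Note: in symbols, the canonical exponential family density described above, with y the response, \theta the canonical parameter, and \phi the known dispersion:

f_\theta(y) = \exp\!\left( \frac{y\,\theta - b(\theta)}{\phi} + c(y, \phi) \right). ]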
From this, we're going to do some computations, and then we'll move on to the actual log likelihood based on n independent observations. And here, as we will see, the observations are not going to be identically distributed, because we're going to want each of them, conditionally on x, to have its own theta, where theta is just a different function of x for each of the observations.

So remember, the log likelihood-- and this is for one observation-- is just the log of the density, right? And we have this identity that I mentioned at the end of the class on Tuesday. And this identity is just that the expectation of the derivative of this guy with respect to theta is equal to 0. So let's see why. If I take the derivative with respect to theta of log f theta of x, what I get is the derivative with respect to theta of f theta of x, divided by f theta of x. Now, if I take the expectation of this guy, with respect to this theta as well, what I get-- well, what is the expectation? It's just the integral against f theta. Or if I'm in a discrete case, I just have the sum against f theta, if it's a PMF. That's just the definition: the expectation of, say, h of x is either the integral of h of x times f theta of x, if x is continuous, or just the sum of h of x times f theta of x, if x is discrete. So if it's continuous, you put this integral instead of the sum. This guy is the same thing, right? So I'm just going to illustrate the case when it's continuous. So this is what? Well, this is the integral of the partial derivative with respect to theta of f theta of x, divided by f theta of x, all times f theta of x, dx. And now, this f theta cancels, so I'm actually left with the integral of the derivative, which I'm going to write as the derivative of the integral. But f theta being a density for any value of theta that I can take, this function, as a function of theta, is constantly equal to 1. For any theta that I take, it takes the value 1. So this is constantly equal to 1. I put three bars to say that for any value of theta, this is 1, which actually tells me that the derivative is equal to 0. OK, yes?

AUDIENCE: What is the first [INAUDIBLE] that you wrote on the board?

PHILIPPE RIGOLLET: That's just the definition of the derivative of the log of a function.

AUDIENCE: OK.

PHILIPPE RIGOLLET: Log of f, prime, is f prime over f. That's a log, yeah. Just by elimination.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I'm sorry?

AUDIENCE: When you write a squiggle that starts with an l, I assume it's a lambda.

PHILIPPE RIGOLLET: And you do good, because that's probably how my mind processes it. And so I'm like, yeah, l. Here is enough information. OK, everybody is good with this? So that was convenient. It just said that the expectation of the derivative of the log likelihood is equal to 0. That's going to be our first identity.
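[Note: the first identity, written out for the continuous case the lecture works with; the discrete case replaces the integral by a sum:

E_\theta\!\left[ \frac{\partial}{\partial\theta} \log f_\theta(Y) \right]
= \int \frac{\partial_\theta f_\theta(y)}{f_\theta(y)} \, f_\theta(y) \, dy
= \frac{\partial}{\partial\theta} \int f_\theta(y) \, dy
= \frac{\partial}{\partial\theta}\, 1 = 0. ]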
Let's move on to the second identity, using exactly the same trick, which is: let's hope that at some point, we have the integral of this function that's constantly equal to 1 as a function of theta, and then use the fact that its derivative is equal to 0. So if I start taking the second derivative of the log of f theta, what is this? Well, it's the derivative of this guy here, so I'm going to go straight to it. So it's the second derivative of f theta of x, times f theta of x, minus the first derivative of f theta of x, times the first derivative of f theta of x.

Here is some super important stuff-- no, I'm kidding. So you can still see that guy over there? So it's just the square. And then, I divide by f theta of x squared. So here I have the second derivative times f itself, and here, I have the product of the first derivative with itself, so that's the square. So now, I'm going to integrate this guy. So if I take the expectation of this thing here, what I get is the integral. The only thing that's going to happen when I take my integral is that one of the f thetas in the denominator is going to cancel against f theta, right? So I'm going to get the second derivative minus the first derivative squared, divided by f theta. And I know that this thing is equal to 0.

Now, one of these guys here-- sorry, why do I have-- so I have this guy here. So this guy here is going to cancel. So this is what this is equal to: the integral of the second derivative of f theta of x, because those two guys cancel, minus the integral of the first derivative squared, divided by f theta. And this is telling me what?

Yeah, I'm losing one, because I have some weird sequences. Thank you. OK, this is still positive. I want to say that this thing is actually equal to 0. But then, it gives me some weird things, which is that I have an integral of a positive function which is equal to 0. Yeah, that's what I'm thinking of doing. But I'm going to get 0 for this entire integral, which means that I have the integral of a positive function which is equal to 0, which means that this function is equal to 0, which sounds a little bad-- it basically tells me that this function, f theta, is linear. So I went a little too far, I believe, because I only want to prove that the expectation of the second derivative--

Yes, so I want to pull this out. So let's see, if I keep rolling with this, I'm going to get-- well, no, because the fact that it's divided by f theta means that, indeed, the second derivative is equal to 0. So it cannot do this here.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: OK, but let's write it like this. You're right, so this is what? This is the expectation of the square of the partial derivative with respect to theta of f theta of x, divided by f theta of x. And this is exactly the derivative of the log, right? So indeed, this thing is equal to the expectation with respect to theta of the square of the partial of log f theta with respect to theta.
All right, so this is the guy that I want, squared. This is one of the guys that I want. And this is actually equal-- so this will be equal to the expectation--

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, right, so this term should be equal to 0. This was not 0. You're absolutely right. So at some point, I got confused, because I thought putting this equal to 0 would mean that this is 0. But this thing is not equal to 0. So this term, you're right-- I take the same trick as before, and this term is actually equal to 0, which means that now I have what's on the left-hand side equal to what's on the right-hand side. And if I recap, I get that the expectation with respect to theta of the second derivative of the log of f theta is equal to minus-- because I had a minus sign here-- the expectation with respect to theta of the square of the derivative of log f theta with respect to theta. Thank you for being on watch when I'm falling apart.

All right, so this is exactly what you have here, except that both terms have been put on the same side. So those things are going to be useful to us, so maybe we should write them somewhere here. And so we have that the expectation of the second derivative of the log is equal to minus the expectation of the square of the first derivative. And this is, indeed, my Fisher information. This is just telling me what the second derivative of my log likelihood at theta is, right? So everything is with respect to theta when I take these expectations. And so it tells me that the expectation of the second derivative-- at least, first of all, what it's telling me is that it's concave, because the second derivative of this thing, which is the second derivative of the KL divergence, is actually minus something which must be non-negative. And so it's telling me that it's concave here at this [INAUDIBLE]. And in particular, it's also telling me that it has to be strictly positive, unless the derivative of f is equal to 0-- unless f is constant, then it's not going to change. All right, do you have a question?

So now, let's use this. What does my log likelihood look like when I actually compute it for this canonical exponential family? We have this exponential function, so taking the log should make my life much easier, and indeed, it does. So if I look at the canonical case, what I have is that the log of f theta of x is equal simply to y theta, minus b of theta, divided by phi, plus this function that does not depend on theta. So let's see what this tells me. Let's just plug those equalities in there. I can take the derivative of the right-hand side and just say that, in expectation, it's equal to 0. So if I start looking at the derivative, this is equal to what? Well, here I'm going to pick up only y. Sorry, this is a function of y-- I was talking about likelihood, so I actually need to put the random variable here. So I get y, minus the derivative of b of theta. Since it's only a function of theta, I'm just going to write b prime-- is that OK-- rather than having the partial with respect to theta.
Then, this is a constant. This does not depend on theta, so it goes away. So if I start taking the expectation of this guy, I get the expectation of this guy, which is the expectation of y, minus-- well, this does not depend on y, so it's just itself-- b prime of theta. And the whole thing is divided by phi. But from my first equality over there, I know that this thing is actually equal to 0. We just proved that. So in particular, it means that since phi is non-zero-- or, phi is not infinity-- this guy must be equal to this guy. And so that implies that the expectation with respect to theta of y is equal to b prime of theta.

I'm sorry, you're not registered in this class. I'm going to have to ask you to leave. I'm not kidding.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: You are? I've never seen you here. I saw you for the first lecture. OK.

All right, so the expectation with respect to theta of y is equal to b prime of theta. Everybody agrees with that? So this is actually nice, because if I give you an exponential family, the only thing I really need to tell you is what b of theta is. And if I give you b of theta, then computing a derivative is actually much easier than having to integrate y against the density itself. You could really have fun and try to compute this, which you would be able to do, right? And then, there's the plus c of y, phi, blah, blah, blah, dy. And that's the way you would actually compute this thing. Sorry, this guy is here. That would be painful. I don't know what this normalization looks like, so I would have to also make that explicit so I can actually compute this thing. And you know, just the same way, if you want to compute the expectation of a Gaussian-- well, the expectation of a Gaussian is not the most difficult one. But even if you compute the expectation of a Poisson, you start to have to work a little bit. There are a few things that you have to work through. Here, I'm just telling you: all you have to know is what b of theta is, and then you can just take the derivative.

Let's see what the second equality is going to give us. OK, so what is the second equality? It's telling me that if I look at the second derivative, and then I take its expectation, I'm going to have something which is equal to negative this guy squared. Sorry, that was the log, right? We've already computed this first derivative of the log likelihood. It's just the expectation of the square of this thing here. So the expectation of the square of the derivative with respect to theta of log f theta of x is equal to the expectation of the square of y minus b prime of theta, divided by phi squared. OK, sorry, I'm actually going to move on with the-- so if I start computing, what is this thing? Well, we just agreed that this was what? So that's just the expectation of y. We just computed it here.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, that's b prime. There's a derivative here.
So now, this is what? This is simply-- anyone? I'm sorry? The variance of y, but scaled by phi squared. OK, so this is the negative of the right-hand side of our identity. And now, I just have to take one more derivative of this guy. So now, if I look at the left-hand side, I have the second derivative of log f theta of y with respect to theta. So this thing is equal to-- well, I'm not left with much. The y part is going to go away, and I'm left only with minus the second derivative of b of theta, divided by phi. So if I take the expectation-- well, it just doesn't change; this is deterministic. So now, what I've established is that this guy is equal to negative this guy. So for those two things, the signs are going to go away. And so this implies that the variance of y is equal to b prime prime of theta-- and then, I have a phi squared in the denominator that cancels against only one power of phi, so it's times phi.

So now, I have that my second derivative-- since I know phi-- completely determines the variance. So basically, that's why b is called the cumulant generating function. It's not generating moments, but cumulants. But the cumulants, in this case, correspond, basically, to the moments, at least for the first two. If I start going farther, I'm going to have more combinations of the expectation of y cubed, y squared, and y itself. But as we know, those first two are the ones that are usually the most useful, at least if we're interested in asymptotic performance. The central limit theorem tells us that all that matters are the first two moments, and then the rest-- it doesn't matter; it's all going to [INAUDIBLE] anyway.

So let's go to a Poisson, for example. So if I had a Poisson distribution-- this is a discrete distribution. And what I know is that f-- let me call mu the parameter-- f mu of y is mu to the y, divided by y factorial, times exponential of minus mu. OK, so mu is usually called lambda, and y is usually called x; that's why it takes me a little bit of time. But usually it's lambda to the x over factorial of x, times exponential of minus lambda. Since this is just the series expansion of the exponential when I sum those things from 0 to infinity, this thing sums to 1. But then, if I wanted to start understanding what the expectation of this thing is-- so if I want to compute the expectation with respect to mu of y-- then I would have to compute the sum from k equals 0 to infinity of k, times mu to the k, over factorial of k, times exponential of minus mu, which means that I would, essentially, have to take the derivative of my series in the end. So I can do this. This is a standard exercise. You've probably done it when you took probability. But let's see if we can actually just read it off from the first derivative of b.
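[Note: collecting what has been derived so far, before it gets applied to the Poisson example below:

E_\theta\!\left[ \partial_\theta \log f_\theta(Y) \right] = 0, \qquad
E_\theta\!\left[ \partial^2_\theta \log f_\theta(Y) \right] = - E_\theta\!\left[ \left( \partial_\theta \log f_\theta(Y) \right)^2 \right],

and, for the canonical exponential family,

E_\theta[Y] = b'(\theta), \qquad \mathrm{Var}_\theta(Y) = \phi \, b''(\theta). ]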
So to do that, we need to write this in the form of an exponential, where there is one parameter that captures mu, which interacts with y just by taking this parameter times y; then something that depends only on y; and then something that depends only on mu. That's the important one-- that's going to be our b. And then, there's going to be something that depends only on y. So let's write this and check that this f mu, indeed, belongs to this canonical exponential family. So I definitely have an exponential that comes from this guy. So I have minus mu. And then, this thing is going to give me what? It's going to give me plus y log mu. And then, I'm going to have minus log of y factorial. So clearly, I have a term that depends only on mu, a term that depends only on y, and I have a product of y and something that depends on mu. If I want to be canonical, I must have this to be exactly the parameter theta itself. So I'm going to call this guy theta. So theta is log mu, which means that mu is equal to e to the theta. And so wherever I see mu, I'm going to replace it by [INAUDIBLE] theta, because my new parameter now is theta. So this is what? This is equal to the exponential of y times theta, and then I'm going to have minus e to the theta, and then-- who cares-- something that depends only on y. So this is my c of y, and phi is equal to 1 in this case. So that's all I care about.

So let's use it. So this is my canonical exponential family. Y interacts with theta exactly like this. And then, I have this function. So this function here must be b of theta. So from this function, exponential of theta, I'm supposed to be able to read off what the mean is.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Because since in this course I always know what the dispersion is, I can actually always absorb it into theta. But here, it's really of the form y times something, divided by 1, right? If it was like log of mu divided by phi, that would be the question of whether I want to call phi my dispersion, or if I want to just have it in there. This makes no difference in practice. But the real thing is: it's never going to happen that this dispersion is an exact number. If it's an actual numerical number, this just means that this number should be absorbed in the definition of theta. But if it's something that is called sigma, say, and I will assume that sigma is known, then it's probably preferable to keep it in the dispersion, so you can see that there's this parameter here that you can, essentially, play with. It doesn't make any difference when you know phi.

So now, if I look at the expectation of some y-- so now, I have y, which follows my Poisson with parameter mu. I'm going to look at the expectation, and I know that the expectation is b prime of theta. Agreed? That's what I just erased, I think. Agreed with this, the derivative? So what is this? Well, it's the derivative of e to the theta, which is e to the theta, which is mu. So my Poisson is parametrized by its mean.
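[Note: a quick numerical sanity check of this, as a sketch; the use of scipy here is my own choice, not something from the lecture. For the Poisson, b(\theta) = e^\theta with \phi = 1 and \theta = \log\mu, so b'(\theta) and b''(\theta) should both equal \mu; the variance identity is the one read off just below.]

```python
import numpy as np
from scipy.stats import poisson

mu = 3.7
theta = np.log(mu)          # canonical parameter theta = log(mu)
b_prime = np.exp(theta)     # b'(theta) = e^theta, should be the mean
b_second = np.exp(theta)    # b''(theta) = e^theta, should be the variance (phi = 1)

print(b_prime, poisson.mean(mu))   # 3.7  3.7
print(b_second, poisson.var(mu))   # 3.7  3.7
```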
I can also compute the variance, which is equal to minus the second derivative of-- no, it's equal to the second derivative of b, and the dispersion is equal to 1. Again, if I took phi elsewhere, I would see it here as well. So if I just absorbed phi here, I would see it divided here, so it would not make any difference. And what is the second derivative of the exponential? Still the exponential-- so it's still equal to mu. So that certainly makes our life easier.

Just one quick remark-- here's a function; I am giving you a problem. Can the b function ever be equal to log of theta? Who says yes? Who says no? Why?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah. So what we've learned from this-- it's sort of completely analytic, right? We just took derivatives, and this thing just happened. This thing actually allowed us to relate the second derivative of b to the variance. And one thing that we know about a variance is that it is non-negative. And in particular, here it's always positive. If they give you a canonical exponential family that has zero variance, trust me, you will see it. That means that this thing is not going to look like something that's finite; it's going to have a point mass. It's going to take the value infinity at one point. So this will, basically, never happen. This thing is, actually, strictly positive, which means that b is always strictly convex. It means that the second derivative of this function, b, has to be strictly positive, and so the function is convex. So this, log of theta, is concave, so it's definitely not working. I need to have something that looks like this when I talk about my b. So theta squared, e to the theta-- we'll see a bunch of them. And there's a bunch of them. But if you start writing something, and you find b-- try to think of the plot of b in your mind-- and you find that b looks like it's going to become concave, you've made a sign mistake somewhere.

All right, so we've done a pretty big parenthesis to try to characterize what the distribution of y was going to be. We wanted to extend from, say, Gaussian to something else. But when we're doing regression, which means generalized linear models, we are not interested in the distribution of y, but really the conditional distribution of y given x. So I need now to couple those back together. So what I know is that this same mu, in this case, which is the expectation-- what I want to say is that the conditional expectation of y given x is some mu of x. When we did linear models, we said: well, this thing is some x transpose beta, for linear models. And the whole premise of this chapter is to say: well, this might make no sense, because x transpose beta can take the entire range of real values, whereas this mu can take only a partial range. So even if you focus on the Poisson, for example, we know that the expectation of a Poisson has to be a non-negative number-- actually, a positive number as soon as you have a little bit of variance. It's mu itself-- mu is a positive number.
And so it's not going to make any sense to assume that mu of x is equal to x transpose beta, because you might find some x's for which this value ends up being negative. And so we're going to need what we call the link function: something that transforms mu, maps it onto the real line, so that you can now express it in the form x transpose beta. So we're going to take not this; we're going to assume that g of mu of x is now equal to x transpose beta, and that's the generalized linear model.

So as I said, it's weird to transform mu to make it take values on the real line. At least to me, it feels a bit more natural to take x transpose beta and make it fit the particular distribution that I want. And so I'm going to want to talk about g and g inverse at the same time. So I'm going to actually always take g. So g is my link function, and I'm going to want g to be continuously differentiable-- OK, let's say that it has a derivative, and its derivative is continuous. And I'm going to want g to be strictly increasing. And that actually implies that g inverse exists. Actually, that's not true. What I'm also going to want is that g of mu spans-- how do I do this?-- so I want g, as mu ranges over all possible values, whether they're all positive values, or whether they're values that are limited to the interval 0, 1-- I want those to span the entire real line, so that when I want to talk about g inverse defined over the entire real line, I know where I started. So this implies that g inverse exists.

What else does it imply about g inverse? So for a function to be invertible, I only need it to be strictly monotone. I don't need it to be strictly increasing. So in particular, the fact that I picked increasing implies that this guy, g inverse, is actually increasing.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: That's the image. So this is my link function, and this slide is just telling me I want my function to be invertible, so I can talk about g inverse. I'm going to switch between the two. So what link functions am I going to get? For linear models, we just said there's no link function, which is the same as saying that the link function is the identity, which certainly satisfies all these conditions. It's invertible, it has all these nice properties, but we might as well not talk about it. For Poisson data, when we assume that the conditional distribution of y given x is Poisson, the mu, as I just said, is required to be positive. So I need a g that goes from the interval 0, infinity to the entire real line. I need a function that takes the positive half-line and spreads it over both positive and negative values. And here, for example, I could take the log link. The log is defined on this entire interval. And as I range from 0 to plus infinity, the log ranges from negative infinity to plus infinity. You can probably think of other functions that do that, like 2 times log. That's another one. But there are many others you can think of.
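[Note: a tiny illustration of this point, with made-up numbers and assuming numpy: the linear predictor x transpose beta can be negative, which is not a valid Poisson mean, but its image under the inverse of the log link, the exponential, always is.]

```python
import numpy as np

beta = np.array([0.5, -2.0])   # hypothetical coefficients
x = np.array([1.0, 1.3])       # hypothetical covariates
eta = x @ beta                 # x^T beta

print(eta)          # -2.1: not a valid Poisson mean under the identity link
print(np.exp(eta))  # about 0.12: positive, so a valid mean under the log link
```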
But let's say the log is one of them that you might want to think about. It is natural in the sense that it's one of the first functions we can think of. We will see, also, that it has another canonical property that makes it a natural choice.

The other one is the other example, where we had an even stronger condition on what mu could be. Mu could only be a number between 0 and 1: that was the probability of success of a coin flip-- the probability of success of a Bernoulli random variable. And now, I need g to map 0, 1 to the entire real line. And so here are a bunch of things that you can come up with, because now you start to have maybe-- I will soon claim that this one, log of mu divided by 1 minus mu, is the most natural one. But maybe, if you had never thought of this, that might not be the first function you would come up with, right? You mentioned trigonometric functions, for example, so maybe you can come up with something that comes from hyperbolic trigonometry or something. So what does this function do? Well, we'll see a picture, but this function does map the interval 0, 1 to the entire real line.

We also discussed the fact that, if we think reciprocally-- what I want, if I want to think about g inverse, is a function that maps the entire real line into the unit interval. And as we said, if I'm not a very creative statistician or probabilist, I can just pick my favorite continuous, strictly increasing cumulative distribution function, which, as we know, will arise as soon as I have a density that has support on the entire real line. If I have support everywhere-- if the density is strictly positive everywhere-- then it means that my cumulative distribution function has to be strictly increasing. And of course, it has to go from 0 to 1, because that's just the nature of those things. And so for example, I can take the Gaussian; that's one such function. But I could also take the double exponential, which looks like an exponential on one end and then an exponential on the other end. And basically, if you take capital Phi, which is the standard Gaussian cumulative distribution function, it does work for you, and you can take its inverse. And in this case-- we haven't talked about it-- this guy is called logit, or logit, and this guy is called probit. And you see it, usually, every time you have a package on generalized linear models that you are trying to implement: you have this choice. And for what's called logistic regression-- so it's funny that it's called logistic regression, but you can actually use the probit link, which in this case is called probit regression. But those things are essentially equivalent, and it's really a matter of taste-- maybe of communities; some communities might prefer one over the other. We'll see that, as I claimed before, the logistic one, the logit one, has a slightly more compelling argument for its reason to exist. I guess for this one, the probit, the compelling argument is that it involves the standard Gaussian, which, of course, is something that should show up everywhere.
And then, you can think about crazy stuff. Even crazy gets a name: the complementary log-log, which is the log of minus the log of 1 minus mu. Why not? So I guess you can iterate that thing. You can just put a log of 1 minus in front of this thing, and it's still going to go. So-- that's not true. I have to put a minus and take-- no, that's not true. So you can think of whatever you want.

So I claimed that the logit link is the natural choice, so here's a picture. I should have actually plotted the other one, so we can actually compare it. To be fair, I don't even remember how it would actually fit between those two functions. So the blue one, which is this one, for those of you who don't see the difference between blue and red-- sorry about that-- the blue one is the logistic one. So this guy is the function e to the x, over 1 plus e to the x. As you can see, this is a function that's supposed to map the entire real line into the interval 0, 1. So that's supposed to be the inverse of your link function, and I claim that this is the inverse of the logistic, of the logit function. And the red one-- well, this is the Gaussian CDF, so it's clearly the inverse of the probit link, the inverse of the inverse of the Gaussian CDF. That's the red one; that's the one that goes here. I would guess that the complementary log-log is something that probably goes above here, and for which the slope is actually even a little flatter as you cross 0.

So of course, these are not our link functions. These are the inverses of our link functions. So what do they look like when I actually, basically, flip my picture like this? So this is what I see. And so I can see that, in blue, this is my logistic link. It crosses 0 with a slightly faster rate. Remember, if we could use the identity, that would be very nice for us. We would just want to take the identity. The problem is that if I start having the identity that goes here, it's going to start being a problem. And this is the probit link, the Phi inverse, that you see here. It's a little flatter.

You can compute the derivative at the center of those guys. What is the derivative of the-- so I'm taking the derivative of log of x over 1 minus x. So it's 1 over x, minus 1 over 1 minus x. So if I look at 0.5-- sorry, this is the interval 0, 1, so I'm interested in the slope at 0.5. Yes, it's plus, thank you. So at 0.5, what I get is 2 plus 2. Yeah, so that's the slope that we get. And if you compute the derivative of the other one-- what is the derivative of Phi inverse? Well, it's 1 divided by little phi of capital Phi inverse of x. So little phi at 1/2-- I don't know. Yeah, I guess I can probably compute the derivative of the capital Phi at 0, which is going to be just 1 over square root of 2 pi, and then just say: well, the slope has to be 1 over that-- square root of 2 pi. So that's just a comparison. But again, so far, we do not have any reason to prefer one to the other. And so now, I'm going to start giving you some reasons to prefer one to the other.
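[Note: a quick check of the two slopes just computed, as a sketch assuming scipy: the derivative of the logit at \mu = 1/2 is 1/\mu + 1/(1-\mu) = 4, and the derivative of \Phi^{-1} at 1/2 is 1/\varphi(0) = \sqrt{2\pi} \approx 2.51, so the probit link is indeed a little flatter at the center.]

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

# Both inverse links map the real line into (0, 1).
z = np.array([-3.0, 0.0, 3.0])
print(expit(z))      # logistic: e^z / (1 + e^z)
print(norm.cdf(z))   # inverse probit: standard Gaussian CDF

# Slopes of the two links at mu = 1/2.
mu = 0.5
print(1 / mu + 1 / (1 - mu))          # logit'(1/2) = 4
print(1 / norm.pdf(norm.ppf(mu)))     # (Phi^{-1})'(1/2) = sqrt(2*pi) ~ 2.5066
```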
And one of those two-- actually, for each canonical family, there is something which is called the canonical link. And when you don't have any other reason to choose anything else, why not choose the canonical one? And the canonical link is the one that says: OK, what I want is g to map mu onto the real line. But mu is not the parameter of my canonical family. Here, for example, mu is e to the theta, but the canonical parameter is theta. But the parameter of a canonical exponential family is something that lives on the entire real line. It was defined for all thetas. And so in particular, I can just take theta to be the thing that's x transpose beta. And so in particular, I'm just going to try to find the link that just says: OK, when I take g of mu, I'm going to map it-- so that's what it's going to be. So I know that g of mu is going to be equal to x transpose beta. And now, what I'm going to say is: OK, let's just take the g that makes this guy equal to theta, so that it is theta that I actually model as x transpose beta. Feels pretty canonical, right? What else? What other natural, easy choice would you take? This was pretty easy. There is a natural parameter for this canonical family, and it takes values on the entire real line. I have a function that maps mu onto the entire real line, so let's just map it to the actual parameter.

So now, OK, why do I have this? Well, we've already figured that out. The canonical link function is strictly increasing. Sorry-- so I said that now I want this guy-- I want g of mu to be equal to theta, which is equivalent to saying that I want mu to be equal to g inverse of theta. But we know that mu is what? b prime of theta. So that means that b prime is the same function as g inverse. And I claim that this is actually giving me, indeed, a function that has the properties that I want. Because before, I said: just pick any function that has these properties. And now, I'm giving you a very hard rule to pick this, though you still need to check that it satisfies those conditions-- in particular, that it's increasing and invertible. And for this to be strictly increasing and invertible, really what I need is that the inverse is strictly increasing and invertible, which is the case here, because b prime, as we said-- well, b prime is the derivative of a strictly convex function. A strictly convex function has a second derivative that's strictly positive. We just figured that out using the fact that the variance was strictly positive. And if phi is strictly positive, then this thing has to be strictly positive. So b prime prime is strictly positive-- this is the derivative of a function called b prime-- and if your derivative is strictly positive, you are strictly increasing. And so we know that b prime is, indeed, strictly increasing. And what I need also to check-- well, I guess this is already checked on its own, because b prime is actually mapping all of R into the possible values.
When theta ranges over the entire real line, then b prime ranges over the entire interval of the mean values that it can take. And so now, I have this thing that's completely defined. B prime inverse is a valid link, and it's called the canonical link.

OK, so again, if I give you an exponential family-- which is another way of saying I give you a convex function, b, which gives you some exponential family-- then if you just take b prime inverse, this gives you the associated canonical link for this canonical exponential family. So clearly, there's an advantage to doing this, which is that I don't have to actually think about which one to pick, if I don't want to think about it. But there are other advantages that come with it, and we'll see that in the representations. There are, basically, going to be some nice cancellations that show up.

So before we go there, let's just compute the canonical link for the Bernoulli distribution. So remember, the Bernoulli distribution has a PMF which is part of the canonical exponential family. So the PMF of the Bernoulli is f theta of x-- let me just write it like this. So it's p to the y, let's say, times 1 minus p to the 1 minus y, which I will write as the exponential of y log p, plus 1 minus y times log of 1 minus p. OK, we did that last time. Now, I'm going to group my terms in y to see how y interacts with this parameter p. And what I'm getting is y times log of p divided by 1 minus p. And then, the only term that remains is log of 1 minus p. Now, I want this to be a canonical exponential family, which means that I just need to call this guy-- so it is part of the exponential family; you can read that. If I want it to be canonical, this guy must be theta itself. So I have that theta is equal to log of p over 1 minus p. If I invert this thing, it tells me that p is e to the theta, divided by 1 plus e to the theta. It's just inverting this function. In particular, it means that log of 1 minus p is equal to the log of 1 minus this thing. So the exponential thetas go away in the numerator-- this is what I get-- and the log of 1 minus this guy is equal to minus log of 1 plus e to the theta. So I'm going a bit too fast, but these are very elementary manipulations-- maybe it requires one more line to convince yourself. But just do it in the comfort of your room.

And then, what you have is the exponential of y times theta, and then I have minus log of 1 plus e to the theta. So this is the representation of the PMF of a Bernoulli distribution as a member of the canonical exponential family. And it tells me that b of theta is equal to log of 1 plus e to the theta. That's what I have there. From there, I can compute the expectation, which, hopefully-- I'm going to get p as the mean and p times 1 minus p as the variance. Otherwise, that would be weird. So let's just do this. B prime of theta should give me the mean. And indeed, b prime of theta is e to the theta, divided by 1 plus e to the theta, which is exactly this p that I had there. OK, just for fun-- well, I don't know. Maybe that's not part of it.
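[Note: a symbolic check of the computation just done, as a sketch assuming sympy: starting from b(\theta) = \log(1 + e^\theta), differentiating and then inverting should recover the logit.]

```python
import math
import sympy as sp

theta, p = sp.symbols('theta p', positive=True)
b = sp.log(1 + sp.exp(theta))              # Bernoulli cumulant function b(theta)

b_prime = sp.simplify(sp.diff(b, theta))   # e^theta / (1 + e^theta), i.e. p
sol = sp.solve(sp.Eq(p, b_prime), theta)   # invert b'
print(b_prime)
print(sol[0])                              # log(p/(1 - p)) up to rewriting: the logit

# Numerical spot check at p = 0.3.
print(float(sol[0].subs(p, 0.3)), math.log(0.3 / 0.7))   # both about -0.847
```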
Yeah, let's not compute the second derivative. That's probably going to be on your homework at some point-- if not, on the final.

So b prime inverse now-- oh, I erased it, of course. G, the canonical link, is b prime inverse. And I claim that this is going to give me the logit function: log of mu over 1 minus mu. So let's check that. So b prime is this thing, so now I want to find the inverse. Well, I should really call my inverse a function of p. And I've done it before-- all I have to do is to solve this equation, which I've actually just done; that's where I'm actually coming from. So it's actually telling me that the solution of this thing is equal to log of p over 1 minus p. We just solved this thing both ways. And this is, indeed, logit of p, by definition of the logit. So b prime inverse, this function that seemed to come out of nowhere, is really just the inverse of b prime, which we know is the canonical link. And canonical is some sort of ad hoc choice that we've made by saying: let's just take the link such that g of mu gives me the actual canonical parameter theta. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: You're right. Now, of course, I'm going through all this trouble, but you could see it immediately. I know this is going to be theta. We also have prior knowledge, hopefully, that the expectation of a Bernoulli is p itself. So right at this step, when I said that I was going to take theta to be this guy, I already knew that the canonical link was the logit-- because I just said: oh, here's theta, and it's just this function of mu [INAUDIBLE].

OK, so you can do that for a bunch of examples, and this is what they're going to give you. So in the Gaussian case, b of theta-- we've actually computed it once-- is theta squared over 2. So the derivative of this thing is really just theta, which means that g, or g inverse, is actually equal to the identity. And again, sanity check: when I'm in the Gaussian case, there's nothing generalized about generalized linear models if you don't have a link. The Poisson case-- you can actually check. Did we do this, actually? Yes, we did. So that's when we had this e to the theta. And so b is e to the theta, which means that the natural link is the inverse of its derivative, which is the log, the inverse of the exponential. And so that's the logarithm link-- and as I said, I used the word natural; you can also use the word canonical if you want to describe this function as being the right function to map the positive real line to the entire real line. The Bernoulli-- we just did it. So b, the cumulant generating function, is log of 1 plus e to the theta, and the canonical link is log of mu over 1 minus mu. And the gamma-- where the thing you're going to see is minus log of minus [INAUDIBLE]-- you see the reciprocal link is the link that actually shows up, so minus 1 over mu. That maps.

So are there any questions about the canonical links, canonical families? I use the word canonical a lot. But is everything fitting together right now? So we have this function.
We have a canonical exponential family, by assumption. It has a function, b, which contains all the information we want. At the beginning of the lecture, we established that it has information about the mean in the first derivative, about the variance in the second derivative, and it's also giving us a canonical link. So just cherish this b once you've found it, because it's everything you need. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I don't know-- a political preference? I don't know, honestly. If I were a serious practitioner, I probably would have a better answer for you. At this point, I just don't. I think it's a matter of practice and actual preferences. You can also try both. We didn't mention it, but there's this idea of cross-validation-- well, we mentioned it without going too much into detail. But you could try both and see which one performs best on a yet unseen data set. In terms of prediction, just say: I prefer this one of the two. Because this actually comes as part of your modeling assumption, right? Not only did you decide to model the image of mu through the link function as a linear model, but really, what your model is saying is: well, you have two pieces of [INAUDIBLE]-- the distribution of y, but you also have the fact that mu is modeled as g inverse of x transpose beta. And for different g's, these are just different modeling assumptions, right? So why should this be linear-- I don't know. My authority, as a person who has not examined the [INAUDIBLE] data sets for both things, would be that the changes are fairly minor.

OK, so this was all for one observation. We just, basically, did probability. We described some density, some properties of the densities, how to compute expectations. That was really just probability. There was no data involved at any point. We did a bit of modeling, but it was all for one observation. What we're going to try to do now is the reverse engineering of probability that is statistics: given data, what can I infer about my model? Now remember, there are three parameters floating around in this model. There is one that is theta, there is one that is mu, and there is one that is beta. OK, so those are the three parameters that are floating around. What we said is that the expectation of y, given x, is mu of x. So if I estimate mu, I know the conditional expectation of y given x, which definitely gives me theta of x. How do I go from mu of x to theta of x? The inverse of what-- of the arrow? Yeah, sure, but how do I go from this guy to this guy? So theta, as a function of mu, is?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, so we just computed that mu was b prime of theta. So it means that theta is just b prime inverse of mu. So those two things are the same as far as we're concerned, because we know that b prime is strictly increasing, so it's invertible. So it's just a matter of re-parametrization, and we can switch from one to the other whenever we want.
But why do we go through mu? Because so far, for the entire semester, I told you there was one parameter, theta. It does not have to be the mean, and that's the parameter that we care about. It's the one on which we want to do inference. That's the one for which we're going to compute the Fisher information. This was the parameter that was our object of worship. And now, I'm saying: oh, I'm going to have mu coming around. And why do we have mu? Because this is the mu that we use to go to beta. So I can go freely from theta to mu using b prime or b prime inverse. And now, I can go from mu to beta, because I have that g of mu of x is beta transpose x. So in the end, now, this is going to be my object of worship. This is going to be the parameter that matters. Because once I set beta, I set everything else through this chain.

So the question is: if I start stacking up this pile of parameters-- so I start with my beta, which in turn gives me a mu, which in turn gives me a theta-- can I just have a long, streamlined-- what is the outcome when I actually start writing my likelihood, not as a function of theta, not as a function of mu, but as a function of beta, which is the one at the end of the chain? And hopefully, things are going to happen nicely-- and they might not. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Is g-- that's my link. G of mu of x-- now, mu is a function of x, because it's conditional on x. So this is really theta of x, mu of x; but g is not a function of x, because it's just something that tells me what the function of x is.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Mu is the conditional expectation of y, given x. It has, actually, a fancy name in the statistics literature. It's called-- anybody know the name of the function mu of x, which is the conditional expectation of y given x?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: That's the regression function. That's the actual definition. If I tell you what the definition of the regression function is, it's just the conditional expectation of y, given x. And I could look at any property of the conditional distribution of y given x. I could look at the conditional 95th percentile. I can look at the conditional median. I can look at the conditional [INAUDIBLE] range. I can look at the conditional variance. But I decide to look at the conditional expectation, which is called the regression function. Yes?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, there's no transpose here. Actually, only Victor-Emmanuel used this prime for transpose, and I found it confusing with the derivatives. So the prime here is only a derivative.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh, yeah, sorry, beta transpose x. So you said what? I said that g of mu of x is beta transpose x?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Isn't that the same thing? X is a vector here, right?

AUDIENCE: Yeah.

PHILIPPE RIGOLLET: So x transpose beta and beta transpose x are the same thing.
AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So beta looks like this. X looks like this. It's just a simple number. Yeah, you're right-- once I start to look at matrices, I'm going to have to be slightly more careful when I do this.

OK, so let's do the reverse engineering. I'm giving you data. From this data, hopefully, you should be able to get what the conditional-- if I gave you an infinite amount of data, of pairs x, y, you would know exactly what the conditional distribution of y given x is. And in particular, you would know what the conditional expectation of y given x is, which means that you would know mu, which means that you would know theta, which means that you would know beta. Now, when I have a finite number of observations, I'm going to try to estimate mu of x. But really, I'm going to go the other way around. Because the fact that I assume, specifically, that mu of x is of this form-- that g of mu of x is x transpose beta-- means that I only have to estimate beta, which is a much simpler object than the entire regression function. So that's what I'm going to go for. I'm going to try to represent the likelihood, the log likelihood, of my data as a function, not of theta, not of mu, but of beta-- and then maximize that guy.

So now, rather than thinking of just one observation, I'm going to have a bunch of observations. So this might actually look a little confusing, but let's just make sure that we understand each other before we go any further. So I'm going to have observations x1, y1, all the way to xn, yn, just like in an ordinary regression problem, except that here my y's might be 0/1 valued. They might be positive valued. They might be exponential. They might be anything in the canonical exponential family. OK, so I have this thing, and now, what I have is that my observations are x1, y1, up to xn, yn. And what I want is-- I'm going to assume that the conditional expectation of yi, given-- the conditional distribution of yi, given xi, is something that has this density. Did I put an i on y? Yeah. I'm not going to deal with the phi and the c now. And why do I have theta i, and not theta? Because theta i is really a function of xi. So it's really theta i of xi. But what do I know about theta i of xi? It's actually equal to b-- I did this error twice-- b prime inverse of mu of xi. And I'm going to assume that this is of the form beta transpose xi. And this is why I have theta i-- because this theta i is a function of xi, and I'm going to assume a very simple form for this thing. Sorry, sorry, sorry, sorry-- I should not write it like this. That is only when I have the canonical link. So this is actually equal to b prime inverse of g inverse of xi transpose beta. And with the canonical link, those two things actually cancel each other.

So as before, I'm going to stack everything into some-- well, actually, I'm not going to stack anything for the moment. I'm just going to give you a peek at what's happening next week, rather than just manipulating the data.
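[Note: in symbols, the model for the n observations just described, with nothing added beyond notation:

Y_i \mid X_i = x_i \;\sim\; f_{\theta_i}(y) = \exp\!\left( \frac{y\,\theta_i - b(\theta_i)}{\phi} + c(y, \phi) \right),
\qquad
\theta_i = (b')^{-1}\!\big( \mu(x_i) \big) = (b')^{-1}\!\big( g^{-1}(x_i^\top \beta) \big),

and with the canonical link g = (b')^{-1}, this reduces to \theta_i = x_i^\top \beta. ]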
72:28 So here is how we're going to proceed at this point. 72:33 Well, now I want to write my likelihood function 72:36 not as a function of theta, but as a function of beta, 72:39 because that's the parameter over which I'm actually trying to maximize the likelihood. 72:44 So if I have a link-- 72:47 so this thing that matters here, I'm going to call h. 72:50 72:53 By definition, this is going to be h of xi transpose beta. 72:58 Helena, you have a question? 73:00 AUDIENCE: Uh, no [INAUDIBLE] 73:02 PHILIPPE RIGOLLET: So this is just all the things 73:04 that we know. 73:04 By definition of the fact that mu 73:09 is b prime of theta-- the mean is b prime of theta-- 73:11 it means that theta is b prime inverse of mu. 73:14 And then, mu is modeled from the systematic component. 73:19 G of mu is xi transpose beta, so this is 73:21 g inverse of xi transpose beta. 73:23 So I want to have b prime inverse of g inverse. 73:27 This function is a bit annoying to say, 73:30 so I'm just going to call it h. 73:32 And when I do the composition of two inverses, 73:34 I get the inverse of the composition of those two things 73:36 in the reverse order-- 73:38 so h is really the inverse of g composed with b 73:42 prime, that is, (g of b prime) inverse. 73:46 And now, if I have the canonical link, 73:48 since I know that g is b prime inverse, 73:51 this is really just the identity. 73:54 As you can imagine, this entire thing, 73:58 which is actually quite complicated-- 73:59 you would just say, oh, this thing actually does not show up 74:01 when I have the canonical link. 74:03 I really just have that theta can be replaced by xi transpose beta. 74:06 So think about going back to this guy here. 74:09 Now, theta becomes only xi transpose beta. 74:15 That's going to be much simpler to optimize, 74:18 because remember, when I go to the log likelihood, 74:20 this thing is going to go away. 74:21 I'm going to sum those guys. 74:23 And so what I'm going to have is something which 74:24 is essentially linear in beta. 74:26 And then, I'm going to have this minus b, 74:28 which is just minus the sum of convex functions of beta. 74:31 And so I'm going to have to bring in the tools of convex 74:34 optimization. 74:34 Now, it's not just going to be take the gradient, set it to 0. 74:37 It's going to be more complicated to do that. 74:39 I'm going to have to do that in an iterative fashion. 74:42 And so that's what I'm telling you, 74:43 when you look at your log likelihood for all 74:46 those functions. 74:47 You sum, the exponential goes away because you had the log, 74:50 and then, you have all these things here. 74:51 I kept the b. 74:52 I kept the h. 74:53 But if h is the identity, this is the linear function, 74:56 the linear part, yi times xi transpose 74:59 beta, minus b of my theta, which is now only xi transpose beta. 75:03 And that's the function I want to maximize in beta. 75:05 75:10 It's a concave function, so maximizing it is a convex optimization problem. 75:11 When I know what b is, I have an explicit formula for this, 75:15 and I want to just bring in some optimization. 75:18 And that's what we're going to do, 75:19 and we're going to see three different methods, which 75:21 are really, basically, the same method. 75:24 It's just an adaptation or specialization 75:28 of the so-called Newton-Raphson method, which is essentially 75:31 telling you to do iterative local quadratic approximations 75:34 to your function-- so take a second order 75:36 [INAUDIBLE] expansion, minimize this guy, 75:38 and then do it again from where you were.
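Since the board work is not reproduced in the transcript, here is a minimal Python sketch of this maximization for the Bernoulli (logistic) case with the canonical link, where the log likelihood is sum_i [ y_i x_i' beta - b(x_i' beta) ] with b(theta) = log(1 + e^theta). The function names and the plain Newton iteration below are illustrative assumptions, not the lecture's implementation:

```python
import numpy as np

def b(theta):
    # Cumulant function for the Bernoulli case: b(theta) = log(1 + e^theta).
    return np.logaddexp(0.0, theta)

def log_likelihood(beta, X, y):
    # Canonical-link log likelihood: sum_i [ y_i * (x_i' beta) - b(x_i' beta) ].
    eta = X @ beta
    return np.sum(y * eta - b(eta))

def newton_raphson(X, y, n_iter=25):
    # Illustrative Newton-Raphson ascent on the concave log likelihood above.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))    # mu = b'(theta), the fitted means
        W = mu * (1.0 - mu)                # b''(theta), the variance function
        grad = X.T @ (y - mu)              # gradient of the log likelihood
        hess = X.T @ (X * W[:, None])      # X' W X = minus the Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta
```

The Hessian here is minus X' W X with W = diag(mu_i (1 - mu_i)), which is what makes the weighted least squares reformulation mentioned next possible.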
75:41 And we'll see that this can actually be 75:43 implemented using what's called iteratively re-weighted least 75:47 squares, which means that every step-- 75:49 since it's just a quadratic, it's 75:51 going to be just squares in there-- 75:53 can actually be solved by using a weighted least 75:56 squares version of the problem. 75:59 So I'm going to stop here for today. 76:02 So we'll continue and probably not finish this chapter today, 76:05 but finish next week. 76:07 And then, I think there's only one lecture left. 76:10 Actually, for the last lecture, what do you guys want to do? 76:13 76:16 Do you want to have doughnuts and cider? 76:18 Do you want to just have a more outlook-style lecture 76:25 on what's happening post-1975 in statistics? 76:31 Do you want to have a review for the final exam-- 76:36 pragmatic people. 76:38 AUDIENCE: [INAUDIBLE] interesting, advanced topics. 76:43 PHILIPPE RIGOLLET: You want to do interesting, advanced-- 76:46 for the last lecture? 76:48 AUDIENCE: Something that we haven't thought of yet. 76:50 PHILIPPE RIGOLLET: Yeah, that's basically what I'm asking, 76:53 right-- interesting, advanced topics, 76:55 versus ask me any question you want. 77:00 Those questions can be about interesting, advanced topics, 77:03 though. 77:03 Like, what are interesting, advanced topics? 77:06 I'm sorry? 77:06 AUDIENCE: Interesting with doughnuts-- is that OK? 77:08 PHILIPPE RIGOLLET: Yeah, we can always do the doughnuts. 77:10 [LAUGHTER] 77:11 AUDIENCE: As long as there are doughnuts. 77:14 PHILIPPE RIGOLLET: All right, so we'll do that. 77:16 So you guys have a good weekend.
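As a closing illustration of the iteratively re-weighted least squares idea mentioned above, here is a minimal sketch for the same Bernoulli (logistic) case. The working response z and the weights W are the standard ingredients; the specific function is an added example, not part of the lecture:

```python
import numpy as np

def irls(X, y, n_iter=25):
    # Iteratively re-weighted least squares for the logistic case.
    # Each pass solves a weighted least squares problem in beta and is
    # algebraically the same update as the Newton-Raphson step sketched earlier.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        mu = np.clip(mu, 1e-10, 1 - 1e-10)  # keep the weights strictly positive
        W = mu * (1.0 - mu)                 # weights = variance function
        z = eta + (y - mu) / W              # "working response"
        XtW = X.T * W                       # X' diag(W)
        # Weighted least squares: beta = (X' W X)^{-1} X' W z
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta
```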