https://www.youtube.com/watch?v=OYcdw5vOgIc&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=21 Transcript 00:00 00:00 The following content is provided under a Creative 00:02 Commons license. 00:03 Your support will help MIT OpenCourseWare 00:06 continue to offer high quality educational resources for free. 00:10 To make a donation or to view additional materials 00:12 from hundreds of MIT courses, visit MIT OpenCourseWare 00:16 at ocw.mit.edu. 00:17 00:32 PHILIPPE RIGOLLET: All right. 00:34 So let's continue talking about maximum likelihood estimation 00:41 in the context of generalized linear models, all right? 00:43 So in those generalized linear models, 00:45 what we spent most of the past lectures working on 00:49 is the conditional distribution of Y given 00:55 X. And we're going to assume that this 00:59 follows some distribution in the exponential family. 01:08 01:12 OK. 01:13 And so what it means is that if we look at the density, say-- 01:18 or the PMF, but let's talk about density 01:21 to make things clearer-- 01:23 we're going to assume that Y given X has a distribution. 01:29 So X is now fixed, because we're conditioning on it. 01:31 And it has a density, which is of this form, c of Yi phi. 01:50 OK. 01:50 So this c, again, we don't really need to think about it. 01:53 This is something that's going to come up naturally 01:55 as soon as you need a normalization factor. 01:58 And so here what it means, if this is the distribution of Y 02:02 given Xi, so that's the density of Yi 02:08 given Xi is equal to little xi. 02:11 So if it's the conditional distribution of Yi given Xi, 02:15 it should depend on xi somehow. 02:18 And it does not appear to depend on Xi. 02:20 And here, the model is going to be on theta i, which is just 02:26 a function, theta i of Xi. 02:29 And we're going to take a very specific one. 02:30 It's going to be a function of a linear form of the Xi. 02:38 So really we're going to take something which is of the form 02:40 theta i, which is really just-- the function theta does not depend 02:43 on i-- 02:44 a function of Xi transpose beta. 02:47 OK? 02:47 So all these parts here, this is really some modeling 02:50 assumptions that we're making once we've agreed 02:54 on what distribution we want. 02:55 OK. 02:56 So to do that, our goal, of course, 02:58 is going to try to understand what this beta is. 03:01 There's one beta here. 03:02 What's important is that this beta does not depend on i. 03:12 So if I observe pairs Xi, Yi-- 03:17 let's say I observe n of them, i equals 1 to n-- 03:22 the hope is that as I accumulate more and more pairs 03:25 of this form where there's always the same parameter that 03:29 links Xi to Yi, that's this parameter beta, 03:32 that I should have a better and better estimation of this beta. 03:35 Because it's always the same. 03:37 And that's essentially what couples 03:38 all of our distributions. 03:40 If I did not assume this, then I could 03:42 have a different distribution for each pair Xi, Yi. 03:46 And I would not be able to do any statistics. 03:48 Nothing would average in the end. 03:50 But here I have the same beta, which 03:52 means that I can hope to do statistics and average 03:54 errors in the end. 03:56 OK. 03:56 So I'm going to collect, so I'll come back to this. 04:00 But as usual in the linear regression model, 04:02 we're going to collect all our observations Yi. 04:05 So I'm going to assume that they're real valued 04:08 and that my Xi's take values in Rp 04:10 just like in the regression model. 
04:12 And I'm going to collect all my Yi's into one big vector of Y 04:16 in our n and all my X's into one big matrix in Rn times p 04:23 just like for the linear regression model. 04:26 All right, so, again, what I'm interested in here 04:30 is the conditional distribution of Yi given Xi. 04:36 OK. 04:36 I said this is this distribution. 04:39 When we're talking about regression, 04:40 I defined last time what the definition 04:42 of regression function was. 04:44 It's just one particular aspect of 04:46 this conventional distribution. 04:48 It's the conditional expectation of Yi given Xi. 04:51 OK. 04:52 And so this conditional expectation, 04:53 I will denote it by-- 04:57 so I talk about the conditional, I'm 05:06 going to call it, say, mu i, which 05:08 is the conditional expectation of Yi given 05:11 Xi equals some little xi, say. 05:15 You can forget about this part if you find it confusing. 05:17 It really doesn't matter. 05:18 It's just that this means that this 05:21 is a function of little xi. 05:25 But if I only had the expectation of Yi given big Xi, 05:29 this would be just a function of big Xi. 05:32 So it really doesn't change anything. 05:34 It's just a matter of notation. 05:36 OK. 05:36 So just forget about this part. 05:39 But I'll just do it like that here. 05:41 OK. 05:42 So this is just the conditional expectation of Yi given Xi. 05:46 It just depends on Xi, so I think it depends on i, 05:50 and so I will call it mu i. 05:52 But I know that since in a canonical exponential family, 05:55 then I know that mu i is actually B prime of theta i. 06:00 OK. 06:01 So there's a 1 to 1 link between the canonical parameter 06:04 of my exponential family and the mean mu 06:07 i, the conditional expectation. 06:09 And the modeling assumption we're going to make 06:12 is not directly-- 06:14 remember, that was the second aspect 06:15 of the generalized linear model. 06:17 We're not going to assume that theta i itself directly 06:20 depends on Xi. 06:22 We're going to assume that mu i has 06:23 a particular dependence on Xi through the link function. 06:28 So, again, we're back to modeling. 06:31 So we have a link function g. 06:33 06:38 And we assume that mu i depends on Xi as follows. 06:51 06:56 g of mu i-- 06:58 and remember, all g does for us is really 07:02 map the space in which mu i lives, which 07:05 could be just the interval 0, 1 to the entire real line, 07:08 all right? 07:09 And we're going to assume that this thing that 07:11 lives in the real line is just Xi transpose beta. 07:14 I should maybe put a small one, Xi transpose beta. 07:17 OK? 07:17 07:20 So we're making, indeed, some modeling assumption. 07:23 But compared to in the linear regression model, 07:26 we only assume that mu i was Xi transpose beta. 07:29 So if you want to make a parallel 07:30 between generalized linear models 07:32 and linear model is the only difference 07:35 is that g is not the identity necessarily in this case. 07:39 And all the g does for us is to just 07:42 make this thing compatible, that those two 07:45 things on the left and the right of equality 07:48 live in the same space. 07:50 So in a way, we're not making a much bigger leap of faith 07:56 by assuming a linear model. 07:57 The linear link is already here. 07:59 We're just making things compatible, all right? 08:04 And so it's always the same link function. 
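To keep the pieces of the model straight, here is a compact restatement, in LaTeX, of what has been described so far; the notation follows the lecture (b, phi, c are the exponential family ingredients, g is a generic link function).

```latex
% Conditional density of Y_i given X_i = x_i (canonical exponential family):
f_{\theta_i}(y_i) \;=\; \exp\!\Big(\frac{y_i\,\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi)\Big),
\qquad \theta_i = \theta(x_i).

% Mean (regression function) and link function:
\mu_i \;=\; \mathbb{E}[Y_i \mid X_i = x_i] \;=\; b'(\theta_i),
\qquad g(\mu_i) \;=\; x_i^\top \beta .
```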
08:09 So now if I want to go back to beta-- 08:12 right, because I'm going to want to express my likelihood-- 08:16 if I were to express my likelihood from this, 08:18 it would just be a function of theta, right? 08:21 And so if I want to maximize my likelihood, 08:22 I don't want to maximize it in theta. 08:24 I want to maximize it in beta. 08:26 So if I can write my density as a function of beta, 08:29 then I will be able to write my likelihood 08:31 as a function of beta, and then talk 08:33 about my maximum likelihood estimator. 08:35 And so all they need to do is to just say, OK, 08:37 how do I replace theta by-- 08:39 I know that theta is a function of beta, right? 08:42 I wrote it here. 08:45 So the question is, what is this function? 08:46 And I actually have access to all of this. 08:49 So what I know is that theta-- 08:50 right, so mu is b prime of theta, 08:54 which means that theta i is b prime inverse of mu i. 09:02 OK. 09:02 So that's what we've got from this derivative of the log 09:06 likelihood equal to 0. 09:07 That give us this guy inverted. 09:09 And now I know that mu i is g inverse of Xi beta. 09:13 09:23 So this composition of b prime inverse and g inverse 09:27 is actually just the composition of g with b prime. 09:30 09:38 Everybody's comfortable with this notation, 09:40 the little circle? 09:42 Any question about this? 09:43 It just means that I first applied b prime. 09:45 Well, actually, it's D inverse. 09:47 But if I look at a function g composed with b prime, 09:49 I first applied the g b prime of x, is just g of b prime of x. 09:59 OK. 10:00 And then I take the inverse of this function, 10:02 which is first take g inverse, and then take b prime inverse. 10:05 10:09 OK. 10:09 So now I have everywhere I saw theta, 10:12 now I see this function of beta. 10:16 So I could technically plug that in. 10:18 Of course, it's a little painful to have 10:20 to write g circle beta prime all the time. 10:22 So I'm going to give this guy a name. 10:24 And so you're just going to define 10:29 h, which is g b prime inverse so that theta i is simply 10:39 h of Xi transpose beta. 10:42 OK. 10:43 I could give it a name, you know. 10:46 But let's just call that the h function. 10:47 10:52 And something which is nice about this h function 10:54 is that if g is the canonical link-- 11:01 11:04 what is the canonical link? 11:05 11:12 So what is it canonical to? 11:15 A canonical link, it's canonical to a particular distribution 11:19 in the canonical exponential family, right? 11:22 A canonical exponential family is completely characterized 11:26 by the function b. 11:28 Which means that if I want to talk about the canonical link, 11:31 all I need to tell you is how it depends on b. 11:35 So what is g as a function of b? 11:37 11:40 AUDIENCE: [INAUDIBLE] 11:41 PHILIPPE RIGOLLET: b inverse. 11:42 11:45 b prime inverse, right? 11:47 So this is g is equal to b prime inverse, which 11:52 means that if g is composed with b prime that means 11:56 that this is just the identity. 11:58 12:02 So h is the identity. 12:04 12:11 So h of Xi transpose beta is simply Xi transpose beta. 12:18 And it's true that the way we introduce the canonical link 12:22 was just the function for which we model directly theta i 12:26 as Xi transpose beta, which we can read off here, right? 12:30 So theta i is simply Xi transpose beta. 12:34 So now, for example, if I go back to my log-likelihood, 12:43 so if I look log-likelihood, the log-likelihood 12:51 is sum of the log of the densities. 
12:56 So it's sum from i equal 1 to n of log of exponential Yi theta 13:06 i minus b theta i divided by phi plus c of Yi phi. 13:14 So this term does not depend on theta. 13:19 So I have two things. 13:19 First of all, the log and the exponential 13:22 are going to cancel each other. 13:23 And second, I actually know that theta is just 13:28 a function of beta. 13:29 And it has this form. 13:30 Theta i is h of Xi transpose beta. 13:32 And that's my modeling assumption. 13:34 So this is actually equal to the sum from i equal 1 to n of Yi. 13:41 And then here I'm going to write h of Xi transpose beta 13:45 minus b of h of Xi transpose beta divided by phi. 13:54 And then I have, again, this function 13:56 c of Yi phi, which again won't matter. 14:00 Because when I'm going to try to maximize this thing, 14:03 this is just playing the role of a constant 14:05 that's shifting the entire function. 14:07 In particular, your max is going to be exactly what it was. 14:10 OK? 14:12 So this thing is really not going to matter for me. 14:14 I'm keeping track of it. 14:16 And actually, if you look here, it's gone, right? 14:20 It's gone, because it does not matter. 14:23 So let's just pretend it's not here, 14:26 because it won't matter when I'm trying 14:28 to maximize the likelihood. 14:31 OK? 14:33 While it's here up to constant term, it says. 14:36 That's the constant term. 14:37 14:40 All right, any question? 14:41 14:44 All I'm doing here is replacing my likelihood as a function 14:49 of theta i's. 14:50 So if I had one theta i per observation, 14:52 again, this would not help me very much. 14:56 But if I assume that they are all linked together 14:58 by saying that theta i is of the form Xi transpose beta 15:02 or h of Xi transpose beta if I'm not using the canonical link, 15:07 then I can hope to make some estimation. 15:10 And so, again, if I have the canonical link, 15:14 h is the identity. 15:15 So I'm left only with Yi times Xi transpose beta. 15:19 And then I have b of Xi transpose beta 15:21 and not b composed with h, because h is the identity, 15:26 which is fairly simple, right? 15:28 Why is it simple? 15:30 Well, let's actually focus on this guy for one second. 15:35 So let me write it down, so we know what we're talking about. 15:38 15:47 So we just showed that the log-likelihood when I use 15:52 the canonical link-- 15:54 so that h is equal to the identity, 15:57 the log-likelihood actually takes the form ln. 16:00 And it depends on a bunch of stuff. 16:01 But let's just make it depend only on the parameter 16:04 that we care about, which is beta, all right? 16:06 So this is of the form l of beta and that's equal to what? 16:09 It's the sum from i equal 1 to n of Yi Xi transpose beta minus-- 16:16 let me put the phi here. 16:18 And then I'm going to have minus b of Xi transpose beta. 16:24 16:28 OK. 16:29 And phi we know is some known positive term. 16:33 So again, optimizing a function plus some constant 16:37 or optimizing of function time as a constant, 16:39 that's not going to change much either. 16:41 So it won't really matter to think about whether this phi is 16:44 here or not. 16:45 But let's just think about what this function looks like. 16:48 I'm trying to maximize a function. 16:50 I'm trying to maximize a log-likelihood. 16:52 If it looked like this, that would be a serious problem. 16:57 But we can do like a basic, you know, back 16:59 of the envelope guess of what the variations of this function 17:03 is. 
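Written out, the log-likelihood just derived looks as follows, dropping the c(Yi, phi) terms that do not affect the maximization; the first display uses a general link through h = (g composed with b prime) inverse, the second uses the canonical link, for which h is the identity.

```latex
% General link, with h = (g \circ b')^{-1}, so that \theta_i = h(x_i^\top\beta):
\ell_n(\beta) \;=\; \sum_{i=1}^n \frac{Y_i\, h(x_i^\top\beta) \;-\; b\big(h(x_i^\top\beta)\big)}{\phi} \;+\; \text{const}.

% Canonical link (h = \mathrm{id}):
\ell_n(\beta) \;=\; \sum_{i=1}^n \frac{Y_i\, x_i^\top\beta \;-\; b(x_i^\top\beta)}{\phi} \;+\; \text{const}.
```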
17:04 This first term here is-- 17:08 as a function of beta, what kind of function is it? 17:11 AUDIENCE: Linear. 17:12 PHILIPPE RIGOLLET: It's linear, right? 17:13 This is just Xi transpose beta. 17:16 If I multiply beta by 2, I get twice. 17:18 If I add something to beta, it just gets added, 17:21 so it's a linear function of beta. 17:23 And so this thing is both convex and concave. 17:25 In the one-dimensional case-- so think about p 17:28 as being one-dimensional-- so if beta 17:29 is a one-dimensional thing, those 17:31 are just the function that looks like this, right? 17:34 Those are linear functions. 17:35 They are both convex and concave. 17:40 So this is not going to matter when 17:42 it comes to the convexity of my overall function, 17:44 because I'm just adding something which is just a line. 17:46 And so if I started with convex, it's going to stay convex. 17:49 If I started with concave, it's going to stay concave. 17:51 And if I started with something which is both, 17:53 it's going to stay both, meaning neither. 17:56 It cannot be both. 17:57 Yeah. 17:58 So if you're neither convex or concave, adding this linear-- 18:00 so this will not really matter. 18:02 If I want to understand when my function looks like, 18:04 I need to understand would b of Xi transpose beta does. 18:07 Begin, the Xi transpose beta-- 18:10 no impact. 18:10 It's a linear function. 18:12 In terms of convexity, it's not going to play any role. 18:15 So I really need to understand what my function b looks like. 18:18 What do we know about b again? 18:19 18:22 So we know that b prime of theta is equal to mu, right? 18:30 Well, the mean of a random variable 18:34 in a canonical exponential family 18:36 can be a positive or negative number. 18:37 This really does not tell me anything. 18:40 That can be really anything. 18:41 However, if I look at the second the derivative of b, 18:48 I know that this is what? 18:49 This is the variance of Y divided by phi. 18:59 That was my dispersion parameter. 19:00 The variance was equal to phi times b prime prime. 19:03 So we know that if theta is not degenerate, 19:06 meaning that the density does not 19:08 take value infinity at only one point, 19:10 this thing is actually positive. 19:12 And clearly, when you have something 19:14 that looks like this, unless you have some crazy stuff happening 19:16 with phi being equal to 0 or anything that's not normal, 19:20 then you will see that you're not degenerate. 19:22 So this think is strictly positive. 19:24 And we've said several times that if b prime prime is 19:27 positive, then that means that's the derivative of b prime, 19:32 meaning that b prime is increasing. 19:34 And b prime is increasing is just 19:36 the same thing as saying that b is convex, All right? 19:39 So that implies that b is strictly convex. 19:44 And the strictly comes from the fact 19:47 that this is a strict sign. 19:50 Well, I should not do that, because now it's no longer. 19:52 So it's just a strict sign, meaning 19:54 that the function, this is not strictly convex, 19:56 because it's linear. 19:57 Strictly convex means there's always 19:59 some curvature everywhere. 20:02 So now I have this thing that's linear minus something that's 20:05 convex. 20:07 Something that's negative, something convex, is concave. 20:10 So this thing is linear plus concave. 20:12 So it is concave. 20:13 So I know just by looking at this that ln of beta, which, 20:18 of course, is something that lives in Rp, 20:20 but if I saw it living in R1 it would look like this. 
20:23 And if I saw it living in R2, it would 20:25 look like a dome like this. 20:28 And the fact that it's strict is also 20:30 telling me that it is actually a unique maximizer. 20:34 So there's a unique maximizer in Xi transpose beta, 20:37 but not in beta necessarily. 20:38 We're going to need extra assumptions for this. 20:41 OK. 20:41 So this is what I say here. 20:44 The log-likelihood is strictly concave. 20:46 And so as a consequence, under extra assumptions on the Xi-- 20:49 and that's because, of course, if the Xi's are all the same, right? 20:55 So if the entries of Xi's-- 20:58 so if Xi is equal to 1, 1, 1, 1, 1, 21:01 then Xi transpose beta is just the sum of the betas. 21:05 And in the sum of the beta i's, it will be strictly concave, 21:10 but certainly not in the individual entries. 21:13 OK. 21:13 So I need extra things on my Xi, so that this happens, 21:17 just like we needed the matrix capital 21:19 X in the linear regression case to be of full rank, 21:23 so we could actually identify what beta was. 21:25 OK. 21:25 It's going to be exactly the same thing. 21:27 21:32 So here, this is when we have this very specific 21:35 parametrization. 21:36 And the question is-- 21:39 but it may not be the case if we change the parameter 21:41 beta into something else. 21:42 OK. 21:43 So here, the fact that we use the canonical link, et cetera, 21:46 everything actually works really to our advantage, 21:49 so that everything becomes strictly concave. 21:51 And we know exactly what's happening. 21:53 21:56 All right, so I understand I went a bit fast 21:58 on playing with convex and concave functions. 22:00 This is not the purpose. 22:02 You know, I could spend a lecture 22:04 telling you, oh, if I add two concave functions, 22:06 then the result remains concave. 22:08 If I add a concave and a strictly concave, 22:11 then the result still remains strictly concave. 22:13 And we could spend time proving this. 22:16 This was just for you to get an intuition 22:17 as to why this is correct. 22:19 But we don't really have time to go into too much detail. 22:22 One thing you can do-- 22:23 a strictly concave function, if it's in one dimension, 22:26 all I need to have is that the second derivative is strictly 22:32 negative, right? 22:32 That's a strictly concave function. 22:34 That was the analytic definition we had for strict concavity. 22:38 So if this was in one dimension, it 22:40 would look like this, Yi times Xi times beta. 22:45 Now, beta is just one number. 22:47 And then I would have minus b of Xi times beta. 22:52 And this is all over phi. 22:54 You take second derivatives. 22:56 The fact that this is linear in beta, this is going to go away. 22:59 And here, I'm just going to be left with minus-- 23:03 so if I take the second derivative with respect 23:07 to beta, this is going to be equal to minus b prime prime Xi 23:13 beta times Xi squared divided by phi. 23:18 So this Xi squared is clearly positive. 23:21 If Xi is 0, this is degenerate, so I would not get it. 23:25 Then I have the second derivative 23:26 of b, b prime prime, which I know is positive, 23:28 because of the variance thing that I have here, 23:30 divided by phi. 23:32 And so that would all be fine. 23:34 That's for one dimension. 23:35 If I wanted to do this in higher dimensions, 23:37 I would have to say that the Hessian is 23:39 a negative definite matrix. 23:41 And that's maybe a bit beyond what this course is. 23:44 23:50 So in the rest of this chapter, I 23:54 will do what I did not do when we talked 23:56 about maximum likelihood. 
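As a side note on the higher-dimensional statement, here is a sketch, in LaTeX, of the gradient and Hessian under the canonical link, matching the one-dimensional calculation above; the claim about the x_i spanning R^p is the "extra assumption on the Xi" mentioned in the lecture.

```latex
% Gradient and Hessian of the log-likelihood under the canonical link:
\nabla \ell_n(\beta) \;=\; \frac{1}{\phi}\sum_{i=1}^n \big(Y_i - b'(x_i^\top\beta)\big)\, x_i ,
\qquad
H_{\ell_n}(\beta) \;=\; -\frac{1}{\phi}\sum_{i=1}^n b''(x_i^\top\beta)\, x_i x_i^\top .

% Since b'' > 0 (it is a variance up to \phi), H_{\ell_n}(\beta) is negative semidefinite,
% and negative definite (strict concavity in \beta) as soon as the x_i span \mathbb{R}^p.
```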
23:58 And what we're going to do is we're 24:00 going to actually show how to do this maximization, right? 24:03 So here, we know that the function is concave. 24:06 But what it looks like specifically 24:08 depends on what b is. 24:11 And for different b's, I'm going to have different things to do, 24:15 Just like when I was talking about maximum likelihood 24:17 estimation, if it had a concave log-likelihood function, 24:22 I could optimize it. 24:23 But depending on what the function is, 24:24 I would actually need some algorithms 24:26 that may be working better on some functions than others. 24:29 Now, here I don't have random things. 24:32 I have the b is the cumulant generating 24:34 function of a canonical exponential family. 24:38 And there is a way for me to sort of leverage that. 24:42 So not only is there the b part, but there's also 24:45 the linear part. 24:46 And if I start trying to use that, 24:49 I'm actually going to be able to devise very specific 24:51 optimization algorithms. 24:53 And the way I'm going to be able to do this 24:54 is by thinking of simple black box 24:57 optimization to which I can actually technically feed 24:59 any function. 25:00 But it's going to turn out that the iterations 25:02 of this iterative algorithms are going 25:05 to look very familiar when we just 25:10 plug in the particular values of b, of the log-likelihood 25:13 that we have for this problem. 25:15 And so the three methods we're going to talk about going from 25:19 more black box-- meaning you can basically stuff in any function 25:22 that's going to work, any concave function that's going 25:24 to work, all the way to this is working specifically 25:27 for generalized linear models-- 25:29 are Newton-Raphson method. 25:31 Who's already heard about the Newton-Raphson method? 25:35 So there's probably some people actually learned this algorithm 25:40 without even knowing the word algorithm, right? 25:43 It's a function. 25:44 Typically, it's supposed to be finding roots of functions. 25:46 But finding the root of a function of the derivative 25:49 is the same as finding the minimum of a function. 25:53 So that's the first black box method. 25:55 I mean, it's pretty old. 25:57 And then there's something that's 25:59 very specific to what we're doing, which is called-- 26:01 so this Newton-Raphson method is going 26:03 to involve the Hessian of our log-likelihood. 26:06 And since we know something about the Hessian 26:09 for a particular problem, we're going to be able move one 26:11 to Fisher-scoring. 26:13 And the word Fisher here is actually exactly coming 26:15 from Fisher information. 26:17 So the Hessian is going to involve the Fisher information. 26:20 And finally, we will talk about iteratively re-weighted least 26:25 squares. 26:26 And that's not for any function. 26:28 It's really when we're trying to use 26:29 the fact that there is this linear dependence on the Xi. 26:35 And this is essentially going to tell us, well, you know, 26:37 you can use least squares for linear regression. 26:39 Here, you can use least squares, but locally, 26:41 and you have to iterate. 26:42 OK. 26:43 And this last part is essentially a trick 26:46 by statisticians to be able to solve 26:49 the Newton-Raphson updates without actually 26:52 having a dedicated software for this, 26:54 but just being able to reuse some least squares software. 26:59 OK. 26:59 So you know, we've talked about this many times. 27:02 I just want to make sure that we're all on the same page 27:05 here. 
27:07 We have a function f. 27:09 We're going to assume that it has two derivatives. 27:12 And it's a function from Rm to R. 27:14 So it's fist derivative is called gradient. 27:16 That's the vector that collects all the partial derivatives 27:20 with respect to each of the coordinates. 27:23 It's dimension m, of course. 27:25 And the second derivative is an m by m matrix. 27:28 It's called the Hessian. 27:29 And ith row and jth column, you see 27:34 the second partial derivative with respect 27:37 to the ith component and the jth component. 27:39 OK. 27:40 We've seen that several times. 27:41 This is just multi-variable calculus. 27:44 But really the point here is to maybe the notation 27:48 is slightly different, because I want to keep track of f. 27:51 So when I write the gradient, I write nabla sub f. 27:55 And when I write Hessian, I write nabla H sub f. 28:00 And as I said, if f is strictly concave, 28:03 then Hf of x is negative definite. 28:05 What it means is that if I take any x in Rm, then x 28:14 transpose Hf, well, that's for any X0 X, 28:20 this is actually strictly negative. 28:23 That's what it means to be negative definite. 28:25 OK? 28:26 So every time I do x transpose-- 28:30 so this is like a quadratic form. 28:31 And I want it to be negative for all values of X0 and X, 28:35 both of them. 28:36 That's very strong, clearly. 28:39 But for us, actually, this is what happens just 28:41 because of the properties of b. 28:42 28:45 Well, at least the fact that it's 28:48 negative, less than or equal to, if I 28:50 want it to be strictly less I need some properties on X. 28:54 And then I will call the Hessian map the function 28:56 that maps X to this matrix Hf of X. 29:02 So that's just the second derivative at x. 29:04 Yeah. 29:04 AUDIENCE: When you what are [INAUDIBLE]?? 29:09 PHILIPPE RIGOLLET: Where do [INAUDIBLE]?? 29:11 Oh, yeah. 29:13 I mean, you know, you need to be able to apply Schwarz lemma. 29:16 Let's say two continue derivatives that's smooth. 29:19 AUDIENCE: [INAUDIBLE] 29:21 PHILIPPE RIGOLLET: No, that's fine. 29:23 29:29 OK. 29:29 So how does the Newton-Raphson method work? 29:32 Well, what it does is that it forms a quadratic approximation 29:35 to your function. 29:37 And that's the one it optimizes at every single point. 29:39 OK. 29:40 And the reason is because we have 29:41 a closed-form solution to defining the minimum 29:44 of a quadratic function. 29:46 So if I give you a function that's 29:47 of the form ax squared plus b x plus c, 29:51 you know exactly a closed form for its minimum. 29:54 But if I give you any function or, let's say-- 29:57 yeah, yeah. 29:58 So here, it's all about maximum. 29:59 I'm sorry. 30:01 If you're confused with me using the word minimum, 30:03 just assume that it was the word maximum. 30:06 So this is how it works. 30:07 OK. 30:08 If I give you a function which is concave, that's quadratic. 30:14 OK. 30:14 So it's going to look like this. 30:15 30:19 So that's of the form ax squared-- 30:21 where a is negative, of course-- 30:23 plus bx plus c. 30:25 Then you can solve your whatever. 30:28 You can take the derivative of this guy, set it equal to 0, 30:31 and you will have an exact equation 30:33 into what the value of x is that it realizes this maximum. 30:37 If I give you any function that's concave, 30:42 that's all clear, right? 30:43 I mean, if I tell you the function that we have here 30:46 is that the form ax minus b of x, 30:50 then I'm just going to have something that inverts b prime. 
30:53 But how do I do that exactly? 30:56 It's not clear. 30:57 And so what we do is we do a quadratic approximation, which 31:00 should be true approximately everywhere, right? 31:03 So I'm at this point here, I'm going to say, 31:05 oh, I'm close to being that function. 31:08 And if I'm at this point here, I'm 31:10 going to be close to being that function. 31:12 And for this function, I can actually optimize. 31:14 And so if I'm not moving too far from one to the other, 31:17 I should actually get something. 31:19 So here's how the quadratic approximation works. 31:21 I'm going to write the second order Taylor expansion. 31:27 31:31 OK. 31:32 And so that's just going to be my quadratic approximation. 31:35 It's going to say, oh, f of x, when x is close 31:38 to some point x0, is going to close 31:40 to f of x0 plus the gradient of f at x0 transpose x minus x0. 31:48 And then I'm going to have plus 1/2x minus x0 transpose 31:53 Hf at 0x x minus x transpose, right-- 31:58 x minus x0. 31:59 So that's just my second order Taylor expansion 32:02 multi-variate 1. 32:03 And let's say x0 is this guy. 32:06 32:08 Now, what I'm going to do is say, OK, 32:13 if I wanted to set this derivative of this guy 32:15 equal to 0, I would just have to solve, 32:17 well, you know, f prime of x equals 0, 32:21 meaning that X has to be f prime inverse of 0. 32:26 And really apart from like being some notation manipulation, 32:29 this is really not helping me. 32:31 OK. 32:32 Because I don't know what f prime inverse of 0 32:35 is in many instances. 32:36 However if f has a very specific form which 32:39 is something that depends on x in a very specific way, 32:43 there's just a linear term and then a quadratic term, 32:45 then I can actually do something. 32:47 So let's forget about this approach. 32:49 And rather than minimizing f, let's just 32:51 minimize the right-hand side. 32:54 OK. 32:55 So sorry- maximize. 32:59 So maximize the right-hand side. 33:06 And so how do I get this? 33:08 Well, I just set the gradient equal to 0. 33:11 So what is the gradient? 33:12 The first term does not depend on x. 33:15 So that means that this is going to be 0 plus-- 33:21 what is the gradient of this thing, of the gradient of f 33:25 at x0 transpose x minus x0? 33:27 What is the gradient of this guy? 33:28 33:40 So I have a function of the form b transpose x. 33:43 What is the gradient of this thing? 33:46 AUDIENCE: [INAUDIBLE] 33:47 PHILIPPE RIGOLLET: I'm sorry? 33:49 AUDIENCE: [INAUDIBLE] 33:50 PHILIPPE RIGOLLET: I'm writing everything in two-column form, 33:52 right? 33:53 So it's just b. 33:55 OK. 33:55 So here, what is b? 33:57 Well, it's gradient of f at x0. 33:59 34:03 OK. 34:04 And this term here gradient of f at x0 transpose x0 34:07 is just a constant. 34:07 This thing is going away as well. 34:10 And then I'm looking at the derivative of this guy here. 34:13 And this is like a quadratic term. 34:15 It's like H times x minus x0 squared. 34:18 So when I'm going to take the derivative, 34:20 I'm going to have a factor 2 that's 34:22 going to pop out and cancel this one half. 34:23 And then I'm going to be left only 34:25 with this part times this part. 34:28 OK. 34:28 So that's plus Hf x minus x0. 34:36 OK. 34:36 So that's just a gradient. 34:39 And I want it to be equal to 0. 34:40 So I'm just going to solve this equal to 0. 34:44 OK? 34:46 So that means that if I want to find the minimum, 34:49 this is just going to be the x* that satisfies this. 
34:53 So that's actually equivalent to Hf times x* is equal to Hf x0 35:07 minus gradient f at x0. 35:14 Now, this is a much easier thing to solve. 35:16 What is this? 35:16 35:21 This is just a system of linear equations, right? 35:25 I just need to find the x* such that when I pre-multiply it 35:29 by a matrix I get this vector on the right-hand side. 35:33 This is just something of the form ax equals b. 35:37 And I have many ways I can do this. 35:39 I could do Gaussian elimination, or I 35:41 could use Spielman's fast Laplacian solvers 35:46 if I had some particular properties of H. I mean, 35:49 there's huge activity in terms of how to solve those systems. 35:54 But let's say I have some time. 35:56 It's not a huge problem. 35:58 I can actually just use linear algebra. 35:59 And linear algebra just tells me that x* is equal to Hf inverse 36:06 times this guy, which those two guys are going to cancel. 36:16 So this is actually equal to x0 minus Hf inverse gradient f 36:23 at x0. 36:26 And that's just what's called a Newton iteration. 36:28 I started at some x0. 36:31 I'm at some x0, where I make my approximation. 36:34 And it's telling me starting from this x0, 36:37 I wanted to fully optimize a quadratic approximation, 36:40 I would just have to take the x*. 36:43 That's this guy. 36:44 And then I could just use this guy as my x0 and do it again, 36:48 and again, and again, and again. 36:50 And those are called Newton iterations. 36:52 And they're basically the workhorse 36:54 of interior point methods, for example, a lot 36:57 of optimization algorithms. 36:59 And that's what you can see here. 37:01 x* is equal to x0 minus the inverse Hessian times 37:05 the gradient. 37:07 We briefly mentioned gradient descent. 37:10 We briefly mentioned gradient decent, at some point, 37:13 to optimize the convex function, right? 37:15 And if I wanted to use gradient descent, again, H is a matrix. 37:22 But if I wanted to think of H as being a scalar, 37:24 would it be a positive or negative number? 37:26 37:31 Yeah. 37:33 AUDIENCE: [INAUDIBLE] 37:34 PHILIPPE RIGOLLET: Why? 37:36 AUDIENCE: [INAUDIBLE] 37:39 PHILIPPE RIGOLLET: Yeah. 37:40 So that would be this. 37:42 So I want to move against the gradient to do what? 37:44 AUDIENCE: [INAUDIBLE] 37:45 PHILIPPE RIGOLLET: To minimize. 37:46 But I'm maximizing here, right? 37:47 Everything is maximized, right? 37:49 So I know that H is actually negative definite. 37:52 So it's a negative number. 37:55 So you have the same confusions they do. 37:59 We're maximizing a concave function here. 38:01 So H is negative. 38:02 So this is something of the form x0 plus something 38:07 times the gradient. 38:10 And this is what your gradient ascent, rather than descent, 38:13 would look like. 38:15 And all it's saying, Newton is telling 38:17 you don't take the gradient for granted 38:19 as a direction in which you want to go. 38:21 It says, do a slight change of coordinates before you 38:25 do this according to what your Hessian looks like, all right? 38:30 And those are called second order methods that require 38:34 knowing what the Hessian is. 38:36 But those are actually much more powerful 38:39 than the gradient descent, because they're 38:41 using all of the local geometry of the problem. 38:43 All of the local geometry of your function 38:45 is completely encoded in this Hessian. 38:48 And in particular it implies that it tells you 38:50 where to switch and not to go slower in some places 38:53 or go faster in other places. 
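Summarizing the derivation just carried out on the board, in LaTeX:

```latex
% Quadratic (second-order Taylor) approximation of f around x_0:
f(x) \;\approx\; f(x_0) + \nabla f(x_0)^\top (x - x_0)
      + \tfrac{1}{2}(x - x_0)^\top H_f(x_0)\,(x - x_0).

% Setting the gradient of the right-hand side to zero:
\nabla f(x_0) + H_f(x_0)\,(x^* - x_0) = 0
\;\;\Longleftrightarrow\;\;
x^* \;=\; x_0 - H_f(x_0)^{-1}\,\nabla f(x_0).
```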
38:55 Now, this in practice for, say, modern large scale 38:58 machine learning problems, inverting this matrix H 39:02 is extremely painful. 39:03 It takes too much time. 39:05 The matrix is too big, and computers cannot do it. 39:08 And people resort to what's called 39:10 pseudo-Newton method, which essentially tries 39:13 to emulate what this guy is. 39:15 And there's many ways you can do this. 39:17 Some of them are by using gradients 39:20 that you've collected in the past. 39:23 Some of them just say, well, let's just pretend H 39:27 is diagonal. 39:29 There's a lot of things you can do 39:30 to just play around with this and not actually have 39:32 to invert this matrix. 39:34 OK? 39:36 So once you have this, you started from x0. 39:38 It tells you which x* you can get as a maximizer of the local 39:44 quadratic approximation to your function. 39:47 You can actually just iterate that, all right? 39:49 So you start at some x0 somewhere. 39:52 And then once you get to some xk, 39:55 you just do the iteration which is described, 39:57 which is just find x k plus 1, which 39:59 is the maximizer of the local quadratic approximation 40:03 to your function at xk and repeat until convergence. 40:09 OK. 40:09 So if this was an optimization class, 40:13 we would prove that convergence actually, eventually, 40:16 happens for a strictly concave function. 40:20 This is a stats class, so you're just 40:22 going to have to trust me that this is the case. 40:24 And it's globally convergent, meaning 40:26 that you can start wherever you want, and it's 40:30 going to work under minor conditions on f. 40:35 And in particular, those conditions 40:36 are satisfied for the log-likelihood functions 40:40 we have in mind. 40:41 OK. 40:42 And it converges at an extremely fast rate. 40:44 Usually it's quadratic convergence, 40:46 which means that every time you make one step, 40:49 you roughly double the number of accurate digits of your solution. 40:52 40:56 If that's something you're vaguely interested in, 40:58 I highly recommend that you take a class 41:01 on optimization. 41:02 It's a fascinating topic. 41:04 Unfortunately, we don't have much time, 41:05 but it starts being more and more intertwined 41:08 with high dimensional statistics and machine learning. 41:11 41:14 I mean, it's an algorithms class, typically. 41:17 But it's very much more principled. 41:22 It's not a bunch of algorithms that solve a bunch of problems. 41:24 There's basically one basic idea, 41:26 which is if I have a convex function, 41:28 I can actually minimize it. 41:30 If I have a concave function, I can maximize it. 41:32 And it revolves around a similar thing. 41:35 So let's stare at this iterative step for a second and pause. 41:40 And let me know if you have any questions. 41:43 41:46 OK. 41:47 So, of course, in a second we will plug in 41:50 for the log-likelihood. 41:51 This is just a general thing for a general function f. 41:54 But then in a second, f is going to be ln. 41:57 OK. 41:58 So if I wanted to implement that for real, 42:00 I would have to compute the gradient of ln at a point xk. 42:04 And I would have to compute the Hessian at a given point 42:06 and invert it. 42:08 OK. 42:08 So this is just the basic algorithm. 42:11 And this, as you can tell, used in no place 42:14 the fact that ln was the log-likelihood associated 42:17 to some canonical exponential family 42:19 in a generalized linear model. 42:21 This never showed up. 42:23 So can we use that somehow? 
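Before the lecture turns to exploiting the GLM structure, here is a minimal sketch of the generic Newton-Raphson iterations just described, for maximizing a user-supplied strictly concave function. The function names (grad, hess, newton_raphson), the tolerance, and the stopping rule are illustrative choices, not part of the lecture; in this chapter, grad and hess would be the gradient and Hessian of the log-likelihood evaluated at the current beta.

```python
import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-8, max_iter=100):
    """Maximize a strictly concave function given its gradient and Hessian.

    grad(x) returns the gradient (shape (m,)); hess(x) returns the Hessian
    (shape (m, m)).  Each step solves the linear system H(x_k) d = -grad(x_k)
    rather than inverting the Hessian explicitly.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stop when the gradient is ~0
            break
        step = np.linalg.solve(hess(x), -g)  # Newton direction
        x = x + step                          # x_{k+1} = x_k - H^{-1} grad
    return x
```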
42:26 Optimization for the longest time was about making your problems 42:28 as general as possible, culminating maybe 42:31 in the interior point method theory and conic programming 42:34 in the mid-'90s. 42:35 And now what optimization is doing 42:37 is that it's [INAUDIBLE] very general. 42:38 It says, OK, if I want to start to go fast, 42:40 I need to exploit as much structure about my problem 42:42 as I can. 42:43 And the beauty is that as statisticians 42:45 or machine learning people, we 42:47 do have a bunch of very specific problems 42:49 that we want optimizers to solve. 42:50 And they can make things run much faster. 42:53 But this did not require waiting until the 21st century. 42:56 Problems with very specific structure 42:58 arose already in this generalized linear model. 43:01 So what do we know? 43:03 Well, we know that this log-likelihood is really 43:07 one thing that comes when we're trying 43:09 to replace an expectation by an average, 43:12 and then doing something fancy, right? 43:14 That was our statistical hammer. 43:16 And remember when we introduced likelihood maximization we just 43:19 said, what we really want to do 43:21 is to minimize the KL, right? 43:23 That's the thing we wanted to minimize, 43:25 the KL divergence between two distributions, the true one 43:29 and the one that's parameterized by some unknown theta. 43:32 And we're trying to minimize that over theta. 43:34 And we said, well, I don't know what 43:36 this is, because it's an expectation with respect 43:38 to some unknown distribution. 43:40 So let me just replace the expectation with respect 43:42 to my unknown distribution by an average over my data points. 43:46 And that's how we justified the existence 43:49 of the log-likelihood maximization problem. 43:54 But here, actually, I might be able to compute 44:00 this expectation, at least partially where I need it. 44:04 And what we're going to do is we're going to say, OK, 44:07 since at a given point xk, say, let me call it here theta, 44:12 I'm trying to find the inverse of the Hessian 44:16 of my log-likelihood, right? 44:17 So if you look at the previous one, as I said, 44:19 we're going to have to compute the Hessian H sub l n of xk, 44:23 and then invert it. 44:24 But let's forget about the inversion step for a second. 44:27 We have to compute the Hessian. 44:30 This is the Hessian of the function 44:31 we're trying to maximize. 44:32 But if I could actually replace it, 44:34 not by the function I'm trying to minimize-- to maximize, 44:37 sorry-- the log-likelihood, but really 44:38 by the function I wish I was actually minimizing, 44:42 which is the KL, right? 44:45 Then that would be really nice. 44:46 And what happens is that since I'm actually 44:48 trying to find this at a given xk, 44:51 I can always pretend that this xk that I have 44:53 in my current iteration is the true one 44:55 and compute my expectation with respect to that guy. 44:58 And what happens is that I know that when 45:00 I compute the expectation of the Hessian of the log-likelihood 45:05 at a given theta and when I take the expectation with respect 45:08 to the same theta, what I get out 45:10 is negative Fisher information. 45:15 The Fisher information was defined in two ways-- 45:18 as the expectation of the square of the derivative or negative 45:25 of the expectation of the second derivative 45:27 of the log-likelihood. 45:30 And so now, I'm doing some sort of a leap of faith here. 
45:35 Because there's no way the theta, which is the current xk, 45:40 that's the current theta at which I'm actually 45:42 doing this optimization-- 45:44 I'm actually pretending that this is the right one. 45:46 45:48 But what's going to change by doing this 45:51 is that it's going to make my life easier. 45:52 Because when I take expectations, 45:56 we'll see that when we look at the Hessian, 46:00 the Hessian as essentially the derivative of, say, a product 46:07 is going to be the sum of two terms, right? 46:09 The derivative of u times v is u prime v plus uv prime. 46:13 One of those two terms is actually 46:14 going to have expectation 0. 46:16 And that's going to make my life very easy when 46:18 I take expectations and basically just 46:20 have one term that's going to go away. 46:22 And so in particular, my formula, 46:24 just by the virtue of taking this expectation 46:27 before inverting the Hessian, is going 46:29 to just shrink the size of my formulas by half. 46:33 OK. 46:34 So let's see how this works. 46:35 You don't have to believe me. 46:37 Is there any question about this slide? 46:39 You guys remember when we were doing 46:41 maximum estimation and Fisher information 46:43 and the KL divergence, et cetera? 46:47 Yeah. 46:48 AUDIENCE: [INAUDIBLE] 46:51 PHILIPPE RIGOLLET: Because that's what we're really 46:53 trying to minimize. 46:55 AUDIENCE: [INAUDIBLE] 46:59 PHILIPPE RIGOLLET: Yeah. 47:00 So there's something you need to trust me 47:05 with, which is that the expectation of H of ln 47:08 is actually H of the expectation of ln, all right? 47:14 Yeah, it's true, right? 47:16 Because taking derivative is a linear operator. 47:20 And we actually used that several times when 47:22 we said expectation of partial of l with respect to theta 47:30 is equal to 0. 47:31 Remember we did that? 47:32 That's basically what we used, right? 47:34 47:39 AUDIENCE: ln is the likelihood. 47:40 PHILIPPE RIGOLLET: It's the log-likelihood. 47:42 AUDIENCE: Log-likelihood, [INAUDIBLE] OK. 47:45 When we did Fisher [INAUDIBLE],, we 47:47 did the likelihood of [INAUDIBLE] observation. 47:49 PHILIPPE RIGOLLET: Yeah. 47:51 AUDIENCE: Why is it ln in this case? 47:54 PHILIPPE RIGOLLET: So actually, ln is typically not normalized. 48:01 So I really should talk about ln over n. 48:04 OK. 48:04 But let's see that, OK? 48:05 So if I have IID observations, that should be pretty obvious. 48:10 OK. 48:11 So if I have IID x1, xn, with density f theta and if I look 48:22 at log f theta of Xi, sum from i equal 1 to n, as I said, 48:28 I need to actually have a 1 over n here. 48:30 When I look at the expectation, they all have 48:32 the same expectation, right? 48:34 So this is actually, indeed, equal to negative KL 48:41 plus a constant. 48:42 OK? 48:43 And negative KL is because this-- 48:45 sorry, if I look at the expectation. 48:46 So the expectation of this guy is just the expectation of one 48:50 of them, all right? 48:56 So I just do expectation theta. 48:57 OK? 48:59 Agree? 49:02 Remember, the KL was expectation theta log f theta divided by f. 49:09 So that's between p theta and p theta prime. 49:11 49:14 Well, no, sorry. 49:15 That's the true p. 49:17 And let's call it f. 49:19 p theta, right? 49:20 So that's what showed up, which is, indeed, 49:23 equal to minus expectation theta log 49:27 f theta plus log of f, which is just a constant with respect 49:33 to theta. 49:34 It's just the thing that's up doesn't matter. 49:36 OK? 49:38 So this is what shows up here. 
49:40 And just the fact that I have this 1 over n 49:42 doesn't change, because they're IID. 49:44 Now, when I have things that are not IID-- because what I really 49:47 had was Y1 Yn, and Yi at density f theta i, which is just 49:54 the conditional density given Xi, 49:56 then I could still write this. 49:59 And now when I look at the expectation of this guy, what 50:02 I'm going to be left with is just 1 over n sum from i 50:08 equal 1 to n of the expectation of log f theta i of Yi. 50:17 50:19 And it's basically the same thing, except that I have a 1 50:22 over n expectation in front. 50:24 And I didn't tell you this, because I only showed you 50:26 what the KL divergence was for between two distributions. 50:32 But here, I'm telling you what the KL 50:35 is between two products of distributions 50:39 that are independent, but not necessarily 50:40 identically distributed. 50:42 50:48 But that's what's going to show up, just because it's 50:50 a product of things. 50:51 So when you have the log, it's just going to be a sum. 50:53 50:58 Other questions? 51:01 All right, so what do we do here? 51:05 Well, as I said, now we know that the expectation of H 51:08 is negative Fisher information. 51:10 So rather than putting H inverse in my iterates 51:13 for Newton-Raphson, I'm just going to put the inverse Fisher 51:19 information. 51:20 And remember, it had a minus sign in front. 51:23 So I'm just going to pick up a plus sign now, 51:25 just because i is negative, the expectation of the Hessian. 51:31 And this guy has, essentially, the same convergence 51:33 properties. 51:35 And it just happens that it's easier 51:37 to compute the i than H Ln. 51:39 And that's it. 51:40 That's really why you want to do this. 51:43 Now, you might say that, well, if I use more information, 51:50 I should do better, right? 51:51 But it's actually not necessarily 51:53 true for several reasons. 51:54 But let's say that one is probably the fact that I 51:56 did not use more information. 51:58 Every step when I was computing this thing at xk, 52:02 I actually pretended that at theta k 52:04 the true distribution was the one distributed 52:07 according to theta k. 52:09 And that was not true. 52:10 This is only true when theta k becomes 52:12 close to the true theta. 52:13 52:14 And so in a way, what I gained I lost again 52:18 by making this thing. 52:20 It's just really a matter of simple computation. 52:23 So let's just see it on a particular example. 52:25 Actually, in this example, it's not going to look much simpler. 52:28 It's actually going to be the same. 52:30 All right, so I'm going to have the Bernoulli example. 52:35 All right, so we know that Bernoulli 52:36 belongs to the canonical exponential family. 52:41 And essentially, all I need to tell you what b is. 52:46 And b of theta for Bernoulli is log 1 plus e theta, right? 52:53 We computed that. 52:56 OK. 52:57 And so when I look at my log-likelihood, 53:02 it is going to look like the sum from i equal 1 to n of Yi of-- 53:10 OK, so here I'm going to actually use 53:11 the canonical link. 53:12 So it's going to be Xi transpose beta 53:15 minus log 1 plus exponential Xi transpose beta. 53:20 53:24 And phi for this guy is equal to 1. 53:27 Is it clear for everyone what I did? 53:29 OK. 53:29 So remember the density, so that was really just-- 53:35 so the PMF was exponential Y theta minus log 1 plus e theta. 53:45 There was actually no normalization. 53:47 That's just the density of a Bernoulli. 53:51 And the theta is actually log p over 1 minus p. 
54:00 And so that's what actually gives me what my-- 54:03 since p is the expectation, this is actually 54:05 giving me also my canonical link, 54:06 which is the logit link. We saw that last time. 54:09 And so if I start taking the log of this guy and summing over n 54:12 and replacing theta by Xi transpose beta, which 54:16 is what the canonical link tells me to do, I get this guy. 54:19 Is that clear for everyone? 54:21 If it's not, please redo this step on your own. 54:25 54:27 OK. 54:28 54:31 So I want to maximize this function. 54:33 Sorry. 54:34 So I want to maximize this function over there 54:36 on the first line as a function of beta. 54:39 And so to do this, I want to use either Newton-Raphson 54:44 or what I call Fisher-scoring. 54:45 So Fisher-scoring is the second one, 54:47 when you replace the Hessian by negative Fisher information. 54:51 So I replace these two things. 54:53 And so I first take the gradient. 54:55 OK. 54:55 So let's take the gradient of ln. 54:57 55:04 So the gradient of ln is going to be, well, sum-- 55:08 so here, this is of the form Yi, which 55:11 is a scalar, times Xi transpose beta. 55:14 That's what I erased from here. 55:18 The gradient of b transpose x is just b. 55:20 So here, I have just Yi Xi. 55:23 So that's of the form Yi, which is a scalar, times Xi, 55:27 which is a vector. 55:30 Now, what about this guy? 55:31 Well, here I have a function. 55:32 So I'm going to have just the usual rule, the chain rule, 55:34 right? 55:35 So that's just going to be 1 over this guy. 55:38 55:40 And then I need to find the derivative of this thing. 55:43 So the 1 is going away. 55:44 And then I apply the chain rule again. 55:46 So I get e of Xi transpose beta, and then 55:49 the gradient of this thing, which is Xi. 55:53 55:57 So my Hessian-- my gradient, sorry, 56:00 I can actually factor out all my Xi's. 56:02 And it's going to look like this. 56:03 56:15 My gradient is a weighted average or weighted sum 56:21 of the Xi's. 56:22 56:25 This will always happen when you have 56:30 a generalized linear model. 56:32 And that's pretty clear. 56:33 Where did the Xi show up? 56:35 Whether it's from this guy or that guy, 56:37 the Xi came from the fact that when 56:39 I take the gradient of Xi transpose beta, 56:41 I have this vector Xi that comes out. 56:43 It's always going to be the thing that comes out. 56:46 So I will always have something that looks like some sum 56:49 with some weights here of the Xi's. 56:53 Now, when I look at the second derivative-- 56:55 57:04 so same thing, I'm just going to take the derivative of this guy. 57:08 Since nothing depends on beta here or here, 57:11 I'm just going to have to take the derivative of this thing. 57:13 And so it's going to be equal. 57:15 So if I look now at the Hessian of ln as a function of beta, 57:21 I'm going to have sum from i equal 1 to n of, well, Yi-- 57:25 what is the derivative of Yi with respect to beta? 57:28 57:31 AUDIENCE: [INAUDIBLE] 57:32 PHILIPPE RIGOLLET: What? 57:34 AUDIENCE: [INAUDIBLE] 57:37 PHILIPPE RIGOLLET: Yeah. 57:39 0. 57:39 OK? 57:40 It doesn't depend on beta. 57:41 I mean, its distribution does. 57:42 But Y itself is just a number, right? 57:46 So this is 0. 57:47 So I'm going to get the minus. 57:49 And then I'm going to have, again, the chain 57:51 rule that shows up. 57:52 So I need to find the derivative of x over 1 plus x. 57:56 What is the derivative of x over 1 plus x? 57:59 I actually don't even know. 58:00 58:03 So that gives me-- 58:04 58:12 OK. 
58:12 So that's 1 over 1 plus x squared. 58:15 So that's minus e Xi transpose beta-- 58:19 sorry, 1 divided by 1 plus e Xi transpose beta squared times 58:27 the derivative of the exponential, which is e Xi 58:31 transpose beta and again, Xi. 58:35 58:38 And then I have this Xi that shows up. 58:39 But since I'm looking for a matrix, 58:41 I'm going to have Xi, Xi transpose, right? 58:42 58:53 OK? 58:55 AUDIENCE: [INAUDIBLE] 58:58 PHILIPPE RIGOLLET: So I know I'm going to need something that 59:02 looks like a matrix in the end. 59:04 And so one way you want to think about it is this 59:07 is going to spit out an Xi. 59:09 There's already an Xi here. 59:11 So I'm going to have something that looks like Xi. 59:14 And I'm going to have to multiply by it another vector 59:15 Xi. 59:16 And I want it to form a matrix. 59:18 And so what you need to do is to take an outer product. 59:22 And that's it. 59:23 59:36 So now as a result, the updating rule is this. 59:42 Honestly, this is not a result of anything. 59:44 I actually rewrote everything that I had before with a theta 59:47 replaced by beta, because it's just 59:50 painful to rewrite this entire thing, put some big parenthesis 59:54 and put minus 1 here. 59:55 59:58 And then I would have to put the gradient, which 60:00 is this thing here. 60:03 So as you can imagine, this is not super nice. 60:07 Actually, what's interesting is at some point 60:11 I mentioned there's a pseudo-Newton method. 60:15 They're actually doing exactly this. 60:16 They're saying, oh, at each iteration, 60:22 I'm actually going to just take those guys. 60:24 If I'm at iteration k, I'm actually 60:26 just going to sum those guys up to k 60:28 rather than going all the way to n and look at every one. 60:30 So you're just looking at your observations one at a time 60:33 based on where you were before. 60:35 OK. 60:36 So you have a matrix. 60:38 You need to invert it. 60:39 So if you want to be able to invert it, 60:41 you need to make sure that the sum 60:43 with those weights of Xi outer, Xi, or Xi, Xi transpose 60:46 is invertible. 60:47 So that's a condition that you need to have. 60:50 And well, you don't have to, because technically you 60:53 don't need to invert. 60:54 You just need to solve the linear system. 60:57 But that's actually guaranteed in most of the cases 61:00 if n is large enough. 61:03 All right, so everybody sees what we're doing here? 61:05 OK. 61:06 So that's for the Newton-Raphson. 61:11 If I wanted to actually do the Fisher-scoring, 61:17 all I would need to do is to replace the Hessian here 61:22 by its expectation when I pretend that the beta have, 61:25 iteration k, is the true one. 61:28 What is the expectation of this thing? 61:31 And when I say expectation here, I'm 61:34 always talking about conditional expectation of Y given X. 61:37 The only distributions that matter, that have mattered 61:40 in this entire chapter, are conditional expectation of Y 61:44 given X. 61:45 The conditional expectation of this thing given X is what? 61:50 61:55 It's itself. 61:57 It does not depend on Y. It only depends on the X's. 61:59 So conditionally on x, this thing 62:01 as far as we're concerned, is completely deterministic. 62:04 So it's actually equal to it's expectation. 62:07 And so in this particular example, 62:11 there's no difference between Fisher-scoring 62:16 and Newton-Raphson. 62:19 And the reason is because the gradient no longer 62:24 depends on Yi-- 62:26 I'm sorry. 62:26 The Hessian no longer depends on Yi. 62:30 OK? 
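Here is a minimal sketch, in Python, of the updating rule just written for the Bernoulli example with the canonical (logit) link, where Newton-Raphson and Fisher-scoring coincide. The function name, the fixed number of iterations, and the absence of any step-size safeguard are illustrative choices, not part of the lecture.

```python
import numpy as np

def logistic_newton(X, y, n_iter=25):
    """Newton-Raphson (= Fisher scoring here) for logistic regression.

    X has shape (n, p), y has entries in {0, 1}.  With the canonical link,
      grad = sum_i (y_i - mu_i) x_i,   mu_i = e^{x_i'beta} / (1 + e^{x_i'beta})
      Hess = -sum_i mu_i (1 - mu_i) x_i x_i'
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))    # b'(x_i' beta)
        grad = X.T @ (y - mu)
        W = mu * (1.0 - mu)                      # b''(x_i' beta), the weights
        hess = -(X.T * W) @ X                    # -sum_i W_i x_i x_i'
        # beta_{k+1} = beta_k - H^{-1} grad; requires sum_i W_i x_i x_i'
        # to be invertible, as noted above
        beta = beta - np.linalg.solve(hess, grad)
    return beta
```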
62:30 62:41 This slide is just repeating some stuff that I've said. 62:44 62:49 OK. 62:49 62:54 So I think this is probably-- 62:55 62:58 OK, let's go through this actually. 63:00 63:04 At some point, I said that Newton-Raphson-- 63:07 do you have a question? 63:08 AUDIENCE: Yeah. 63:08 When would the gradient-- sorry, the Hessian ever depend on Yi? 63:12 Because it seems like Yi is just-- or at least 63:15 when you have a canonical link, that the log-likelihood is just 63:22 [INAUDIBLE] to Yi Xi [INAUDIBLE] theta and that's the only place 63:27 Y shows up. 63:28 So [INAUDIBLE] derivative [INAUDIBLE] never depend on Y? 63:30 PHILIPPE RIGOLLET: Not when you have a canonical link. 63:33 AUDIENCE: So if it's not a [INAUDIBLE] there's is no 63:35 difference between-- 63:35 PHILIPPE RIGOLLET: No. 63:36 AUDIENCE: OK. 63:37 PHILIPPE RIGOLLET: Yeah. 63:38 So yeah, maybe I wanted you to figure that out for yourself. 63:45 OK. 63:45 So Yi times Xi transpose beta. 63:49 So essentially, when I have a general family, what 63:55 he's referring to is that this is just b of Xi transpose beta. 64:00 So I'm going to take some derivatives. 64:01 And there's going to be something 64:02 complicated coming out of this. 64:04 But I'm certainly not going to have some Yi showing up. 64:06 The only place where Yi shows up is here. 64:09 Now, if I take two derivatives, this thing 64:11 is gone, because it's linear. 64:13 The first one is going to keep on like this guy. 64:15 And the second one is going to make it go on. 64:17 The only way this actually shows up is when to have an H here. 64:21 And if I have an H, then I can take second derivatives. 64:24 And this thing is not going to be 64:26 completely independent of beta. 64:30 Sorry. 64:31 Yeah, this thing is still going to depend 64:33 on beta, which means that this Yi term is not 64:35 going to disappear. 64:36 64:39 I believe we'll see an example of that, or maybe I removed it. 64:41 I'm not sure, actually. 64:42 I think we will see an example. 64:44 64:49 So let us do a Iteratively Re-weighted Least 64:55 Squares, or IRLS, which I've actually recently learned 65:00 is a term that even though it was defined in the '50s, 65:03 people still feel free to use to define 65:06 to call their new algorithms which have nothing 65:08 to do with this. 65:09 This is really something where you actually do iteratively 65:12 re-weighted least squares. 65:13 65:18 OK. 65:18 Let's just actually go through this quickly 65:20 what is going to be iteratively re-weighted least squares. 65:23 The way the steps that we had here showed up-- 65:28 let's say those guys, x*-- 65:32 is this, is when were actually solving this linear system, 65:37 right? 65:37 That was the linear system we were trying to solve. 65:43 But solving a linear system can be 65:45 done by just trying to minimize, right? 65:48 If I have x a and b, it's the same 65:50 as minimizing the norm of ax minus b squared over x. 65:58 If I can actually find an x for which it's 0, 66:01 it means that I've actually solved my problem. 66:04 And so that means that I can solve linear systems by solving 66:10 least square problems. 66:11 And least square problems are things that statisticians 66:14 are comfortable solving. 66:15 And so all I have to do is to rephrase this 66:18 as at least square problem. 66:19 OK? 66:20 And you know, I could just write it directly like this. 66:23 But there's a way to streamline it a little bit. 66:25 And that's actually by using weights. 66:30 OK. 
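The equivalence invoked a moment ago, written out in LaTeX: solving a linear system can be cast as a least squares problem.

```latex
% Solving A x = b (A invertible) is the same as solving a least squares problem:
\hat{x} \;=\; \operatorname*{argmin}_{x} \;\|A x - b\|_2^2 ,
\qquad \text{and the minimum value is } 0 \text{ exactly when } A\hat{x} = b .
```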
66:30 So I've come in the weights-- 66:32 well, not today, actually, but very soon, all right? 66:37 So this is just a reminder of what we had. 66:39 We have that's Yi give Xi as a distribution distributed 66:43 according to some distribution in the canonical exponential 66:47 family. 66:48 So that means that the log-likelihood looks like this. 66:50 Again, this does not matter to us. 66:52 This is the form that matters. 66:54 And we have a bunch of relationships 66:55 that we actually spent some time computing. 66:58 The first one is that mu is b prime of theta i. 67:01 The second one is that if I take g of mu i, 67:04 I get this systematic component, Xi transpose beta that's 67:08 modeling. 67:09 Now, if I look at the derivative mu i with respect to theta i, 67:13 this is the derivative of b prime of theta i 67:15 with respect to theta i. 67:16 So that's the second derivative. 67:18 And I'm going to call it Vi. 67:20 If phi is equal to 1, this is actually the variance. 67:24 And then I have this function H, which 67:27 allows me to bypass altogether the existence of this parameter 67:30 mu, which says if I want to go from Xi transpose beta 67:33 all the way to theta i, I have to first do g inverse, 67:37 and then b prime inverse. 67:39 If I stopped here, I would just have mu. 67:41 OK? 67:42 67:46 OK. 67:46 So now what I'm going to do is I'm 67:48 going to apply the chain rule. 67:50 And I'm going to try to compute the derivative 67:53 of my log-likelihood with respect to beta. 67:56 So, again, the log-likelihood is much nicer 68:00 when I read it as a function of theta than a function of beta, 68:03 but it's basically what we've been doing by hand. 68:05 You can write it as a derivative with respect 68:08 to theta first, and then multiply by the derivative 68:11 of theta with respect to beta. 68:13 OK. 68:13 And we know that theta depends on beta 68:16 as H of Xi transpose beta. 68:20 OK? 68:20 I mean, that's basically what we've been 68:22 doing for the Bernoulli case. 68:26 I mean, we used the chain rule without actually saying it. 68:28 But this is going to be convenient to actually make 68:30 it explicitly show up. 68:32 OK. 68:32 So when I first take the derivative of my log-likelihood 68:35 with respect to theta, I'm going to use the fact 68:39 that my canonical family is super simple. 68:42 OK. 68:42 So what I have is that my log-likelihood ln 68:50 is the sum from i equal 1 to n of Yi theta i minus b 68:56 of theta i divided by phi plus some constant, which 69:00 will go away as soon as I'm going 69:01 to take my first derivative. 69:03 So if I take the derivative with respect to theta i of this guy, 69:08 this is actually going to be equal to Yi minus b prime theta 69:12 i divided by phi. 69:16 And then I need to multiply by the derivative of theta i 69:20 with respect to beta. 69:21 Remember, theta is H of Xi transpose beta. 69:26 So the derivative of theta i with respect to beta j, 69:31 this is equal to H prime of Xi transpose beta. 69:36 And then I have the derivative of this guy. 69:40 Actually, let me just do the gradient of theta I 69:46 at beta, right? 69:48 That's what we did. 69:50 I'm just thinking of theta i as being a function of theta. 69:53 So what should I add here? 69:55 It was just the vector Xi, which is just the chain rule again. 70:01 That's Hi prime, right? 70:02 You don't see it, but there's a prime here that's derivative. 70:06 OK. 70:06 We've done that without saying it explicitly. 
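For reference, the chain-rule computation in this passage can be collected in a single display, writing h for the function the lecture calls H, so that theta_i = h(X_i^T beta) and mu_i = b'(theta_i):

```latex
% Chain rule for the gradient of the log-likelihood in beta
\frac{\partial \ell_n}{\partial \theta_i} = \frac{Y_i - b'(\theta_i)}{\phi},
\qquad
\nabla_\beta \theta_i = h'(X_i^\top \beta)\, X_i
\quad\Longrightarrow\quad
\nabla_\beta \ell_n(\beta)
  = \frac{1}{\phi}\sum_{i=1}^n \bigl(Y_i - \mu_i\bigr)\, h'(X_i^\top \beta)\, X_i .
```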
70:10 So now if I multiply those two things 70:11 I have this Yi minus b prime of the theta 70:14 i, which I call by its good name, which is mu i. 70:19 b prime of theta i is the expectation 70:20 of Yi conditionally on Xi. 70:22 And then I multiply by this thing here. 70:24 So here, this thing is written coordinate by coordinate. 70:28 But I can write it as a big vector 70:30 when I stack them together. 70:33 And so what I claim is that this thing here 70:36 is of the form Y minus mu. 70:39 But here I put some tildes. 70:40 Because what I did is that first I multiplied everything 70:46 by g prime of mu for each mu. 70:50 OK. 70:50 So why not? 70:52 OK. 70:54 Actually, on this slide it will make no sense why I do this. 70:58 I basically multiply by g prime on one side 71:01 and divide by g prime on the other side. 71:04 So what I write so far is that the gradient of ln with respect 71:12 to beta is the sum from i equal 1 to n of Yi minus mu i, 71:19 let's call it, divide by phi times 71:23 H prime of Xi transpose beta Xi. 71:28 OK. 71:29 So I just stacked everything that's here. 71:31 And now I'm going to start calling things. 71:33 The first thing I'm going to do is I'm going to divide. 71:36 So this guy here I'm going to push here. 71:38 71:40 Now, this guy here I'm actually going 71:42 to multiply by g prime of mu i. 71:48 And this guy I'm going to divide by g prime of mu i. 71:51 So there's really nothing that happened here. 71:54 I just took g prime and multiply and divide it by g prime. 71:59 Why do I do this? 72:00 Well, that's actually going to be clear 72:02 when we talk about iteratively re-weighted least squares. 72:06 But now, essentially I have a new mu, a Y which is-- 72:11 so this thing now is going to be Y tilde minus mu 72:14 tilde, so i, i. 72:19 Now, this guy here I'm going to call Wi. 72:23 72:27 And I have the Xi that's there, which 72:31 means that now the thing that I have here I can write as 72:38 follows. 72:39 Gradient ln of beta is equal to what? 72:46 Well, I'm going to write it in matrix forms. 72:49 So I have the sum over i of something multiplied by Xi. 72:53 So I'm going to write it as x transpose. 72:57 Then I'm going to have this matrix W1 Wn, and then 73:02 0 elsewhere. 73:05 And then I'm going to have my Y tilde minus mu. 73:09 And remember, X is the matrix with-- sorry, 73:15 it should be a bit [INAUDIBLE]. 73:16 I have n, and then p. 73:19 And here I have my Xi j in this matrix on row i and column j. 73:26 And this is just a matrix that has the Wi's on the diagonal. 73:30 And then I have Y tilde minus mu. 73:32 So this is just the matrix we're writing of this formula. 73:36 All right. 73:37 So it's just saying that if I look 73:38 at the sum of weighted things of my columns of Xi, 73:42 it's basically the same thing. 73:44 When I'm going to multiply this by my matrix, 73:46 I'm going to get exactly those terms, right? 73:48 Yi minus mu i tilde times Wi. 73:52 And then when I actually take this Xi transpose 73:56 times this guy, I'm really just getting 73:58 the sum of the columns with the weights, right? 74:04 Agree? 74:05 If I look at this thing here, this 74:08 is a vector that has S coordinates, Wi times Yi 74:15 tilde minus mu i tilde. 74:19 And I have n of them. 74:21 So when I multiply X transpose by this guy, 74:24 I'm just looking at a weighted sum of the columns of X 74:28 transpose, which is a weighted sum of the rows of X, 74:31 which are exactly my Xi's. 74:34 All right, and that's this weighted sum of the Xi's. 74:36 74:40 OK. 
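The stacking step here, rewriting the sum of W_i (Ytilde_i - mutilde_i) X_i as X^T W (Ytilde - mutilde) with W = diag(W_1, ..., W_n), can be sanity-checked numerically. The numbers below are arbitrary placeholders; only the identity between the sum form and the matrix form matters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.normal(size=(n, p))            # rows are X_1^T, ..., X_n^T
w = rng.uniform(0.1, 1.0, size=n)      # the weights W_1, ..., W_n
resid = rng.normal(size=n)             # stands in for Ytilde_i - mutilde_i

# Sum form:  sum_i  W_i (Ytilde_i - mutilde_i) X_i
grad_sum = sum(w[i] * resid[i] * X[i] for i in range(n))

# Matrix form:  X^T  diag(W_1, ..., W_n)  (Ytilde - mutilde)
grad_mat = X.T @ np.diag(w) @ resid

assert np.allclose(grad_sum, grad_mat)
```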
74:40 So here, as I said, the fact that we 74:43 decided to put this g prime of mu i here and g prime of mu i 74:48 here, we could have not done this, right? 74:50 We could have just said, I forget about the tilde 74:54 and just call it Yi minus mu i. 74:56 And here, I just put everything I don't know into some Wi. 75:00 And so why do I do this? 75:01 Well, it's because when I actually 75:03 start looking at the Hessian, what's going to happen? 75:06 AUDIENCE: [INAUDIBLE]. 75:08 PHILIPPE RIGOLLET: Yeah. 75:09 We'll do that next time. 75:10 But let's just look quickly at the outcome of the computation 75:14 of my Hessian. 75:16 So I compute a bunch of second derivatives. 75:20 And here, I have two terms, right? 75:21 Well, he's gone. 75:22 So I have two terms. 75:23 And when I take the expectation now, 75:26 it's going to actually change, right? 75:28 This thing is actually going to depend on Yi. 75:31 Because I have an H which is not the identity. 75:34 Oh, no, you're here, sorry. 75:37 So when I start looking at the expectation, 75:39 so I look at the conditional expectation given Xi. 75:42 The first term here has a Yi minus expectation. 75:46 So when I take the conditional expectation, 75:48 this is going to be 0. 75:49 The first term is going away when I take 75:50 the conditional expectation. 75:52 But this was actually gone already 75:54 if we had the canonical term, because the second derivative 75:56 of H when H is the identity is 0. 76:00 But if H is not the identity, H prime prime may not be 0. 76:03 And so I need that part to remove that term. 76:07 And so now, you know, I work a little bit, 76:09 and I get this term. 76:10 That's not very surprising. 76:11 In the second derivative, I see I have terms in b prime prime. 76:15 I have term in H prime, but squared. 76:18 And then I have my Xi outer Xi, Xi, Xi transpose, 76:21 which we know we would see. 76:23 OK. 76:23 So we'll go through those things next time. 76:26 But what I want to show you is that now once I compute this, 76:31 I can actually show that if I look at this product that 76:35 showed up, I had b prime prime times H prime squared. 76:40 One of those terms is actually 1 over g prime. 76:44 And so I can rewrite it as one of the H primes, 76:46 because I had a square, divided by g prime. 76:49 And now, I have this Xi Xi transpose. 76:51 So if I did not put the g prime in the W 76:57 that I put here completely artificially, 77:00 I would not be able to call this guy Wi, which is exactly 77:04 what it is from this board. 77:06 And now that this guy is Wi, I can actually write this thing 77:09 here as X transpose WX. 77:14 OK? 77:15 And that's why I really wanted my W 77:17 to have this g prime of mu i in the denominator. 77:20 Because now I can actually write a term that depends on W. 77:24 Now, you might say, how do I reconcile those two things? 77:26 What the hell are you doing? 77:28 And what the hell I'm doing is essentially 77:29 that I'm saying that if you write beta k according 77:35 to the Fisher-scoring iterations, 77:37 you can actually write it as just this term here, 77:41 which is of the form X transpose X inverse X transpose Y. 77:46 But I actually squeezed in these W's. 77:50 And that's actually a weighted least square. 77:52 And it's applied to this particular guy. 77:53 So we'll talk about those weighted least squares. 77:56 But remember, least squares is of the form-- 77:58 beta hat is X transpose X inverse X transpose Y. 
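Putting the two boards together, each Fisher-scoring step is a weighted least-squares fit, which is what "iteratively re-weighted least squares" means in practice. Below is a minimal sketch for Poisson regression with the log link, so the link is canonical, h' = 1 and phi = 1; the adjusted response z and the function name irls_poisson are my own notation for the step, not something stated verbatim in the lecture.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-10):
    """Iteratively re-weighted least squares for Poisson regression
    with the log link (canonical link, phi = 1).

    Each step is the weighted least-squares fit
        beta <- (X^T W X)^{-1} X^T W z,
    with weights W_i = 1 / g'(mu_i) and adjusted response
        z_i = X_i^T beta + g'(mu_i) (Y_i - mu_i).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                # g(mu) = log mu  =>  mu = e^{eta}
        gprime = 1.0 / mu               # g'(mu) = 1 / mu
        W = 1.0 / gprime                # = mu in the Poisson/log case
        z = eta + gprime * (y - mu)     # adjusted ("working") response
        XtW = X.T * W                   # X^T with column i scaled by W_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On data simulated from the model, a handful of these weighted least-squares steps is typically enough for convergence.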
78:02 And here it's basically the same thing, 78:04 except that I squeeze in some W after my X transpose. 78:09 OK. 78:09 So that's how we're going to solve it. 78:12 I don't want to go into the details 78:14 now, mostly because we're running out of time. 78:18 Are there any questions?
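To make that last remark concrete: the weighted least-squares estimator is the ordinary least-squares formula with a diagonal weight matrix W squeezed in after each X transpose. A two-line comparison, with arbitrary placeholder data, looks like this.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
W = np.diag(rng.uniform(0.5, 2.0, size=n))    # diagonal weight matrix

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)           # (X^T X)^{-1} X^T y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # (X^T W X)^{-1} X^T W y
```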