https://www.youtube.com/watch?v=lWW54ts9Ubo&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=22

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So I apologize, my voice is not 100%. If you don't understand what I'm saying, please ask me.

Last time we described a second-order method to optimize the log-likelihood in a generalized linear model, when the parameter of interest is beta. So let me rewrite the whole thing in terms of beta. At iteration k+1, beta^(k+1) is given by beta^(k), and then a plus sign. If you think of the Fisher information at beta^(k) as a number and ask whether it is positive or negative, it is positive, because it is a positive semi-definite matrix; and since we are doing gradient ascent, we have a plus sign. The step is then the inverse Fisher information times the gradient of the log-likelihood at beta^(k). Those are the iterations we are trying to implement, and we could just do this: at each iteration, compute the Fisher information, apply its inverse to the gradient, and repeat. That is called the Fisher scoring algorithm, and I told you that it converges. What we are going to do in this lecture is show how to re-implement it using iteratively reweighted least squares, so that each step of the algorithm consists simply of solving a weighted least squares problem.

So let's quickly remind ourselves that we are in the exponential family. The log-likelihood for one observation, summed over the n observations, is the sum from i = 1 to n of [y_i theta_i - b(theta_i)] / phi + c(y_i, phi); the exponential went away when I took the log of the likelihood, and I sum over all n observations.

Then we had a bunch of formulas. The expectation of y_i -- really the conditional expectation of y_i given x_i, but here that just means it is different for each i -- is denoted by mu_i, and we showed that it equals b'(theta_i). The next equation is the model itself: what we want to model is this mean, and we want g of it to equal x_i^T beta, that is, g(mu_i) = x_i^T beta. And we also had that the variance is given by the second derivative of b; I'm not going to go into that.
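For reference, here is the setup so far collected in one place; this is my rendering of the boards in standard notation, with phi the dispersion parameter.

```latex
\begin{align*}
\beta^{(k+1)} &= \beta^{(k)} + I\big(\beta^{(k)}\big)^{-1}\,\nabla \ell_n\big(\beta^{(k)}\big) \\
\ell_n &= \sum_{i=1}^{n}\left[\frac{y_i\,\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi)\right] \\
\mu_i &= \mathbb{E}[y_i \mid x_i] = b'(\theta_i), \qquad g(\mu_i) = x_i^{\top}\beta
\end{align*}
```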
What is actually interesting is to express theta_i as a function of x_i. Going from x_i^T beta to mu_i by g inverse, and then from mu_i to theta_i by (b') inverse, we get that theta_i = h(x_i^T beta), where h is the composition (g o b') inverse, that is, (b') inverse applied after g inverse. Those are all computations we made last time, but they are going to be useful in today's derivation.

The first thing we did last time was to show that, if I look at the derivative of the log-likelihood with respect to one coordinate of beta -- which gives me the gradient if I do it for all coordinates -- I can rewrite it as a sum of (y_i tilde - mu_i tilde) terms. Remember that y_i tilde is y_i times g'(mu_i), and mu_i tilde is mu_i times g'(mu_i). That was an artificial step, so that we could divide the weights by g'; what really builds the weights is this h', plus a normalization factor. So if I also write w_i = h'(x_i^T beta) / (g'(mu_i) phi), then I can rewrite my gradient, which is a vector, in the following matrix form: the gradient of the log-likelihood at beta is X^T W (y tilde - mu tilde), where W is the matrix with w_1, w_2, ..., w_n on the diagonal and 0 off the diagonal.

So that was just taking the derivative and doing a slight manipulation: divide whatever is here by g' and multiply whatever is there by g'. Today we'll see why we made this division and multiplication by g', which seems to make no sense but actually comes from the Hessian computation. The Hessian computation is going to be a little more annoying. Let me start directly with the coordinate-wise derivative. To build the gradient, what we used in the end was that the partial derivative of the log-likelihood with respect to the j-th coordinate of beta equals the sum over i of (y_i tilde - mu_i tilde) times w_i times the j-th coordinate of x_i. Now let's take another derivative, and that will give us the entries of the Hessian. So what I want to compute is the second derivative with respect to beta_j and beta_k; here I already took the derivative with respect to beta_j, so this is just the derivative with respect to beta_k of the derivative with respect to beta_j. So I need to differentiate this expression with respect to beta_k. Where does beta_k show up here? It shows up in two places.

AUDIENCE: In the y's?

PHILIPPE RIGOLLET: No, it's not in the y's. The y's are my data, right?
But it is in the y tildes, because it's in mu: mu depends on beta, since mu_i is g inverse of x_i^T beta. And it's also in the w_i's. Actually, w depends on mu and on beta explicitly, but the rest depends only on mu, so we just have to be a little careful with the chain rule.

Actually, I should not write it in that tilde form here. That form came from multiplying and dividing by g'(mu), which makes my life miserable now: it makes something that depends on beta appear where it really should not. So let's look at the last-but-one equality instead, which was the sum over i of (y_i - mu_i) / phi times h'(x_i^T beta) times x_ij. That makes my life much simpler, because y_i does not depend on beta, while mu_i depends on beta and h'(x_i^T beta) depends on beta. So when I take the derivative with respect to beta_k I have to be a little careful, but it's just the derivative of a product, nothing more complicated.

So what is it? The sum is linear, so it comes out. Differentiating the first factor, I get 1 over phi, times the derivative of mu_i with respect to beta_k -- which I will just write as d mu_i / d beta_k for now -- times h'(x_i^T beta), times x_ij. And then I have the other term, which is (y_i - mu_i) over phi times the second derivative h''(x_i^T beta), times the derivative of x_i^T beta with respect to beta_k, which is just x_ik. So that term carries x_ij times x_ik.

So I still need to compute the derivative of mu_i with respect to beta_k. And mu is g of -- sorry, g inverse of x_i^T beta. So what do I get? I'm definitely going to get a second derivative of g in there somewhere. Let me see what makes my life easier -- well, nothing makes my life that much easier, so let's just write it: I'll put down g''(x_i^T beta) times x_ik.

So now, if I collect my terms: I have the sum from i = 1 to n; both terms have an x_ij, and both pull out an x_ik, so everybody carries x_ij x_ik. The 1 over phi I can pull out in front. And I'm left with the second derivative of g times the first derivative of h, both taken at x_i^T beta.
And then I have the (y_i - mu_i) term, times the second derivative of h taken at x_i^T beta.

But here I'm looking at Fisher scoring, not Newton's method, which means I get to take the expectation of the second derivative. So what happens when I take the expectation -- and when I say expectation, it's always conditionally on x_1, ..., x_n? The first term is deterministic given the x's. And what is the conditional expectation of (y_i - mu_i) times the rest, conditionally on x_i? It's 0, because mu_i is exactly the conditional expectation of y_i, and everything else depends only on x_i, so I can pull it out of the conditional expectation. So I'm left with only the first term, and it carries the x_ij x_ik.

Now I want to get to something slightly more convenient. What I want in the end is for each x_i to show up with a weight, and the weight should involve h' divided by g'. The reason I want to see g' come back is that I had g' in the original w: the weights here are exactly the same w's that I used when computing the gradient. So I need a g' to show up, and that's where a little computation comes in. Also, I'm missing the phi over there: there should be a phi in the weight as well.

Entry-wise, x_ij x_ik is exactly the (j, k) entry of x_i x_i^T; we've used that before. So if I want to write this in matrix form, it is a sum over i of something that depends on i, times x_i x_i^T: the expected Hessian reads (1 over phi) times the sum from i = 1 to n of g''(x_i^T beta) h'(x_i^T beta) x_i x_i^T. That's the entire matrix; before, we only had its (j, k) entries. And you can check that the (j, k) entry of x_i x_i^T is just the product of the j-th and k-th coordinates of x_i.

So now I need to do my rewriting. Can I write this? Hmm, I'm missing something here... Oh, I know where it's coming from. Mu is not g prime of x^T beta -- mu is g inverse of x^T beta. So the derivative of mu_i with respect to beta_k is not g''(x_i^T beta) times x_ik; it's 1 over g', evaluated at the right place, times x_ik, because the derivative of g inverse is 1 over g' composed with g inverse. I need you guys to catch these, OK? All right.
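Written out, the chain-rule step being corrected here is the following; this is my rendering, with x_{ik} denoting the k-th coordinate of x_i.

```latex
\mu_i = g^{-1}(x_i^{\top}\beta)
\quad\Longrightarrow\quad
\frac{\partial \mu_i}{\partial \beta_k}
= \big(g^{-1}\big)'(x_i^{\top}\beta)\,x_{ik}
= \frac{x_{ik}}{g'\big(g^{-1}(x_i^{\top}\beta)\big)}
= \frac{x_{ik}}{g'(\mu_i)}
```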
So now I'm going to have to rewrite this. The second term still goes away under the expectation, that doesn't change; but the factor in the surviving term now becomes h' over g' of g inverse of x_i^T beta -- same thing here, and here. Everybody approves? Well, now it's actually much nicer. What is g inverse of x_i^T beta? That was exactly the mistake I just made: it's mu_i itself. So the denominator is really g'(mu_i). So now I have something which looks like the sum from i = 1 to n of h'(x_i^T beta), divided by g'(mu_i) phi, times x_i x_i^T, which I can certainly write in matrix form as X^T W X, where W is exactly the same as before: the diagonal matrix of w_1, ..., w_n with w_i = h'(x_i^T beta) / (g'(mu_i) phi). That's the same as what we had here, and it's supposed to be the same as what we have on the slide, except that on the slide the phi is in white, which is why you don't see it. So it's actually simpler than what's on the slides, I guess.

Now, if you pay attention, I never forced this g'(mu_i) to be here; I even almost managed, through a mistake, not to have it. This g'(mu_i) shows up completely naturally. If I had started with this computation, you would never have questioned why I artificially multiplied and divided by g' in the gradient; it just shows up naturally in the weights. It's simply more natural to compute the first derivative first and the second derivative second, and we did it the other way around. But now, suppose we forgot all of that: we have this natural way of writing it, X^T W X. If I want the gradient to involve the same weights, I have to force them in by dividing by g'(mu_i) and, therefore, multiplying y_i and mu_i by g'(mu_i).

So, to recap what we've found: we also have that the expectation of the Hessian of the log-likelihood at beta came out as X^T W X -- let me write it here. So if I go back to my iterations over there, I should update beta^(k+1) to be beta^k plus the inverse of the Fisher information times the gradient, and with what's on the board this X^T W X would be equal to negative I(beta), I guess -- well, yeah. Oh, and beta here shows up in W, right? W depends on beta, so it gets evaluated at beta^k; let me call it W_k. So W_k is the diagonal matrix of h'(x_i^T beta^k) divided by g'(mu_i^k) phi, where this beta^k induces a mu through mu_i^k = g inverse of x_i^T beta^k -- that's an iteration-k quantity. And so now, if I write these things together, I get beta^k minus (X^T W_k X) inverse, and then my gradient evaluated at iteration k, which is X^T W_k times (y tilde^k minus mu tilde^k), where, again, the superscripts k are pretty natural.
y tilde^k just means, coordinate by coordinate, y_i times g'(mu_i^k); and the i-th coordinate of mu tilde^k is mu_i^k times g'(mu_i^k). So I just add superscripts k to everything, to remember that these things get updated as we go: every time I make one iteration, I get a new value for beta, a new value for mu, and therefore a new value for W. Yes?

AUDIENCE: [INAUDIBLE] the Fisher equation [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, that's a good point. That should definitely be a plus, because the Fisher information is a positive semi-definite matrix. So this is a plus. Let's see where I made my mistake. There should be a minus here, a minus here -- there should be a minus even at the beginning, I believe. You see, when we go back to the second-derivative computation, what I erased was this (y_i - mu_i) factor. The derivative of the second term was killed when we took the expectation. But when we took the derivative of the first term, the only one that stayed, the y_i went away, and differentiating (y_i - mu_i) with respect to beta_k produces a negative sign that I should have carried through the whole computation. So I just forgot this minus throughout: the term with (y_i - mu_i) went away because its conditional expectation given x_i is 0, and the surviving term has a minus sign in front that I dropped. So the expected Hessian is minus X^T W X, the Fisher information I(beta) is X^T W X, and the update is indeed a plus.

Any other mistake that I made? We're good? All right. So now, this is what we have: beta^(k+1) is equal to beta^k plus this thing. And if you look at this thing, it sort of reminds us of something -- remember the least squares estimator. Here I'm going to deviate slightly from the slides, and I'll tell you how. The slides take the beta^k and fold it in, so you think of the whole update as one big least squares solution. Or you can keep the beta^k, solve another least squares problem, and then add the result to the beta^k that you have. It's the same thing; I will take the second route, so you have both options.

So, when we did least squares -- open a parenthesis on least squares -- we had y = X beta + epsilon, and our estimator was beta hat = (X^T X) inverse X^T y; that was just solving the first-order condition. Now look at our update: X transpose something times X, inverse, then X transpose something, times something. OK?
So this looks exactly like the least squares formula on the left board if W_k were equal to the identity matrix, meaning we don't see it, and if y were equal to y tilde^k minus mu tilde^k. The fact that the response variable is different is really not a problem: we just pretend that the response is y tilde minus mu tilde. When you call a software routine that does least squares for you, you just tell it what y is and what X is, and it does the computation; so you would just tell it that the y you want is this thing. Then we need to somehow incorporate those weights, and the question is: is that easy to do? The answer is yes, because there is a standard setup in which exactly this arises.

One thing that was very specific to what we did with least squares is that, at least when we did the inference, we assumed that epsilon was normal with mean 0 and covariance matrix equal to the identity. What if the covariance matrix is not the identity? Then your maximum likelihood estimator is not exactly least squares. For a general covariance matrix you get another solution, which involves the inverse of that covariance matrix. And if your covariance matrix, in particular, is diagonal -- which means the observations in this system of equations are still independent, but the variance can change from one observation to another -- then the model is called heteroscedastic: "hetero" means "not the same", and "scedastic" refers to scale. In the heteroscedastic case you get something slightly different, and it makes sense: if you know that some observations have much less variance than others, you might want to give them more weight.

Think of the usual picture: you have a cluster of points on one side and a few points on the other, and you fit a line. If you run plain least squares, you get some line through all of them. But if I tell you that on one side the variance is 100, meaning those points are actually really far from the true line, while on the other side the variance is 1, meaning those points are close to the line you're looking for, then the line you should be fitting is the one driven by the low-variance points: do not trust the observations that have a lot of variance. So you need to incorporate that somehow, and the way you do it is weighted least squares.

So we're going to open a parenthesis on weighted least squares. It's not a fundamental statistical question, but it's useful for us, because it is exactly what will spit out something that looks like our update, with this matrix W in there. So let's go back in time for a second.
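To keep the target in view during this detour, here is the correspondence being set up, consolidated from the boards above in my notation:

```latex
\begin{align*}
\mathbb{E}\big[\nabla^2 \ell_n(\beta)\,\big|\,X\big] &= -\,X^{\top} W X,
\qquad
\nabla \ell_n(\beta) = X^{\top} W\,(\tilde y - \tilde\mu),
\qquad
W = \mathrm{diag}\!\left(\frac{h'(x_i^{\top}\beta)}{g'(\mu_i)\,\phi}\right) \\
\beta^{(k+1)} &= \beta^{(k)} + \big(X^{\top} W_k X\big)^{-1} X^{\top} W_k\,\big(\tilde y^{(k)} - \tilde\mu^{(k)}\big)
\end{align*}
```

The correction term in the second line is exactly a weighted least squares solution with design X, response y tilde^(k) minus mu tilde^(k), and weights W_k, which is what the detour below establishes.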
Assume we're still covering least squares regression. Now I'm going to assume that y = X beta + epsilon, but this time epsilon is a multivariate Gaussian -- one coordinate per observation -- with mean 0 and a covariance matrix that I will write as W inverse, because W is the thing that's going to show up. This is the so-called heteroscedastic case. That's how it's spelled, and it's yet another name you can pick for your soccer team or a cappella group.

So let's actually compute the maximum likelihood estimator for this problem. What is the log-likelihood? Well, what is the density of a multivariate Gaussian with mean X beta and covariance matrix W inverse? It's of the form 1 over the square root of det(W inverse) times (2 pi) to the dimension, times the exponential of minus (x - X beta) transpose W (x - X beta) over 2 -- W being the inverse of W inverse. So it's the generic (x - mu) transpose Sigma inverse (x - mu) over 2. And if you want a sanity check, just take the one-dimensional case -- yes?

AUDIENCE: Is it x minus X beta or y?

PHILIPPE RIGOLLET: Well, if you want this to be y, then it's y, right? Sure -- yeah, maybe that's less confusing. So do the sanity check in dimension 1. What does it mean? This W is just 1 over sigma squared, so what you see in the exponent is (y - x beta) squared over 2 sigma squared, like we usually see it. The determinant of W inverse is just the single entry of the 1-by-1 matrix, which is sigma squared, and taking the square root together with the 2 pi I get sigma times square root of 2 pi, which is the usual normalization.

That normalization is not going to matter, because when I look at the log-likelihood as a function of beta -- I'm assuming W is known -- what I get is a constant, minus one half the log of det(W inverse) times (2 pi) to the n, which won't matter when I maximize; and then minus 1/2 times (y - X beta) transpose W (y - X beta). So to take the maximum of this, I have to take the minimum of that quadratic term: we need to compute the minimum over beta of (y - X beta) transpose W (y - X beta). And the solution that you get -- well, you can check it for yourself by taking the derivative, but the way you can see it quickly is by doing the following.
If you're lazy and you don't want to redo the entire derivation -- maybe I should keep that expression on the board. W is diagonal, and I'm going to assume that no variance is equal to 0 and no variance is equal to infinity, so that both W inverse and W have only positive entries on the diagonal. In particular, I can talk about the square root of W, which is just the diagonal matrix with the square roots on the diagonal. So I want to minimize over beta (y - X beta) transpose W (y - X beta), and I'm going to write W as square root of W times square root of W, which I can: if W is the diagonal matrix of w_1, ..., w_n, then square root of W is just the diagonal matrix of sqrt(w_1), ..., sqrt(w_n), with 0 elsewhere, and the usual matrix product of those two gives me back W.

Now I push one square root to one side and the other to the other side. Here I have a transpose, so I have to put it on the other side, but W is clearly symmetric and so is square root of W, so the transpose doesn't matter. And so what I'm left with is the minimum over beta of (square root of W times y, minus square root of W times X beta) transpose, times itself. But this is the same thing we've been doing before: this is a new y, call it y prime, and a new X, call it X prime, and this is just the least squares problem associated with response y prime and design matrix X prime. So I know the solution is (X prime transpose X prime) inverse X prime transpose y prime. Now I substitute back what X prime is in terms of X and what y prime is in terms of y, and that gives me exactly (X transpose square root W square root W X) inverse, times X transpose square root W, times square root W y. And that's exactly what I wanted: I'm left with (X transpose W X) inverse X transpose W y.

So that's a simple way to take the W into account. And you could actually do it with any weight matrix that is positive semi-definite, because you can talk about the square root of such a matrix: the square root of a matrix is just a matrix such that, when you multiply it by itself, it gives you back the original matrix.

So this was just a shortcut that consisted in saying: maybe I don't want to recompute the gradient of this quantity, set it equal to 0, and see what beta hat should be. Instead, I assume I already know how to solve the problem when the W is not there, and that's exactly what I did: I know the minimizer of the primed problem, and then I substitute my W back in. All right.
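As a quick numerical sanity check of this square-root trick, here is a minimal sketch in Python; the data, weights, and variable names are made up for illustration. It computes the weighted least squares estimator both from the closed form (X^T W X)^{-1} X^T W y and by running ordinary least squares on (sqrt(W) X, sqrt(W) y), and verifies that the two agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design with an intercept
beta_true = np.array([1.0, -2.0])
sigma = np.where(np.arange(n) < n // 2, 10.0, 1.0)       # heteroscedastic noise levels
y = X @ beta_true + sigma * rng.normal(size=n)

w = 1.0 / sigma**2            # weights = inverse variances, so diag(w) = Cov(epsilon)^{-1}
sqrt_w = np.sqrt(w)

# Closed form: (X' W X)^{-1} X' W y
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Same estimator via ordinary least squares on (sqrt(W) X, sqrt(W) y)
beta_sqrt, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * y, rcond=None)

print(np.allclose(beta_wls, beta_sqrt))   # True: the two computations agree
```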
So that was just the lazy computation. But again, if you don't like it, you can always take the gradient of the quadratic form directly. Yes?

AUDIENCE: Why is the solution written in the slides different?

PHILIPPE RIGOLLET: Because there's a mistake. Yeah, there's a mistake on the slides. How did I make that one? I'm actually trying to parse it back... I mean, it's clearly wrong, right? Oh, no, it's not... no, it is. It is clearly wrong, because if I put the identity here, the product is not even compatible -- the dimensions don't match. So it's wrong; there's just this extra parenthesis that I probably copy-pasted from some place. Since this is one of my latest slides, I'll just color it in white. But yeah, sorry, that parenthesis should not be there. Thank you.

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: So why not square root [INAUDIBLE]?

PHILIPPE RIGOLLET: Because I have two of them: one that comes from the X prime that's here, and one that comes from the y prime here. OK, so the solution -- let's write it somewhere legible -- the corrected formula is (X transpose W X) inverse X transpose W y. You just squeeze this W in there, and that's exactly the shape we had before: X transpose W X, inverse, times X transpose W times some y. And what I claim is that this is routinely implemented. As you can imagine, heteroscedastic linear regression is something very common, so every time you have a least squares routine, you also have a way to put in some weights. They don't have to be diagonal weights in general, but here that's all we need.

So on the slides, again, I took the beta^k and folded it in, so that there is only one least squares solution to formulate. But let's do it slightly differently here. What I'm going to do now is say: OK, let's feed it to a weighted least squares solver, with response y being y tilde^k minus mu tilde^k, design matrix being just X itself -- that doesn't change -- and the weights being the w^k I had here: w_i^k = h'(x_i^T beta^k) divided by g'(mu_i^k) times phi. If I solve this, it will spit out a solution, which I will call u hat^(k+1). And to get beta^(k+1), all I need to do is take beta^k plus u hat^(k+1). And that's because -- it may not be clear from here, but remember, I started from that expression over there: the weighted least squares problem I'm solving is exactly the correction term. That's what I call u hat^(k+1); I add it to beta^k, and that gives me beta^(k+1). So I just have this intermediate step, which is removed in the slides.

So then you repeat until convergence. What does it mean to repeat until convergence?

AUDIENCE: [INAUDIBLE]?
PHILIPPE RIGOLLET: Yeah, exactly. You just set some threshold. I promise you that this converges, so you know that at some point you're going to be essentially there -- you keep getting closer, but you're never exactly there -- and so you just say, OK, I want this accuracy. Actually, machine precision is a little strong, especially if you have, say, 10 observations to start with: you know your estimate will have some statistical error anyway, and that should guide the kind of numerical error you're willing to make. For example, a good rule of thumb is that, with n observations, if you require the L2 distance between two consecutive betas to be less than 1/n, you should be good enough. It doesn't have to be machine precision.

And it's clear how we do this, right? At every step I have to recompute a bunch of things. I have to recompute the weights; but to recompute the weights, I need not only the previous iterate but also how the previous iterate impacts my means. So at each step I have to recalculate mu_i^k by applying g inverse -- remember, mu_i^k is just g inverse of x_i^T beta^k. Then I use this to compute my weights, and I also use it to compute my response, since my y depends on g'(mu_i^k) as well. I feed that to my weighted least squares engine; it spits out the u hat^(k+1), which I add to my previous beta^k, and that gives me my new beta^(k+1).

So here's the pseudocode, if you want to take some time to parse it. All right. So here, again, the trick is not much. It just says: if you don't feel like implementing Fisher scoring, or inverting your Hessian at every step, then weighted least squares will do it for you automatically. It's just a numerical trick; there's nothing really statistical about it, except that the formula we have to solve at each step reminded us of least squares with some extra weights.

So, to conclude: what do we need to know to run this? We need X and y, and the link function. Why would we need the variance function? Actually, I'm not sure we need the variance function -- no, I don't know why I said that; you need phi, not the variance function. And where do you start? Clearly, if you start very close to your solution, you're going to do much better. For beta itself it's not clear what a good start is, but you can get a good idea of beta by first getting a good idea of mu, because mu is g inverse of x_i^T beta. So what you can do is set mu to be the actual observations you have, because that's the best guess you have for their expected value. And then you say: OK, once I have my mu, I know that my mu is tied to x_i^T beta through g.
So I can write g of mu and solve for beta using a least squares estimator: g(mu) should be of the form X beta, so once you have your mu, you pass it through g, and then you solve for the beta that you want, and that's the beta you initialize with.

And actually -- this was your question from last time -- as soon as I use the canonical link, Fisher scoring and Newton-Raphson are the same thing, because the Hessian is deterministic in that case. With the canonical link, h is the identity, which means its second derivative is 0, so the term we dropped goes away even without taking the expectation. Remember, the term that went away was of the form (y_i - mu_i) over phi times h''(x_i^T beta) -- the term whose conditional expectation we said was 0. But if h'' is already equal to 0, there is nothing left to kill: it was already 0. And that always happens with the canonical link, because h is (g composed with b') inverse, and the canonical link is g = (b') inverse, so the composition is the identity, and the second derivative of the identity map is 0.

OK. My screen says end of show, so we can take some questions.

AUDIENCE: I just wanted to clarify: what does it stand for -- iteratively...

PHILIPPE RIGOLLET: Reweighted least squares.

AUDIENCE: Reweighted least squares is an implementation of the Fisher scoring [INAUDIBLE]?

PHILIPPE RIGOLLET: It's an implementation that just makes calls to weighted least squares oracles. It's sometimes called an oracle: something you assume the machine can do easily for you. If you assume your machine is very good at multiplying by the inverse of a matrix, you might as well do Fisher scoring yourself; this is just a way to avoid doing that. And usually these routines are implemented -- I just said routinely -- and implemented very efficiently in statistical software, so this is going to be one of the fastest ways to carry out this step, especially for large-scale problems.

AUDIENCE: So the thing that computers can do well is the multiplication [INAUDIBLE]. What's the thing that computers can do fast, and what's the thing that [INAUDIBLE]?

PHILIPPE RIGOLLET: If you were to do this in the simplest possible way, your iteration for, say, Fisher scoring is just: multiply by the inverse of the Fisher information.

AUDIENCE: So finding that inverse is slow?

PHILIPPE RIGOLLET: Yeah, it takes a bit of time. Whereas, since you know you're going to multiply that inverse directly into something, you can solve the corresponding linear system instead, and plain matrix inversion is not as optimized as solving least squares. Actually, the way it's typically done under the hood is by doing some least squares anyway, so you might as well just do the least squares you like. And there's also less recalculation, right?
With Fisher scoring, you would have to recompute the entire Fisher information matrix at every step, whereas here you don't have to: you really just have to compute some vectors and the vector of weights. The Fisher information matrix is symmetric, and you would have to recompute all of its entries; here, the only things you update, if you think about it, are the diagonal elements of the weight matrix and the vectors that go with them. So that's much less for you to compute; the least squares routine does the rest for you, in a sense.

Any other questions? Yeah?

AUDIENCE: So if I have a data set [INAUDIBLE], then I can always try to model it with least squares, right?

PHILIPPE RIGOLLET: Yeah, you can.

AUDIENCE: And so this is like setting my weights equal to the identity, essentially, right?

PHILIPPE RIGOLLET: Well, not exactly, because the g also shows up in this correction that you have here. I'm not sure exactly what you mean by--

AUDIENCE: I'm just trying to say: are there ever situations where I'm trying to model a data set and I would want to pick my weights in a particular way?

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: OK.

PHILIPPE RIGOLLET: I mean--

AUDIENCE: [INAUDIBLE] example [INAUDIBLE].

PHILIPPE RIGOLLET: Well, there's the heteroscedastic case, for sure, if you're going to compute those things. And more generally, I don't think you should think of those only as weights; you should really think of them as matrices that you invert. Don't think of the weight matrix as diagonal -- think of it as a full matrix. When we wrote weighted least squares here, the W, I said, is diagonal, but the computations never really used that fact. What shows up there is just the inverse of your covariance matrix. So if you have data that's correlated, this is where it's going to show up.
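To tie the lecture together, here is a minimal sketch of the IRLS loop described above, written in Python and specialized at the end to Poisson regression with the canonical log link (so h is the identity and phi = 1). The function and variable names are my own, and the initialization nudge for zero counts is an ad hoc choice, not something prescribed in the lecture.

```python
import numpy as np

def irls_glm(X, y, g, g_inv, g_prime, h_prime, phi=1.0, tol=None, max_iter=100):
    """Iteratively reweighted least squares in the lecture's notation:
    mu_i = g^{-1}(x_i' beta), weights w_i = h'(x_i' beta) / (g'(mu_i) * phi),
    working response z_i = (y_i - mu_i) * g'(mu_i), and beta <- beta + WLS solution."""
    n = X.shape[0]
    if tol is None:
        tol = 1.0 / n                        # rule of thumb mentioned in the lecture
    # Initialize: guess mu ~ y, then solve g(mu) ~ X beta by ordinary least squares.
    mu0 = np.maximum(y.astype(float), 0.5)   # nudge zero counts so g is defined (Poisson-specific)
    beta, *_ = np.linalg.lstsq(X, g(mu0), rcond=None)
    for _ in range(max_iter):
        mu = g_inv(X @ beta)
        w = h_prime(X @ beta) / (g_prime(mu) * phi)   # diagonal of W_k
        z = (y - mu) * g_prime(mu)                    # y_tilde^k - mu_tilde^k
        sqrt_w = np.sqrt(w)
        # Weighted least squares via the sqrt(W) trick: u = (X' W X)^{-1} X' W z.
        u, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * z, rcond=None)
        beta = beta + u
        if np.linalg.norm(u) < tol:          # stop when consecutive betas are close
            break
    return beta

# Poisson regression, canonical log link:
# g(mu) = log(mu), g^{-1}(eta) = exp(eta), g'(mu) = 1/mu, h = identity, phi = 1.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = irls_glm(X, y, g=np.log, g_inv=np.exp,
                    g_prime=lambda m: 1.0 / m,
                    h_prime=lambda eta: np.ones_like(eta))
print(beta_hat)   # should land near beta_true
```

Because the link here is canonical, this loop coincides with Newton-Raphson, as noted at the end of the lecture.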