https://www.youtube.com/watch?v=lWW54ts9Ubo&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0&index=22

Transcript

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So I apologize, my voice is not 100%. If you don't understand what I'm saying, please ask me.

Last time we described a second-order method to optimize the log-likelihood in a generalized linear model, when the parameter of interest is beta. So let me rewrite the whole thing in terms of beta. At iteration k+1, beta^(k+1) is given by beta^(k), and then a plus sign. If you think of the Fisher information at beta^(k) as a number and ask whether it is positive or negative, it is positive, because it is a positive semi-definite matrix; and since we are doing gradient ascent, we have a plus sign. The step is then the inverse Fisher information times the gradient of the log-likelihood at beta^(k). Those are the iterations we are trying to implement, and we could just do this: at each iteration, compute the Fisher information, apply its inverse to the gradient, and repeat. That is called the Fisher scoring algorithm, and I told you that it converges. What we are going to do in this lecture is show how to re-implement it using iteratively reweighted least squares, so that each step of the algorithm consists simply of solving a weighted least squares problem.

So let's quickly remind ourselves that we are in the exponential family. The log-likelihood for one observation, summed over the n observations, is the sum from i = 1 to n of [y_i theta_i - b(theta_i)] / phi + c(y_i, phi); the exponential went away when I took the log of the likelihood, and I sum over all n observations.

Then we had a bunch of formulas. The expectation of y_i -- really the conditional expectation of y_i given x_i, but here that just means it is different for each i -- is denoted by mu_i, and we showed that it equals b'(theta_i). The next equation is the model itself: what we want to model is this mean, and we want g of it to equal x_i^T beta, that is, g(mu_i) = x_i^T beta. And we also had that the variance is given by the second derivative of b; I'm not going to go into that.
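For reference, here is the setup so far collected in one place; this is my rendering of the boards in standard notation, with phi the dispersion parameter.

```latex
\begin{align*}
\beta^{(k+1)} &= \beta^{(k)} + I\big(\beta^{(k)}\big)^{-1}\,\nabla \ell_n\big(\beta^{(k)}\big) \\
\ell_n &= \sum_{i=1}^{n}\left[\frac{y_i\,\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi)\right] \\
\mu_i &= \mathbb{E}[y_i \mid x_i] = b'(\theta_i), \qquad g(\mu_i) = x_i^{\top}\beta
\end{align*}
```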
What is actually interesting is to express theta_i as a function of x_i. Going from x_i^T beta to mu_i by g inverse, and then from mu_i to theta_i by (b') inverse, we get that theta_i = h(x_i^T beta), where h is the composition (g o b') inverse, that is, (b') inverse applied after g inverse. Those are all computations we made last time, but they are going to be useful in today's derivation.

The first thing we did last time was to show that, if I look at the derivative of the log-likelihood with respect to one coordinate of beta -- which gives me the gradient if I do it for all coordinates -- I can rewrite it as a sum of (y_i tilde - mu_i tilde) terms. Remember that y_i tilde is y_i times g'(mu_i), and mu_i tilde is mu_i times g'(mu_i). That was an artificial step, so that we could divide the weights by g'; what really builds the weights is this h', plus a normalization factor. So if I also write w_i = h'(x_i^T beta) / (g'(mu_i) phi), then I can rewrite my gradient, which is a vector, in the following matrix form: the gradient of the log-likelihood at beta is X^T W (y tilde - mu tilde), where W is the matrix with w_1, w_2, ..., w_n on the diagonal and 0 off the diagonal.

So that was just taking the derivative and doing a slight manipulation: divide whatever is here by g' and multiply whatever is there by g'. Today we'll see why we made this division and multiplication by g', which seems to make no sense but actually comes from the Hessian computation. The Hessian computation is going to be a little more annoying. Let me start directly with the coordinate-wise derivative. To build the gradient, what we used in the end was that the partial derivative of the log-likelihood with respect to the j-th coordinate of beta equals the sum over i of (y_i tilde - mu_i tilde) times w_i times the j-th coordinate of x_i. Now let's take another derivative, and that will give us the entries of the Hessian. So what I want to compute is the second derivative with respect to beta_j and beta_k; here I already took the derivative with respect to beta_j, so this is just the derivative with respect to beta_k of the derivative with respect to beta_j. So I need to differentiate this expression with respect to beta_k. Where does beta_k show up here? It shows up in two places.

AUDIENCE: In the y's?

PHILIPPE RIGOLLET: No, it's not in the y's. The y's are my data, right?
But it is in the y tildes, because it's in mu: mu depends on beta, since mu_i is g inverse of x_i^T beta. And it's also in the w_i's. Actually, w depends on mu and on beta explicitly, but the rest depends only on mu, so we just have to be a little careful with the chain rule.

Actually, I should not write it in that tilde form here. That form came from multiplying and dividing by g'(mu), which makes my life miserable now: it makes something that depends on beta appear where it really should not. So let's look at the last-but-one equality instead, which was the sum over i of (y_i - mu_i) / phi times h'(x_i^T beta) times x_ij. That makes my life much simpler, because y_i does not depend on beta, while mu_i depends on beta and h'(x_i^T beta) depends on beta. So when I take the derivative with respect to beta_k I have to be a little careful, but it's just the derivative of a product, nothing more complicated.

So what is it? The sum is linear, so it comes out. Differentiating the first factor, I get 1 over phi, times the derivative of mu_i with respect to beta_k -- which I will just write as d mu_i / d beta_k for now -- times h'(x_i^T beta), times x_ij. And then I have the other term, which is (y_i - mu_i) over phi times the second derivative h''(x_i^T beta), times the derivative of x_i^T beta with respect to beta_k, which is just x_ik. So that term carries x_ij times x_ik.

So I still need to compute the derivative of mu_i with respect to beta_k. And mu is g of -- sorry, g inverse of x_i^T beta. So what do I get? I'm definitely going to get a second derivative of g in there somewhere. Let me see what makes my life easier -- well, nothing makes my life that much easier, so let's just write it: I'll put down g''(x_i^T beta) times x_ik.

So now, if I collect my terms: I have the sum from i = 1 to n; both terms have an x_ij, and both pull out an x_ik, so everybody carries x_ij x_ik. The 1 over phi I can pull out in front. And I'm left with the second derivative of g times the first derivative of h, both taken at x_i^T beta.
And then I have the (y_i - mu_i) term, times the second derivative of h taken at x_i^T beta.

But here I'm looking at Fisher scoring, not Newton's method, which means I get to take the expectation of the second derivative. So what happens when I take the expectation -- and when I say expectation, it's always conditionally on x_1, ..., x_n? The first term is deterministic given the x's. And what is the conditional expectation of (y_i - mu_i) times the rest, conditionally on x_i? It's 0, because mu_i is exactly the conditional expectation of y_i, and everything else depends only on x_i, so I can pull it out of the conditional expectation. So I'm left with only the first term, and it carries the x_ij x_ik.

Now I want to get to something slightly more convenient. What I want in the end is for each x_i to show up with a weight, and the weight should involve h' divided by g'. The reason I want to see g' come back is that I had g' in the original w: the weights here are exactly the same w's that I used when computing the gradient. So I need a g' to show up, and that's where a little computation comes in. Also, I'm missing the phi over there: there should be a phi in the weight as well.

Entry-wise, x_ij x_ik is exactly the (j, k) entry of x_i x_i^T; we've used that before. So if I want to write this in matrix form, it is a sum over i of something that depends on i, times x_i x_i^T: the expected Hessian reads (1 over phi) times the sum from i = 1 to n of g''(x_i^T beta) h'(x_i^T beta) x_i x_i^T. That's the entire matrix; before, we only had its (j, k) entries. And you can check that the (j, k) entry of x_i x_i^T is just the product of the j-th and k-th coordinates of x_i.

So now I need to do my rewriting. Can I write this? Hmm, I'm missing something here... Oh, I know where it's coming from. Mu is not g prime of x^T beta -- mu is g inverse of x^T beta. So the derivative of mu_i with respect to beta_k is not g''(x_i^T beta) times x_ik; it's 1 over g', evaluated at the right place, times x_ik, because the derivative of g inverse is 1 over g' composed with g inverse. I need you guys to catch these, OK? All right.
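Written out, the chain-rule step being corrected here is the following; this is my rendering, with x_{ik} denoting the k-th coordinate of x_i.

```latex
\mu_i = g^{-1}(x_i^{\top}\beta)
\quad\Longrightarrow\quad
\frac{\partial \mu_i}{\partial \beta_k}
= \big(g^{-1}\big)'(x_i^{\top}\beta)\,x_{ik}
= \frac{x_{ik}}{g'\big(g^{-1}(x_i^{\top}\beta)\big)}
= \frac{x_{ik}}{g'(\mu_i)}
```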
So now I'm going to have to rewrite this. The second term still goes away under the expectation, that doesn't change; but the factor in the surviving term now becomes h' over g' of g inverse of x_i^T beta -- same thing here, and here. Everybody approves? Well, now it's actually much nicer. What is g inverse of x_i^T beta? That was exactly the mistake I just made: it's mu_i itself. So the denominator is really g'(mu_i). So now I have something which looks like the sum from i = 1 to n of h'(x_i^T beta), divided by g'(mu_i) phi, times x_i x_i^T, which I can certainly write in matrix form as X^T W X, where W is exactly the same as before: the diagonal matrix of w_1, ..., w_n with w_i = h'(x_i^T beta) / (g'(mu_i) phi). That's the same as what we had here, and it's supposed to be the same as what we have on the slide, except that on the slide the phi is in white, which is why you don't see it. So it's actually simpler than what's on the slides, I guess.

Now, if you pay attention, I never forced this g'(mu_i) to be here; I even almost managed, through a mistake, not to have it. This g'(mu_i) shows up completely naturally. If I had started with this computation, you would never have questioned why I artificially multiplied and divided by g' in the gradient; it just shows up naturally in the weights. It's simply more natural to compute the first derivative first and the second derivative second, and we did it the other way around. But now, suppose we forgot all of that: we have this natural way of writing it, X^T W X. If I want the gradient to involve the same weights, I have to force them in by dividing by g'(mu_i) and, therefore, multiplying y_i and mu_i by g'(mu_i).

So, to recap what we've found: we also have that the expectation of the Hessian of the log-likelihood at beta came out as X^T W X -- let me write it here. So if I go back to my iterations over there, I should update beta^(k+1) to be beta^k plus the inverse of the Fisher information times the gradient, and with what's on the board this X^T W X would be equal to negative I(beta), I guess -- well, yeah. Oh, and beta here shows up in W, right? W depends on beta, so it gets evaluated at beta^k; let me call it W_k. So W_k is the diagonal matrix of h'(x_i^T beta^k) divided by g'(mu_i^k) phi, where this beta^k induces a mu through mu_i^k = g inverse of x_i^T beta^k -- that's an iteration-k quantity. And so now, if I write these things together, I get beta^k minus (X^T W_k X) inverse, and then my gradient evaluated at iteration k, which is X^T W_k times (y tilde^k minus mu tilde^k), where, again, the superscripts k are pretty natural.
y tilde^k just means, coordinate by coordinate, y_i times g'(mu_i^k); and the i-th coordinate of mu tilde^k is mu_i^k times g'(mu_i^k). So I just add superscripts k to everything, to remember that these things get updated as we go: every time I make one iteration, I get a new value for beta, a new value for mu, and therefore a new value for W. Yes?

AUDIENCE: [INAUDIBLE] the Fisher equation [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, that's a good point. That should definitely be a plus, because the Fisher information is a positive semi-definite matrix. So this is a plus. Let's see where I made my mistake. There should be a minus here, a minus here -- there should be a minus even at the beginning, I believe. You see, when we go back to the second-derivative computation, what I erased was this (y_i - mu_i) factor. The derivative of the second term was killed when we took the expectation. But when we took the derivative of the first term, the only one that stayed, the y_i went away, and differentiating (y_i - mu_i) with respect to beta_k produces a negative sign that I should have carried through the whole computation. So I just forgot this minus throughout: the term with (y_i - mu_i) went away because its conditional expectation given x_i is 0, and the surviving term has a minus sign in front that I dropped. So the expected Hessian is minus X^T W X, the Fisher information I(beta) is X^T W X, and the update is indeed a plus.

Any other mistake that I made? We're good? All right. So now, this is what we have: beta^(k+1) is equal to beta^k plus this thing. And if you look at this thing, it sort of reminds us of something -- remember the least squares estimator. Here I'm going to deviate slightly from the slides, and I'll tell you how. The slides take the beta^k and fold it in, so you think of the whole update as one big least squares solution. Or you can keep the beta^k, solve another least squares problem, and then add the result to the beta^k that you have. It's the same thing; I will take the second route, so you have both options.

So, when we did least squares -- open a parenthesis on least squares -- we had y = X beta + epsilon, and our estimator was beta hat = (X^T X) inverse X^T y; that was just solving the first-order condition. Now look at our update: X transpose something times X, inverse, then X transpose something, times something. OK?
So this looks exactly like the least squares formula on the left board if W_k were equal to the identity matrix, meaning we don't see it, and if y were equal to y tilde^k minus mu tilde^k. The fact that the response variable is different is really not a problem: we just pretend that the response is y tilde minus mu tilde. When you call a software routine that does least squares for you, you just tell it what y is and what X is, and it does the computation; so you would just tell it that the y you want is this thing. Then we need to somehow incorporate those weights, and the question is: is that easy to do? The answer is yes, because there is a standard setup in which exactly this arises.

One thing that was very specific to what we did with least squares is that, at least when we did the inference, we assumed that epsilon was normal with mean 0 and covariance matrix equal to the identity. What if the covariance matrix is not the identity? Then your maximum likelihood estimator is not exactly least squares. For a general covariance matrix you get another solution, which involves the inverse of that covariance matrix. And if your covariance matrix, in particular, is diagonal -- which means the observations in this system of equations are still independent, but the variance can change from one observation to another -- then the model is called heteroscedastic: "hetero" means "not the same", and "scedastic" refers to scale. In the heteroscedastic case you get something slightly different, and it makes sense: if you know that some observations have much less variance than others, you might want to give them more weight.

Think of the usual picture: you have a cluster of points on one side and a few points on the other, and you fit a line. If you run plain least squares, you get some line through all of them. But if I tell you that on one side the variance is 100, meaning those points are actually really far from the true line, while on the other side the variance is 1, meaning those points are close to the line you're looking for, then the line you should be fitting is the one driven by the low-variance points: do not trust the observations that have a lot of variance. So you need to incorporate that somehow, and the way you do it is weighted least squares.

So we're going to open a parenthesis on weighted least squares. It's not a fundamental statistical question, but it's useful for us, because it is exactly what will spit out something that looks like our update, with this matrix W in there. So let's go back in time for a second.
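To keep the target in view during this detour, here is the correspondence being set up, consolidated from the boards above in my notation:

```latex
\begin{align*}
\mathbb{E}\big[\nabla^2 \ell_n(\beta)\,\big|\,X\big] &= -\,X^{\top} W X,
\qquad
\nabla \ell_n(\beta) = X^{\top} W\,(\tilde y - \tilde\mu),
\qquad
W = \mathrm{diag}\!\left(\frac{h'(x_i^{\top}\beta)}{g'(\mu_i)\,\phi}\right) \\
\beta^{(k+1)} &= \beta^{(k)} + \big(X^{\top} W_k X\big)^{-1} X^{\top} W_k\,\big(\tilde y^{(k)} - \tilde\mu^{(k)}\big)
\end{align*}
```

The correction term in the second line is exactly a weighted least squares solution with design X, response y tilde^(k) minus mu tilde^(k), and weights W_k, which is what the detour below establishes.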
Assume we're still covering least squares regression. Now I'm going to assume that y = X beta + epsilon, but this time epsilon is a multivariate Gaussian -- one coordinate per observation -- with mean 0 and a covariance matrix that I will write as W inverse, because W is the thing that's going to show up. This is the so-called heteroscedastic case. That's how it's spelled, and it's yet another name you can pick for your soccer team or a cappella group.

So let's actually compute the maximum likelihood estimator for this problem. What is the log-likelihood? Well, what is the density of a multivariate Gaussian with mean X beta and covariance matrix W inverse? It's of the form 1 over the square root of det(W inverse) times (2 pi) to the dimension, times the exponential of minus (x - X beta) transpose W (x - X beta) over 2 -- W being the inverse of W inverse. So it's the generic (x - mu) transpose Sigma inverse (x - mu) over 2. And if you want a sanity check, just take the one-dimensional case -- yes?

AUDIENCE: Is it x minus X beta or y?

PHILIPPE RIGOLLET: Well, if you want this to be y, then it's y, right? Sure -- yeah, maybe that's less confusing. So do the sanity check in dimension 1. What does it mean? This W is just 1 over sigma squared, so what you see in the exponent is (y - x beta) squared over 2 sigma squared, like we usually see it. The determinant of W inverse is just the single entry of the 1-by-1 matrix, which is sigma squared, and taking the square root together with the 2 pi I get sigma times square root of 2 pi, which is the usual normalization.

That normalization is not going to matter, because when I look at the log-likelihood as a function of beta -- I'm assuming W is known -- what I get is a constant, minus one half the log of det(W inverse) times (2 pi) to the n, which won't matter when I maximize; and then minus 1/2 times (y - X beta) transpose W (y - X beta). So to take the maximum of this, I have to take the minimum of that quadratic term: we need to compute the minimum over beta of (y - X beta) transpose W (y - X beta). And the solution that you get -- well, you can check it for yourself by taking the derivative, but the way you can see it quickly is by doing the following.
If you're lazy and you don't want to redo the entire derivation -- maybe I should keep that expression on the board. W is diagonal, and I'm going to assume that no variance is equal to 0 and no variance is equal to infinity, so that both W inverse and W have only positive entries on the diagonal. In particular, I can talk about the square root of W, which is just the diagonal matrix with the square roots on the diagonal. So I want to minimize over beta (y - X beta) transpose W (y - X beta), and I'm going to write W as square root of W times square root of W, which I can: if W is the diagonal matrix of w_1, ..., w_n, then square root of W is just the diagonal matrix of sqrt(w_1), ..., sqrt(w_n), with 0 elsewhere, and the usual matrix product of those two gives me back W.

Now I push one square root to one side and the other to the other side. Here I have a transpose, so I have to put it on the other side, but W is clearly symmetric and so is square root of W, so the transpose doesn't matter. And so what I'm left with is the minimum over beta of (square root of W times y, minus square root of W times X beta) transpose, times itself. But this is the same thing we've been doing before: this is a new y, call it y prime, and a new X, call it X prime, and this is just the least squares problem associated with response y prime and design matrix X prime. So I know the solution is (X prime transpose X prime) inverse X prime transpose y prime. Now I substitute back what X prime is in terms of X and what y prime is in terms of y, and that gives me exactly (X transpose square root W square root W X) inverse, times X transpose square root W, times square root W y. And that's exactly what I wanted: I'm left with (X transpose W X) inverse X transpose W y.

So that's a simple way to take the W into account. And you could actually do it with any weight matrix that is positive semi-definite, because you can talk about the square root of such a matrix: the square root of a matrix is just a matrix such that, when you multiply it by itself, it gives you back the original matrix.

So this was just a shortcut that consisted in saying: maybe I don't want to recompute the gradient of this quantity, set it equal to 0, and see what beta hat should be. Instead, I assume I already know how to solve the problem when the W is not there, and that's exactly what I did: I know the minimizer of the primed problem, and then I substitute my W back in. All right.
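As a quick numerical sanity check of this square-root trick, here is a minimal sketch in Python; the data, weights, and variable names are made up for illustration. It computes the weighted least squares estimator both from the closed form (X^T W X)^{-1} X^T W y and by running ordinary least squares on (sqrt(W) X, sqrt(W) y), and verifies that the two agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design with an intercept
beta_true = np.array([1.0, -2.0])
sigma = np.where(np.arange(n) < n // 2, 10.0, 1.0)       # heteroscedastic noise levels
y = X @ beta_true + sigma * rng.normal(size=n)

w = 1.0 / sigma**2            # weights = inverse variances, so diag(w) = Cov(epsilon)^{-1}
sqrt_w = np.sqrt(w)

# Closed form: (X' W X)^{-1} X' W y
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Same estimator via ordinary least squares on (sqrt(W) X, sqrt(W) y)
beta_sqrt, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * y, rcond=None)

print(np.allclose(beta_wls, beta_sqrt))   # True: the two computations agree
```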
So that was just the lazy computation. But again, if you don't like it, you can always take the gradient of the quadratic form directly. Yes?

AUDIENCE: Why is the solution written in the slides different?

PHILIPPE RIGOLLET: Because there's a mistake. Yeah, there's a mistake on the slides. How did I make that one? I'm actually trying to parse it back... I mean, it's clearly wrong, right? Oh, no, it's not... no, it is. It is clearly wrong, because if I put the identity here, the product is not even compatible -- the dimensions don't match. So it's wrong; there's just this extra parenthesis that I probably copy-pasted from some place. Since this is one of my latest slides, I'll just color it in white. But yeah, sorry, that parenthesis should not be there. Thank you.

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: So why not square root [INAUDIBLE]?

PHILIPPE RIGOLLET: Because I have two of them: one that comes from the X prime that's here, and one that comes from the y prime here. OK, so the solution -- let's write it somewhere legible -- the corrected formula is (X transpose W X) inverse X transpose W y. You just squeeze this W in there, and that's exactly the shape we had before: X transpose W X, inverse, times X transpose W times some y. And what I claim is that this is routinely implemented. As you can imagine, heteroscedastic linear regression is something very common, so every time you have a least squares routine, you also have a way to put in some weights. They don't have to be diagonal weights in general, but here that's all we need.

So on the slides, again, I took the beta^k and folded it in, so that there is only one least squares solution to formulate. But let's do it slightly differently here. What I'm going to do now is say: OK, let's feed it to a weighted least squares solver, with response y being y tilde^k minus mu tilde^k, design matrix being just X itself -- that doesn't change -- and the weights being the w^k I had here: w_i^k = h'(x_i^T beta^k) divided by g'(mu_i^k) times phi. If I solve this, it will spit out a solution, which I will call u hat^(k+1). And to get beta^(k+1), all I need to do is take beta^k plus u hat^(k+1). And that's because -- it may not be clear from here, but remember, I started from that expression over there: the weighted least squares problem I'm solving is exactly the correction term. That's what I call u hat^(k+1); I add it to beta^k, and that gives me beta^(k+1). So I just have this intermediate step, which is removed in the slides.

So then you repeat until convergence. What does it mean to repeat until convergence?

AUDIENCE: [INAUDIBLE]?
PHILIPPE RIGOLLET: Yeah, exactly. You just set some threshold. I promise you that this converges, so you know that at some point you're going to be essentially there -- you keep getting closer, but you're never exactly there -- and so you just say, OK, I want this accuracy. Actually, machine precision is a little strong, especially if you have, say, 10 observations to start with: you know your estimate will have some statistical error anyway, and that should guide the kind of numerical error you're willing to make. For example, a good rule of thumb is that, with n observations, if you require the L2 distance between two consecutive betas to be less than 1/n, you should be good enough. It doesn't have to be machine precision.

And it's clear how we do this, right? At every step I have to recompute a bunch of things. I have to recompute the weights; but to recompute the weights, I need not only the previous iterate but also how the previous iterate impacts my means. So at each step I have to recalculate mu_i^k by applying g inverse -- remember, mu_i^k is just g inverse of x_i^T beta^k. Then I use this to compute my weights, and I also use it to compute my response, since my y depends on g'(mu_i^k) as well. I feed that to my weighted least squares engine; it spits out the u hat^(k+1), which I add to my previous beta^k, and that gives me my new beta^(k+1).

So here's the pseudocode, if you want to take some time to parse it. All right. So here, again, the trick is not much. It just says: if you don't feel like implementing Fisher scoring, or inverting your Hessian at every step, then weighted least squares will do it for you automatically. It's just a numerical trick; there's nothing really statistical about it, except that the formula we have to solve at each step reminded us of least squares with some extra weights.

So, to conclude: what do we need to know to run this? We need X and y, and the link function. Why would we need the variance function? Actually, I'm not sure we need the variance function -- no, I don't know why I said that; you need phi, not the variance function. And where do you start? Clearly, if you start very close to your solution, you're going to do much better. For beta itself it's not clear what a good start is, but you can get a good idea of beta by first getting a good idea of mu, because mu is g inverse of x_i^T beta. So what you can do is set mu to be the actual observations you have, because that's the best guess you have for their expected value. And then you say: OK, once I have my mu, I know that my mu is tied to x_i^T beta through g.
So I can write g of mu and solve for beta using a least squares estimator: g(mu) should be of the form X beta, so once you have your mu, you pass it through g, and then you solve for the beta that you want, and that's the beta you initialize with.

And actually -- this was your question from last time -- as soon as I use the canonical link, Fisher scoring and Newton-Raphson are the same thing, because the Hessian is deterministic in that case. With the canonical link, h is the identity, which means its second derivative is 0, so the term we dropped goes away even without taking the expectation. Remember, the term that went away was of the form (y_i - mu_i) over phi times h''(x_i^T beta) -- the term whose conditional expectation we said was 0. But if h'' is already equal to 0, there is nothing left to kill: it was already 0. And that always happens with the canonical link, because h is (g composed with b') inverse, and the canonical link is g = (b') inverse, so the composition is the identity, and the second derivative of the identity map is 0.

OK. My screen says end of show, so we can take some questions.

AUDIENCE: I just wanted to clarify: what does it stand for -- iteratively...

PHILIPPE RIGOLLET: Reweighted least squares.

AUDIENCE: Reweighted least squares is an implementation of the Fisher scoring [INAUDIBLE]?

PHILIPPE RIGOLLET: It's an implementation that just makes calls to weighted least squares oracles. It's sometimes called an oracle: something you assume the machine can do easily for you. If you assume your machine is very good at multiplying by the inverse of a matrix, you might as well do Fisher scoring yourself; this is just a way to avoid doing that. And usually these routines are implemented -- I just said routinely -- and implemented very efficiently in statistical software, so this is going to be one of the fastest ways to carry out this step, especially for large-scale problems.

AUDIENCE: So the thing that computers can do well is the multiplication [INAUDIBLE]. What's the thing that computers can do fast, and what's the thing that [INAUDIBLE]?

PHILIPPE RIGOLLET: If you were to do this in the simplest possible way, your iteration for, say, Fisher scoring is just: multiply by the inverse of the Fisher information.

AUDIENCE: So finding that inverse is slow?

PHILIPPE RIGOLLET: Yeah, it takes a bit of time. Whereas, since you know you're going to multiply that inverse directly into something, you can solve the corresponding linear system instead, and plain matrix inversion is not as optimized as solving least squares. Actually, the way it's typically done under the hood is by doing some least squares anyway, so you might as well just do the least squares you like. And there's also less recalculation, right?
With Fisher scoring, you would have to recompute the entire Fisher information matrix at every step, whereas here you don't have to: you really just have to compute some vectors and the vector of weights. The Fisher information matrix is symmetric, and you would have to recompute all of its entries; here, the only things you update, if you think about it, are the diagonal elements of the weight matrix and the vectors that go with them. So that's much less for you to compute; the least squares routine does the rest for you, in a sense.

Any other questions? Yeah?

AUDIENCE: So if I have a data set [INAUDIBLE], then I can always try to model it with least squares, right?

PHILIPPE RIGOLLET: Yeah, you can.

AUDIENCE: And so this is like setting my weights equal to the identity, essentially, right?

PHILIPPE RIGOLLET: Well, not exactly, because the g also shows up in this correction that you have here. I'm not sure exactly what you mean by--

AUDIENCE: I'm just trying to say: are there ever situations where I'm trying to model a data set and I would want to pick my weights in a particular way?

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: OK.

PHILIPPE RIGOLLET: I mean--

AUDIENCE: [INAUDIBLE] example [INAUDIBLE].

PHILIPPE RIGOLLET: Well, there's the heteroscedastic case, for sure, if you're going to compute those things. And more generally, I don't think you should think of those only as weights; you should really think of them as matrices that you invert. Don't think of the weight matrix as diagonal -- think of it as a full matrix. When we wrote weighted least squares here, the W, I said, is diagonal, but the computations never really used that fact. What shows up there is just the inverse of your covariance matrix. So if you have data that's correlated, this is where it's going to show up.
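To tie the lecture together, here is a minimal sketch of the IRLS loop described above, written in Python and specialized at the end to Poisson regression with the canonical log link (so h is the identity and phi = 1). The function and variable names are my own, and the initialization nudge for zero counts is an ad hoc choice, not something prescribed in the lecture.

```python
import numpy as np

def irls_glm(X, y, g, g_inv, g_prime, h_prime, phi=1.0, tol=None, max_iter=100):
    """Iteratively reweighted least squares in the lecture's notation:
    mu_i = g^{-1}(x_i' beta), weights w_i = h'(x_i' beta) / (g'(mu_i) * phi),
    working response z_i = (y_i - mu_i) * g'(mu_i), and beta <- beta + WLS solution."""
    n = X.shape[0]
    if tol is None:
        tol = 1.0 / n                        # rule of thumb mentioned in the lecture
    # Initialize: guess mu ~ y, then solve g(mu) ~ X beta by ordinary least squares.
    mu0 = np.maximum(y.astype(float), 0.5)   # nudge zero counts so g is defined (Poisson-specific)
    beta, *_ = np.linalg.lstsq(X, g(mu0), rcond=None)
    for _ in range(max_iter):
        mu = g_inv(X @ beta)
        w = h_prime(X @ beta) / (g_prime(mu) * phi)   # diagonal of W_k
        z = (y - mu) * g_prime(mu)                    # y_tilde^k - mu_tilde^k
        sqrt_w = np.sqrt(w)
        # Weighted least squares via the sqrt(W) trick: u = (X' W X)^{-1} X' W z.
        u, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * z, rcond=None)
        beta = beta + u
        if np.linalg.norm(u) < tol:          # stop when consecutive betas are close
            break
    return beta

# Poisson regression, canonical log link:
# g(mu) = log(mu), g^{-1}(eta) = exp(eta), g'(mu) = 1/mu, h = identity, phi = 1.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = irls_glm(X, y, g=np.log, g_inv=np.exp,
                    g_prime=lambda m: 1.0 / m,
                    h_prime=lambda eta: np.ones_like(eta))
print(beta_hat)   # should land near beta_true
```

Because the link here is canonical, this loop coincides with Newton-Raphson, as noted at the end of the lecture.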